• Open

    Courses needed to become proficient in reinforcement learning
    Hi all, I'm new to reinforcement learning. I'd like to learn this type of machine learning training method fundamentally; that is, I want to initially be able to use "pen and paper" to work through simple reinforcement problems before using programming languages. My goal is to find and structure online courses (ideally open courses) that will teach me the material and concepts that I need. My big current issue right is that I don't know what pre-requisites I'm not proficient in. Pretend that I'm an undergraduate student beginning my freshman year. Any help is appreciated. submitted by /u/Glad-Amphibian-8636 [link] [comments]  ( 1 min )
    Teaching an AI How to KILL
    submitted by /u/Stochastic_Machine [link] [comments]
    How to train multi agents with PPO/DQN for playing Atari Game
    Hi, I would like to train two agents with either PPO or DQN, so that they are able to play a competitive Atari game like Pong or Slimevolleyball. I wanted to train them using Stable Baselines, but I could not find an example for training two agents. Then I found RLlib but this lib does not provide an example for the two games named before and I struggled with configuring a new environment. Is there any way I can train two agents with PPO or DQN in the gym environment using one of these Libraries? Thank you! submitted by /u/theresa_k96 [link] [comments]  ( 1 min )
    question about the recurrent
    I have a question about the recurrent. Under the recurrent setting, there is not the computation graph betweent step t and step t-1 , how to backprop gradient? https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/storage.py#L145 submitted by /u/Traditional-Ad4492 [link] [comments]  ( 1 min )
    Reference about training an agent on multiple environment
    Hello, I am searching good references on the practice of training one agent on many different tasks at the same time, does someone have recommendations ? :) Thanks! submitted by /u/krenast [link] [comments]  ( 1 min )
    Robbins-Monro approximation for expectation
    submitted by /u/promach [link] [comments]  ( 1 min )
  • Open

    How to animate still images - FILM: Frame Interpolation for Large Motion, a 5-minute paper summary by Casual GAN Papers
    Motion interpolation between two images surely sounds like an exciting task to solve. Not in the least, because it has many real-world applications, from framerate upsampling in TVs and gaming to image animations derived from near-duplicate images from the user’s gallery. Specifically, Fitsum Reda and the team at Google Research propose a model that can handle large scene motion, which is a common point of failure for existing methods. Additionally, existing methods often rely on multiple networks for depth or motion estimations, for which the training data is often hard to come by. FILM, on the contrary, learns directly from frames with a single multi-scale model. Last but not least, the results produced by FILM look quite a bit sharper and more visually appealing than those of existing alternatives. As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/268 Blog post: https://www.casualganpapers.com/motion-interpolation-image-animation-implicit-model/FILM-explained.html FILM arxiv / code (unofficial) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries!h submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    Hey Folks, Here is a really cool research by Deepmind researchers where they probe image-language transformers and propose SVO probes for verb understanding
    Fine-tuning is required for a range of functions performed by multimodal image–language transformers. Researchers worldwide are interested in whether these algorithms can detect verbs or merely use the nouns in a phrase. A dataset of image-sentence pairs with 447 verbs that are either visible or widely encountered in the pretraining data was compiled to perform this task. Researchers from DeepMind propose to evaluate the pretrained models in a zero-shot manner using this dataset. In comparison to other elements of speech, it is found that the pretrained models underperform more in scenarios that demand verb interpretation. An investigation into which verbs are the most difficult for these models is underway. You can read this full summary here But if you are interested in only paper then you check the paper here https://preview.redd.it/wvg7cmlxyek81.png?width=1548&format=png&auto=webp&s=00a33a8d30b1285780d13fce2d271586852d7736 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Like Futurama but...
    submitted by /u/AutoMeta [link] [comments]
    Pytorch Introduces ‘TorchRec’: A Python-based PyTorch Domain Library For Recommendation Systems (RecSys)
    Recommendation Systems (RecSys) are a big part of today’s production-ready AI, although you wouldn’t know it from glancing at Github. In contrast to domains like Vision and NLP, most of RecSys’ continuous discovery and development takes place behind closed doors. The field is far from democratized for academic researchers exploring these approaches or creating individualized user experiences. RecSys as a field is also defined by learning models over sparse and/or sequential events, which has a lot of overlap with other AI fields. Many of the approaches, particularly those for scalability and distributed execution, are portable. RecSys approaches account for a significant amount of worldwide AI investment; therefore, enclosing them prevents this money from moving into the larger AI field. TorchRec is a new PyTorch domain library for Recommendation Systems. This library includes standard sparsity and parallelism primitives, allowing researchers to create and implement cutting-edge customization models. CONTINUE READING Github: https://github.com/pytorch/torchrec https://preview.redd.it/u5mu339xkak81.png?width=1536&format=png&auto=webp&s=3f700fc09c9e3491827b9f4509f073a56bf84336 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
  • Open

    [D] How to animate still images - FILM: Frame Interpolation for Large Motion, a 5-minute paper summary by Casual GAN Papers
    Motion interpolation between two images surely sounds like an exciting task to solve. Not in the least, because it has many real-world applications, from framerate upsampling in TVs and gaming to image animations derived from near-duplicate images from the user’s gallery. Specifically, Fitsum Reda and the team at Google Research propose a model that can handle large scene motion, which is a common point of failure for existing methods. Additionally, existing methods often rely on multiple networks for depth or motion estimations, for which the training data is often hard to come by. FILM, on the contrary, learns directly from frames with a single multi-scale model. Last but not least, the results produced by FILM look quite a bit sharper and more visually appealing than those of existing alternatives. As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/268 Blog post: https://www.casualganpapers.com/motion-interpolation-image-animation-implicit-model/FILM-explained.html FILM arxiv / code (unofficial) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    [R][P] Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut + Hugging Face Spaces Gradio Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Beef up the workstation or move to the cloud?
    Hi there, I am new to the ML area and would like your feedback. I am at a turning point, either beefing up my workstation or using cloud providers like AWS, GCP, Azure or smaller providers like paperspace. Looking for advice on which way to go. If you are using the cloud providers what pitfalls to be aware of or other useful pointers. So far I have more seriously looked into the AWS sagemaker and it's a “bit of a learning curve” to get started with. The google colab and sagemaker studio lab yet again are too limited for my needs. Thanks submitted by /u/Academic_Arrak [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 132
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 131-140 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 131 Week 2 Week 1…  ( 1 min )
    [D] Terminology discussion: Does "model" refer to the raw or to the fully trained mathematical method?
    When looking into Machine Learning models, I came across something that confuses me: On the one hand "model" is referred to (e.g. by Google) as something you pick before you start the training process, meaning the raw mathematical method (e.g. linear regression). On the other hand "model" is referred to (e.g. by Google) as the fully trained mathematical method, so the mapping from input to output that you get after training (e.g. the result of conducting a linear regression on the training data). Is both correct? If not, how would you call the remaining thing from the two options above? submitted by /u/karlssonvomdach [link] [comments]  ( 1 min )
    [D] ML in the real-world : How did you solved a business problem using ML?
    Solving a business problem can be tricky at first glance, no one ever comes to you saying "I need a classifier to classify this and that" or "I'm want a recommendation system to help me do that". Have you any problem you have translated into ML-terms and solved it? How was the outcome? What data did you used? In which context your project was set in if any? What did you use to solve the problem (tool, algos...)? submitted by /u/charlesaten [link] [comments]  ( 1 min )
    [D] How satisfied are you with the current Interpretability methods for Neural Networks?
    In the past few years, papers on interpretability for neural networks have been an important part of top-tier ML/CV conferences. I would like to know how satisfied is the ML community with this progress. View Poll submitted by /u/aritipandu_san [link] [comments]  ( 1 min )
    [P] I made the kind of tutorial I wish someone had made for me when I first started trying to connect the math in research papers with code examples I found online
    submitted by /u/clam004 [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 2 min )
    Constant validation metrics [D]
    I am training a neural network. The learning curves for the training metrics change over time, but the validation metrics are always constant. Originally I thought it might be over fitting but I changed the learning rate and it is still constant. Any advice for fixing this issue would be appreciated. submitted by /u/Dog-eating-chicken [link] [comments]  ( 1 min )
    [D] evolutionary algorithms roadmap
    after watching this video i want to dive more deep into the topic and create such algorithms. questions: is there enough research on this topic so i could do on higher education? are these methods being used in AI/ML? how much is this methods explored? is it just taking off and not much research is done or they found equivalent methods in order to simulate such procedures and get the outcome? what things should i learn in order to get my hands into this topic? i took linear algebra and numerical analysis this semester and I'm trying to go toward more advanced topics (to be honest i just studied in past 2 years, had no job and didn't choose a topic to dive deep and specialize myself (some people do web and i hate development, also i like academia route) submitted by /u/sweetchocolotepie [link] [comments]  ( 1 min )
    [D] Favorite paper integrating Causality and Machine Learning
    Hi, For Causality-inspired machine learning and Machine-learning inspired causality research, I think it is difficult to understand what the state-of-the-art is for subproblems in these domains given the lack of transparent and unambiguously good benchmarks. I also haven't seen best paper awards for work in the above areas at least not in a major machine learning conference. I am curious for those familiar with this work, what are your favorite papers from one of or both areas even if it focuses on a relatively applied problem (i.e problems in economics). It would also be helpful if you could provide some rationale. submitted by /u/HybridRxN [link] [comments]  ( 1 min )
    [Discussion] Visualization for Machine Learning as a scrub
    I'm trying to build a small portfolio for personal projects as part of my college application, and I was wondering if there was a way to visualize the data in a sense? New to ML so not sure if its even necessary besides making it more complicated. submitted by /u/Risk-Personal [link] [comments]  ( 1 min )
    [D] Two steps model for noisy data
    Suppose you have a labeled dataset with noisy data. How can you model for that? I have found this very interesting reply from this question: https://www.reddit.com/r/MachineLearning/comments/dgnpcb/d_learning_with_noisy_data_but_perfect_labels/ which says: You model for it. You can encapsulate this "junk probability" in a prior. You can even do a kind of two stage model where you have a first stage to predict whether or not the observation is total garbage to isolate it from the observations presented to your more "informed" model. It really depends on the situation. In general, the strategy is to capture this kind of noise in your variance inference. You make a prediction on an observation, but you don't report it as a point estimate: you report it with a prediction interval. Wide intervals indicate low confidence predictions, which result from noisy data. In the two stage model, you might have two different intervals: one for the prediction that the observation was useful at all, and a second interval for what your prediction is given the assumption that the observation was in fact useful. Breaking up the variance in this way should give you a (potentially deceptively) tighter conditional variance, which is what you ultimately want (tighter intervals equals more "explained variance"). I kind of understand, but I was curious if there is any source I can read more about this or some paper in which this is applied. submitted by /u/macoit18 [link] [comments]  ( 1 min )
    [D] Growing your "brain trust" - how are you networking while WFH?
    You're as good as the five closest people in your network the saying goes...and this extended WFH period has slowly diminished my network. I'd love to hear your experiences building a network of supportive DS and engineering peers. submitted by /u/Shap177 [link] [comments]
    [D] Recent Paper / Lecture Recommendations on classical ML (not DL)
    Where I work, we have a bi-weekly research meeting where we discuss papers or watch lectures together, and soon it's my turn to come up with something. So I'm looking for new exciting work that might be relevant to us. It's a fintech company, and we mostly use tabular data with classical models (Logistic, boosting trees, etc.) If anything comes to mind, please let me know Thanks! submitted by /u/shgoren [link] [comments]  ( 1 min )
    [D] Meta AI Residency interview
    Hey everyone. I have my first interview for a AI Residency position at Meta soon. The information i have recieved both mention that i should be prepared for both normal SWE interview questions (Leetcode etc) and more machine learning foucsed interviews (calculating metrics, data cleaning etc.) If anyone has been through this process recently or something similiar i would appreciate it if you could share your experience or share some tips for how i should be preparing. Thanks in advance. submitted by /u/WholeReward9042 [link] [comments]  ( 1 min )
    [D] Running your computationally heavy code on a remote server
    I have a general question regarding running computationally heavy computer vision algorithms on remote servers. So far I have been working with smaller datasets which takes only a few hours to run and yet I have faced many situations where the process broke down due to a multitude of reasons. Maybe it was my internet connection or my VSCode got disconnected? I wonder how people set their computational environment so they can run a process without trouble for days? other than checkpoints, which methods are common practice? I would be glad to hear about your opinion submitted by /u/Substantial-Ad4518 [link] [comments]  ( 1 min )
    [R] A Program to Build E(N)-Equivariant Steerable CNNs - Link to a free online lecture by the author in comments
    submitted by /u/pinter69 [link] [comments]  ( 2 min )
    [P] I made live digit recognizer with machine learning and web technologies. (Live Demo in Comments...)
    submitted by /u/dusklight00 [link] [comments]  ( 2 min )
    [R][P] Why does TfidfVectorizer from Sklearn create a document-term matrix and not a term-document matrix?
    Why does TfidfVectorizer from Sklearn create a document-term matrix and not a term-document matrix? When we use TFIDF for Latent Semantic Analysis we begin with a term-document matrix, not a document-term matrix. submitted by /u/Arup15 [link] [comments]  ( 1 min )
    [D][R] People in the industry, what image resolution do you use?
    In computer vision, the most common datasets to benchmark the new method against is CIFAR and ImageNet, which have the resolution of 32x32 and 224x244 (generally) respectively. Some methods run on larger ImageNet resolution, say 384. That resolution is small, compared to the normal image we usually take with our phones (720p, 1080p). What image resolution do you use in industry? submitted by /u/AerysSk [link] [comments]  ( 2 min )
  • Open

    100 digits worth memorizing
    I was thinking today about how people memorize many digits of π, and how it would be much more practical to memorize a moderate amount of numbers to low precision. So suppose instead of memorizing 100 digits of π, you memorized 100 digits of other numbers. What might those numbers be? I decided to take […] 100 digits worth memorizing first appeared on John D. Cook.  ( 2 min )
  • Open

    Amazing Benefits of Big Data Analytics Career
    Big data is all over the place! and it has the ability to provide us with far more than we expect. Be it:  A hospital that gathers information in order to assess specific outcomes and improve outcomes.  To enhance the appeal, a business collects data from customers.  Users’ data is collected by businesses to improve… Read More »Amazing Benefits of Big Data Analytics Career The post Amazing Benefits of Big Data Analytics Career appeared first on Data Science Central.  ( 4 min )
    Approach to AGI: how can a “synthetic being” learn?
    Three inspirational articles from Nikko Ström about Symbolic AI [1], Melanie Mitchell about AI-based Understanding [2] and Yoshua Bengio about the Conscious Prior [3], have inspired me to write my own small contribution. Since some years ago I have been thinking of how our brain learns and its relationship to human being goal of approach… Read More »Approach to AGI: how can a “synthetic being” learn? The post Approach to AGI: how can a “synthetic being” learn? appeared first on Data Science Central.  ( 2 min )
  • Open

    Will learning how to program a neural network from scratch help me understand neural networks better?
    I tried pytorch and a bit of keras, and I just can't seem to comprehend it right now for some reason submitted by /u/Hananelroe [link] [comments]  ( 1 min )
    Need help with time series classification using transformers
    Hello everyone, I am following this link to get myself familiar with transformer networks. This example demonstrates a binary classification task about engines, calling them as faulty or non-faulty. I am currently working on classifying 9 different dynamic hand gestures. I have created my own dataset with mediapipe, extracting the 3D keypoint locations of my hands for these gestures for 30 frames. I have trained LSTM's and bidirectional LSTM's with attention layers in between, and they perform arguably well. ​ I have 9 classes and have created 90 samples for each one. Each keypoint on my hand has 3 features : normalized coordinates for (x,y,z) each hand has 21 keypoints, and I have two hands, which makes a total of 126 features at any given instance. I have gathered 30 frames of data for each sample. Thus, the shape of my entire input X without train and test split is (9*90 , 30 , 3*21*2) -> (810 , 30 , 126) ​ you can see that there are 30 rows and 126 columns for my X ​ Similarly, the shape of my output y is (810 , 9) where each row consists of 1's and 0's. ​ The outputs are stored like this, because the dataset is not randomized yet, the first 90 elements of y corresponds to a single gesture and similarly, the next 90 corresponds to another one. Whiel doing the train and test split, X and y orders are randomized. ​ Here is the randomized example for y_train ​ ​ ​ The example that I mentioned above works with a y vector that has a shape of (y,1) where y is either 0 or 1 because its a binary classification problem. When I feed my y matrix to this network, I get an error about shapes being incompatible. How should I go about this? When I convert this whole y matrix to a vector by using np.argmax to extract which index is maximum (1, and thus true label), I get such a vector (0,5,7,3,4,5...) as expected. When I try to train on this, loss never drops. Any help is greatly appreciated! I can provide any further detail in case it is needed. submitted by /u/ege6211 [link] [comments]  ( 2 min )
    Can we use Softmax activation function in hidden layers ?
    submitted by /u/Actual-Performer-832 [link] [comments]  ( 1 min )

  • Open

    [D] I have come up with a new importance sampling weighting method for active learning. How do I prove it is better than uniform sampling or entropy sampling?
    In active learning for posterior density estimation, two common baselines are uniform sampling, or sampling based on the most uncertain (ranked via entropy). How do I provide some theoretical justification for a new sampling scheme? submitted by /u/PK_thundr [link] [comments]  ( 1 min )
    [D] Paper Explained - Can Wikipedia Help Offline Reinforcement Learning? (Full Video Walkthrough)
    https://youtu.be/XHGh19Hbx48 Transformers have come to overtake many domain-targeted custom models in a wide variety of fields, such as Natural Language Processing, Computer Vision, Generative Modelling, and recently also Reinforcement Learning. This paper looks at the Decision Transformer and shows that, surprisingly, pre-training the model on a language-modelling task significantly boosts its performance on Offline Reinforcement Learning. The resulting model achieves higher scores, can get away with less parameters, and exhibits superior scaling properties. This raises many questions about the fundamental connection between the domains of language and RL. ​ OUTLINE: 0:00 - Intro 1:35 - Paper Overview 7:35 - Offline Reinforcement Learning as Sequence Modelling 12:00 - Input Embedding Alignment & other additions 16:50 - Main experimental results 20:45 - Analysis of the attention patterns across models 32:25 - More experimental results (scaling properties, ablations, etc.) 37:30 - Final thoughts ​ Paper: https://arxiv.org/abs/2201.12122 Code: https://github.com/machelreid/can-wikipedia-help-offline-rl My Video on Decision Transformer: https://youtu.be/-buULmf7dec submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Quantum Mechanics in Machine Learning
    I've been looking up ideas and models in ML, born out of inspirations from Quantum Mechanics phenomenons. I'm specifically talking about ideas like Bolzmann machines which are based on another physical phenomenon, the Ising model. I must be using the wrong terms and keywords because all my searches bring up research related to Quantum Neural Networks. I was wondering if anyone here is pre-acquiated with such works and would be kind enough to share them. submitted by /u/mfarahmand98 [link] [comments]  ( 1 min )
    [N] MIT Media lab founder describing the importance of personal data
    https://www.youtube.com/watch?v=sKhCKWR8OwI&list=PLhCQIYxdniNtEto-CO33yVb0ah7yKOVV5&index=5 submitted by /u/Corddot [link] [comments]
    [R][P] Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework + VQA Hugging Face Spaces Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Different ways to consolidate ML concepts/techniques/research/
    Aside from practical learning through hands on building ML models, etc., what memory techniques and note-taking methods do you go about retaining the concepts/techniques/research in ML and AI and the devices that you use to help consolidate into long term memory? I'm just seeing if there are other ways of retention specific to ML other than the traditional learning methodologies like spaced repetition, active recall, etc. that have helped you understand it and articulate it more eloquently in conversation. submitted by /u/cloudlessjedi [link] [comments]  ( 1 min )
    [D] Anomaly detection using only benign examples
    Dear fellow redditors. I recently started developing an anomaly detection system using only benign examples (think benign credit card transactions) and don't want to use known fraudulent transactions in order to be able to detect zero-day fraudulent activity for which we don't have data examples yet. In modern literature I've seen these approaches being used : - Local Outlier Factor - Autoencoders - Bayesian Networks - Random Cut Forest - Elliptic Envelope - Fuzzy Logic - Genetic Algorithms What are your thoughts and personal experience with zero-day anomaly detection ? And which techniques would you recommend, including some which may not be listed here. Thank you. submitted by /u/Meow7788 [link] [comments]  ( 1 min )
    [D] Using Gumbel Softmax for Discrete Action and Label selection. Is it the best option?
    If I want to train a classifier that selects from a set of let's say 5-labels and depending on a particular label (if-condition) there's another representation/embedding that is trained to predict another label and I want to train this whole model in an end to end setting; I would require something like Gumbel Softmax to sample the label for the first 5-label classifier right? Is this the only method? Actually, the problem is that there are 5 labels, and 1 of the labels has let's say 5 more sub types! So, I think I can just directly train using normal softmax? and based on the prediction like with teacher-forcing I can see if the sub-type label is predicted and if it is I just pass the input to the second classifier! It should work properly as well.. and still be differentiable! But, could gumbel softmax be used in a setting like this? and help? If someone has information on this, can you please point me towards other models/approaches/techniques for selection and yet having an end-to-end model? submitted by /u/goobuddy [link] [comments]  ( 2 min )
    [P] How to predict price.
    Hello, i want to use ml code to predict the patrolprice at my local patrol station. I have a dataset with this columns: variety | price | date | time I want to predict the price for example the next week, or next day. Is this possible to solve with h2o automl or how you would solve it? Do you have a tutorial for me? submitted by /u/Icy_Collection_1951 [link] [comments]  ( 1 min )
    [Announcement] AMA Friday morning EST (4th March) with Sebastian Raschka author of Machine Learning with Pytorch and SciKit-learn
    AMA friday morning EST with Sebastian Raschka /u/seraschka author of Machine Learning with Pytorch and Scikit-Learn Book It is on goodreads here https://www.goodreads.com/book/show/60098440-machine-learning-with-pytorch-and-scikit-learn?from_search=true&from_srp=true&qid=MNIHuvctFr&rank=1 And a link to the code and such here https://www.reddit.com/r/learnmachinelearning/comments/t1gqqe/machine_learning_with_pytorch_and_scikitlearn_book/ ​ Ask him questions about his new book, academic research, or his job at http://Grid.ai submitted by /u/cavedave [link] [comments]  ( 1 min )
    [R] Cloning a musical instrument from 16 seconds of audio (WIP)
    submitted by /u/More_Return_1166 [link] [comments]  ( 1 min )
    [R] What are some Text Similarity methods?
    So, first post here. I have been working with ML and Data Science for a year now. I used to work with classical Machine Learning majorly (Trees and Linear Regression). I recently shifted to a new job with focus on NLP and I've been given a task to optimise our product. I can't go and explain everything but here is brief and where I am stuck. The task is we're trying to have a field in our portal in which when someone posts a word(skill like Python, Management etc) we should get reccomended with a similar skill from our Data which we have. So, basically Text Similarity. Where, I am stuck is. I have looked at multiple papers and they majorly focus on sentence whereas I only have singular words. Two-Three methods I'm trying are 1. sense2vec 2. word2vec (which is not much useful) 3. Tensorflow Universal Sentence Encoder Would appreciate someone who can help me lead the way. submitted by /u/jeevanshud [link] [comments]  ( 2 min )
    [D] Lecture on Concise Slides
    I have been asked to make a short video series of the concise slides I am making. I would really love some guidance on this before I spend a lot of time making videos. Can you please let me know your opinions on this poll? If you haven't seen them yet, please check them out here: https://www.reddit.com/r/MachineLearningDervs/ Thank you for helping me :) View Poll submitted by /u/Ok_Can2425 [link] [comments]  ( 1 min )
    [R] I need help in identifying which requirements are needed to create human-centred AI?
    Hi everyone, I am currently studying my PhD in software engineering and my topic is on identifying Requirements Engineering techniques for AI systems. I am in the process of conducting a survey to investigate how current practices manage requirements when building AI systems. If you have worked as a Data Scientist, Machine Learning Specialist or a Software Engineer working with an AI component, I would really appreciate your assistance in completing my survey. We have received Deakin University’s Ethics approval (reference number: SEBE-2021-41- MOD01). This is an anonymous survey and would take approximately 10 to 15 minuets of your time. If you are interested in participating, please fill out our questionnaire at: https://researchsurveys.deakin.edu.au/jfe/form/SV_dd6vdkrNMUr8pfg submitted by /u/Positive-Object-7296 [link] [comments]  ( 1 min )
    [D] Model Extraction Attack
    Does anyone have experience with data-free model extraction (training a duplicate model without using the original data), especially for video models? submitted by /u/One-Yogurt7320 [link] [comments]
    [D] Self-serve for users
    How many of your models are actually deployed in automated environments, and what proportion is used locally that you then send out as a report/csv/dashboard etc.? I see a lot of conversations here that talk about production pipelines and deployments but I work with teams that do tons of analysis, especially for smaller projects, that are done locally. DS teams end up working on business requests that involve downloading files, building some analysis and then sending results out as csvs or dashboards. Is that true for you as well? Are there tools you use to publish non production models to business users (that can't code) to use on a self-serve basis? submitted by /u/dmart89 [link] [comments]  ( 1 min )
    [D] Paper Explained – How Do Vision Transformers Work?
    https://youtu.be/dOwRXpSSc8E In this video, we present the lessons of the recent "How do Vision Transformers work?" paper: https://arxiv.org/abs/2202.06709v1 Multi-head self-attention (MSA) and convolutions are complementary, or so it turns out. So what are the differences between MSA and Convs? Outline: 00:00 Transformers vs ConvNets 01:04 Sponsor: Weights & Biases 02:21 Convolutions explained in a nutshell 03:35 Multi-Head Self-Attention explained 06:46 Why we thought that MSA is cool 09:56 Paper insights 15:26 MSA vs. Convs (more insight) 16:07 Low-pass filters (MSA) and high-pass filters (Convs) submitted by /u/AICoffeeBreak [link] [comments]  ( 1 min )
    Machine learning in games [P]
    Hi. One month ago, I created my first game(for my senior project) in Lua using the love2d framework. This game is an icy tower clone. I planned to make use of AI so that eventually a software agent will play the game for me. AI is a very big topic and I know only the basics. I am in my last semester as a CS student and I do not have much time in hand. Any tips concerning how can I achieve what I want in a reasonable amount of time? submitted by /u/Joe2001r [link] [comments]  ( 1 min )
    [D] Anyone able to reproduce cpn detection results from PoseFormer or VideoPose3D?
    See links attached for the two specifically the cpn results: Poseformer - https://github.com/zczcwh/PoseFormer VideoPose - https://github.com/facebookresearch/VideoPose3D submitted by /u/AbjectDrink3276 [link] [comments]
    [D] what percent of top conference papers fudge results?
    I’ve noticed recently a lot of top conference papers with available repos are not reproducible. What have been everyone’s experience with this and are there any studies estimating what percent of papers fudge results? submitted by /u/AbjectDrink3276 [link] [comments]  ( 3 min )
  • Open

    From "All models are wrong, but some are useful" to "All models are biased, but some are usefully biased"
    I guess most AI practitioners are familiar with the quote: "All models are wrong, but some are useful" George E. P. Box Which is used to convey the idea that modeling reality is virtually impossible, and that one should try instead to model a bounded problem with a clear purpose, even if it fails to represent reality faithfully. To inspire my students, I came up with the following quote: "All models are biased, but some are usefully biased" In the context of XAI, my goal is to convey the idea that machine learning is all about learning biases, and that is only our human perception which eventually decides which biases are appropriate and acceptable practice, and which are wrong and should be avoided. IMO this is a key lesson for anyone working in ML bias detection, analysis and suppression. So, does the quote help in any way? Or is it too obvious and/or arrogant? ​ View Poll submitted by /u/Phastolf [link] [comments]  ( 1 min )
    Hey Friends, Here is Cool Research from Google AI Where They Proposed a Deep Learning–based Algorithm to Improve Medical Ventilator Control for Invasive Ventilation
    Mechanical ventilators assist patients in breathing undergoing surgery or who cannot breathe independently due to acute disease. The patients are A hollow tube (artificial airway) is inserted into the patient’s mouth and down into their primary airway, or trachea, to connect them to the ventilator. The ventilator follows a clinician-prescribed breathing waveform based on a respiratory measurement from the patient(e.g., airway pressure, tidal volume). Therefore, it’s a challenging task and needs systems that are both robust to variances or changes in the lungs of patients and adherent to the ideal waveform to avoid injury. That is one of the reasons why they are monitored closely by highly experienced professionals ensuring that their performance meets the needs of the patients and does not cause lung damage. A recent Google study proposes a neural network-based controller that improves medical ventilator control by balancing these properties, reducing the risk for injury to patients’ lungs. The new control algorithm detects airway pressure and computes necessary airflow modifications to meet prescribed values using signals from an artificial lung. This method requires less manual intervention from clinicians and shows great robustness and performance compared to other systems while requiring. You can read my full summary of this paper here If you want to read the paper, you can check it here Please let me know in the comments what do you think about this research. https://preview.redd.it/61vqjqpza8k81.png?width=1410&format=png&auto=webp&s=3f4eb60fd0ab34bb934d1b36281aac5008fed1e3 submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    How far are we off the AI singularity? Where are we now what happens next and whats the process that "will" lead to the singularity?
    How far are we off the AI singularity? Where are we now what happens next and whats the process that "will" lead to the singularity? submitted by /u/Fun_Sun_97 [link] [comments]  ( 1 min )
    AI researchers: how do you recruit participants for your research?
    I'm working on game-playing AI as part of my 4th year dissertation, and am now required to test said AI through user evaluation. The problem is that I haven't thought about how to recruit participants for my study. It'd be really interesting to know how people who have experience with this go about recruiting participants. submitted by /u/HaresM [link] [comments]  ( 2 min )
    How can A.I. contribute to the Environment?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Grammar, Pronunciation & Background Noise Correction with Perceiver IO
    submitted by /u/OnlyProggingForFun [link] [comments]
    Using A.I. to Determine the True Color of the Environment of Mars (70 A.I. Colored Photos) | The Newest Raw Photographs of Mars Will be Produced and Uploaded to a New Article Every Few Days.
    submitted by /u/Alehti [link] [comments]  ( 1 min )
    AI art generation challenge
    I created a webpage "challenge" for creating AI images with a set of base prompt: https://ai.botbox.dev/ai-image-generation-challenge/ submitted by /u/Recent_Coffee_2551 [link] [comments]
    AI Tutorials
    I recently made this website to show how to run various AIs locally (currently only one), does anyone have any suggestions on how I could improve it? https://ai.botbox.dev/ submitted by /u/Recent_Coffee_2551 [link] [comments]
  • Open

    approximation of Optimal Action equation for continuous time domain
    In section 3.4 of QLBS: Q-Learner in the Black-Scholes(-Merton) Worlds , how to derive equation 40 from equation 39 ? https://preview.redd.it/1w8qjqv4f6k81.png?width=802&format=png&auto=webp&s=d5f1736e64cbeec9db584d876a7db83bc8be4a66 submitted by /u/promach [link] [comments]  ( 1 min )
    Updating actor/critic after each episode
    Apologies if this has been asked before but I couldn't find anything. Is there a major issue with updating the weights of the actor/critic after the completion of an episode, as opposed to after every time step? I get that you're using a slightly 'out of date' value/advantage function for your actor policy updates, but is this actually a big problem? I ask as obviously deferring updating until the end of an episode runs significantly faster. My agent for playing Pacman using this method currently does not learn very well but I figure this may just be down to sensitivity to the hyperparameters. submitted by /u/claybecray [link] [comments]  ( 1 min )
    Why doesn't RL from pixel use pre-trained backbone?
    So I have read about papers like DrQ, RAD, etc., that aim to improve sample efficiency for RL from pixels by using data augmentation. My question is like the title, why don't people use a pre-trained backbone as the feature extractor, like VGG, ResNet, etc.? Intuitively, this should also improve sample efficiency. I haven't tried to do this, and a quick search did not lead me to any resource about this, so please let me know if it works for RL or not, and why so. submitted by /u/dx_rd_to_DX [link] [comments]  ( 2 min )
  • Open

    Interpretable Unsupervised Diversity Denoising and Artefact Removal. (arXiv:2104.01374v2 [eess.IV] UPDATED)
    Image denoising and artefact removal are complex inverse problems admitting multiple valid solutions. Unsupervised diversity restoration, that is, obtaining a diverse set of possible restorations given a corrupted image, is important for ambiguity removal in many applications such as microscopy where paired data for supervised training are often unobtainable. In real world applications, imaging noise and artefacts are typically hard to model, leading to unsatisfactory performance of existing unsupervised approaches. This work presents an interpretable approach for unsupervised and diverse image restoration. To this end, we introduce a capable architecture called Hierarchical DivNoising (HDN) based on hierarchical Variational Autoencoder. We show that HDN learns an interpretable multi-scale representation of artefacts and we leverage this interpretability to remove imaging artefacts commonly occurring in microscopy data. Our method achieves state-of-the-art results on twelve benchmark image denoising datasets while providing access to a whole distribution of sensibly restored solutions. Additionally, we demonstrate on three real microscopy datasets that HDN removes artefacts without supervision, being the first method capable of doing so while generating multiple plausible restorations all consistent with the given corrupted image.  ( 2 min )
    Physics perception in sloshing scenes with guaranteed thermodynamic consistency. (arXiv:2106.13301v3 [cs.CV] UPDATED)
    Physics perception very often faces the problem that only limited data or partial measurements on the scene are available. In this work, we propose a strategy to learn the full state of sloshing liquids from measurements of the free surface. Our approach is based on recurrent neural networks (RNN) that project the limited information available to a reduced-order manifold so as to not only reconstruct the unknown information, but also to be capable of performing fluid reasoning about future scenarios in real time. To obtain physically consistent predictions, we train deep neural networks on the reduced-order manifold that, through the employ of inductive biases, ensure the fulfillment of the principles of thermodynamics. RNNs learn from history the required hidden information to correlate the limited information with the latent space where the simulation occurs. Finally, a decoder returns data back to the high-dimensional manifold, so as to provide the user with insightful information in the form of augmented reality. This algorithm is connected to a computer vision system to test the performance of the proposed methodology with real information, resulting in a system capable of understanding and predicting future states of the observed fluid in real-time.  ( 2 min )
    Learning from distinctive candidates to optimize reduced-precision convolution program on tensor cores. (arXiv:2202.06819v2 [cs.LG] UPDATED)
    Convolution is one of the fundamental operations of deep neural networks with demanding matrix computation. In a graphic processing unit (GPU), Tensor Core is a specialized matrix processing hardware equipped with reduced-precision matrix-multiply-accumulate (MMA) instructions to increase throughput. However, it is challenging to achieve optimal performance since the best scheduling of MMA instructions varies for different convolution sizes. In particular, reduced-precision MMA requires many elements grouped as a matrix operand, seriously limiting data reuse and imposing packing and layout overhead on the schedule. This work proposes an automatic scheduling method of reduced-precision MMA for convolution operation. In this method, we devise a search space that explores the thread tile and warp sizes to increase the data reuse despite a large matrix operand of reduced-precision MMA. The search space also includes options of register-level packing and layout optimization to lesson overhead of handling reduced-precision data. Finally, we propose a search algorithm to find the best schedule by learning from the distinctive candidates. This reduced-precision MMA optimization method is evaluated on convolution operations of popular neural networks to demonstrate substantial speedup on Tensor Core compared to the state of the arts with shortened search time.  ( 2 min )
    Score-based diffusion models for accelerated MRI. (arXiv:2110.05243v2 [eess.IV] UPDATED)
    Score-based diffusion models provide a powerful way to model images using the gradient of the data distribution. Leveraging the learned score function as a prior, here we introduce a way to sample data from a conditional distribution given the measurements, such that the model can be readily used for solving inverse problems in imaging, especially for accelerated MRI. In short, we train a continuous time-dependent score function with denoising score matching. Then, at the inference stage, we iterate between numerical SDE solver and data consistency projection step to achieve reconstruction. Our model requires magnitude images only for training, and yet is able to reconstruct complex-valued data, and even extends to parallel imaging. The proposed method is agnostic to sub-sampling patterns, and can be used with any sampling schemes. Also, due to its generative nature, our approach can quantify uncertainty, which is not possible with standard regression settings. On top of all the advantages, our method also has very strong performance, even beating the models trained with full supervision. With extensive experiments, we verify the superiority of our method in terms of quality and practicality.  ( 2 min )
    Enterprise-Scale Search: Accelerating Inference for Sparse Extreme Multi-Label Ranking Trees. (arXiv:2106.02697v3 [cs.LG] UPDATED)
    Tree-based models underpin many modern semantic search engines and recommender systems due to their sub-linear inference times. In industrial applications, these models operate at extreme scales, where every bit of performance is critical. Memory constraints at extreme scales also require that models be sparse, hence tree-based models are often back-ended by sparse matrix algebra routines. However, there are currently no sparse matrix techniques specifically designed for the sparsity structure one encounters in tree-based models for extreme multi-label ranking/classification (XMR/XMC) problems. To address this issue, we present the masked sparse chunk multiplication (MSCM) technique, a sparse matrix technique specifically tailored to XMR trees. MSCM is easy to implement, embarrassingly parallelizable, and offers a significant performance boost to any existing tree inference pipeline at no cost. We perform a comprehensive study of MSCM applied to several different sparse inference schemes and benchmark our methods on a general purpose extreme multi-label ranking framework. We observe that MSCM gives consistently dramatic speedups across both the online and batch inference settings, single- and multi-threaded settings, and on many different tree models and datasets. To demonstrate its utility in industrial applications, we apply MSCM to an enterprise-scale semantic product search problem with 100 million products and achieve sub-millisecond latency of 0.88 ms per query on a single thread -- an 8x reduction in latency over vanilla inference techniques. The MSCM technique requires absolutely no sacrifices to model accuracy as it gives exactly the same results as standard sparse matrix techniques. Therefore, we believe that MSCM will enable users of XMR trees to save a substantial amount of compute resources in their inference pipelines at very little cost.  ( 2 min )
    Formalizing the Generalization-Forgetting Trade-off in Continual Learning. (arXiv:2109.14035v3 [cs.LG] UPDATED)
    We formulate the continual learning (CL) problem via dynamic programming and model the trade-off between catastrophic forgetting and generalization as a two-player sequential game. In this approach, player 1 maximizes the cost due to lack of generalization whereas player 2 minimizes the cost due to catastrophic forgetting. We show theoretically that a balance point between the two players exists for each task and that this point is stable (once the balance is achieved, the two players stay at the balance point). Next, we introduce balanced continual learning (BCL), which is designed to attain balance between generalization and forgetting and empirically demonstrate that BCL is comparable to or better than the state of the art.  ( 2 min )
    HODA: Hardness-Oriented Detection of Model Extraction Attacks. (arXiv:2106.11424v2 [cs.LG] UPDATED)
    Model Extraction attacks exploit the target model's prediction API to create a surrogate model in order to steal or reconnoiter the functionality of the target model in the black-box setting. Several recent studies have shown that a data-limited adversary who has no or limited access to the samples from the target model's training data distribution can use synthesis or semantically similar samples to conduct model extraction attacks. In this paper, we define the hardness degree of a sample using the concept of learning difficulty. The hardness degree of a sample depends on the epoch number that the predicted label of that sample converges. We investigate the hardness degree of samples and demonstrate that the hardness degree histogram of a data-limited adversary's sample sequences is distinguishable from the hardness degree histogram of benign users' samples sequences. We propose Hardness-Oriented Detection Approach (HODA) to detect the sample sequences of model extraction attacks. The results demonstrate that HODA can detect the sample sequences of model extraction attacks with a high success rate by only monitoring 100 samples of them, and it outperforms all previous model extraction detection methods.  ( 2 min )
    Object-aware Monocular Depth Prediction with Instance Convolutions. (arXiv:2112.01521v2 [cs.CV] UPDATED)
    With the advent of deep learning, estimating depth from a single RGB image has recently received a lot of attention, being capable of empowering many different applications ranging from path planning for robotics to computational cinematography. Nevertheless, while the depth maps are in their entirety fairly reliable, the estimates around object discontinuities are still far from satisfactory. This can be contributed to the fact that the convolutional operator naturally aggregates features across object discontinuities, resulting in smooth transitions rather than clear boundaries. Therefore, in order to circumvent this issue, we propose a novel convolutional operator which is explicitly tailored to avoid feature aggregation of different object parts. In particular, our method is based on estimating per-part depth values by means of superpixels. The proposed convolutional operator, which we dub "Instance Convolution", then only considers each object part individually on the basis of the estimated superpixels. Our evaluation with respect to the NYUv2 as well as the iBims dataset clearly demonstrates the superiority of Instance Convolutions over the classical convolution at estimating depth around occlusion boundaries, while producing comparable results elsewhere. Code will be made publicly available upon acceptance.  ( 2 min )
    Time Series Generation with Masked Autoencoder. (arXiv:2201.07006v2 [cs.LG] UPDATED)
    This paper shows that masked autoencoder with interpolator (InterpoMAE) is a scalable self-supervised generative model for time series. InterpoMAE masks random patches of the input time series and recover the missing patches in latent space. The core design is that no mask token is used. InterpoMAE disentangles missing patch recovery from the decoder. An interpolator directly recovers the missing patches without mask tokens. This design helps InterpoMAE to consistently and significantly outperforms state-of-the-art (SoTA) benchmarks in time series generation. Our approach also shows promising scaling behaviour in various downstream tasks such as time series classification, prediction and imputation. As the only self-supervised generative model for time series, InterpoMAE is the first in literature that allows explicit management on the synthetic data. Time series generation may follow the trajectory of self-supervised learning now.  ( 2 min )
    A Survey on Interpretable Reinforcement Learning. (arXiv:2112.13112v2 [cs.LG] UPDATED)
    Although deep reinforcement learning has become a promising machine learning approach for sequential decision-making problems, it is still not mature enough for high-stake domains such as autonomous driving or medical applications. In such contexts, a learned policy needs for instance to be interpretable, so that it can be inspected before any deployment (e.g., for safety and verifiability reasons). This survey provides an overview of various approaches to achieve higher interpretability in reinforcement learning (RL). To that aim, we distinguish interpretability (as a property of a model) and explainability (as a post-hoc operation, with the intervention of a proxy) and discuss them in the context of RL with an emphasis on the former notion. In particular, we argue that interpretable RL may embrace different facets: interpretable inputs, interpretable (transition/reward) models, and interpretable decision-making. Based on this scheme, we summarize and analyze recent work related to interpretable RL with an emphasis on papers published in the past 10 years. We also discuss briefly some related research areas and point to some potential promising research directions.  ( 2 min )
    Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations. (arXiv:1811.01908v5 [cs.LG] UPDATED)
    This work explores non-negative low-rank matrix factorization based on regularized Poisson models (PF or "Poisson factorization" for short) for recommender systems with implicit-feedback data. The properties of Poisson likelihood allow a shortcut for very fast computations over zero-valued inputs, and oftentimes results in very sparse factors for both users and items. Compared to HPF (a popular Bayesian formulation of the problem with hierarchical priors), the frequentist optimization-based approach presented here tends to produce better top-N recommendations with significantly shorter fitting times, on top of having sparse solutions.  ( 2 min )
    Probing Predictions on OOD Images via Nearest Categories. (arXiv:2011.08485v4 [cs.LG] UPDATED)
    We study out-of-distribution (OOD) prediction behavior of neural networks when they classify images from unseen classes or corrupted images. To probe the OOD behavior, we introduce a new measure, nearest category generalization (NCG), where we compute the fraction of OOD inputs that are classified with the same label as their nearest neighbor in the training set. Our motivation stems from understanding the prediction patterns of adversarially robust networks, since previous work has identified unexpected consequences of training to be robust to norm-bounded perturbations. We find that robust networks have consistently higher NCG accuracy than natural training, even when the OOD data is much farther away than the robustness radius. This implies that the local regularization of robust training has a significant impact on the network's decision regions. We replicate our findings using many datasets, comparing new and existing training methods. Overall, adversarially robust networks resemble a nearest neighbor classifier when it comes to OOD data. Code available at https://github.com/yangarbiter/nearest-category-generalization.  ( 2 min )
    Solving optimization problems with Blackwell approachability. (arXiv:2202.12277v1 [math.OC])
    We introduce the Conic Blackwell Algorithm$^+$ (CBA$^+$) regret minimizer, a new parameter- and scale-free regret minimizer for general convex sets. CBA$^+$ is based on Blackwell approachability and attains $O(\sqrt{T})$ regret. We show how to efficiently instantiate CBA$^+$ for many decision sets of interest, including the simplex, $\ell_{p}$ norm balls, and ellipsoidal confidence regions in the simplex. Based on CBA$^+$, we introduce SP-CBA$^+$, a new parameter-free algorithm for solving convex-concave saddle-point problems, which achieves a $O(1/\sqrt{T})$ ergodic rate of convergence. In our simulations, we demonstrate the wide applicability of SP-CBA$^+$ on several standard saddle-point problems, including matrix games, extensive-form games, distributionally robust logistic regression, and Markov decision processes. In each setting, SP-CBA$^+$ achieves state-of-the-art numerical performance, and outperforms classical methods, without the need for any choice of step sizes or other algorithmic parameters.  ( 2 min )
    Symbolic Brittleness in Sequence Models: on Systematic Generalization in Symbolic Mathematics. (arXiv:2109.13986v2 [cs.LG] UPDATED)
    Neural sequence models trained with maximum likelihood estimation have led to breakthroughs in many tasks, where success is defined by the gap between training and test performance. However, their ability to achieve stronger forms of generalization remains unclear. We consider the problem of symbolic mathematical integration, as it requires generalizing systematically beyond the test set. We develop a methodology for evaluating generalization that takes advantage of the problem domain's structure and access to a verifier. Despite promising in-distribution performance of sequence-to-sequence models in this domain, we demonstrate challenges in achieving robustness, compositionality, and out-of-distribution generalization, through both carefully constructed manual test suites and a genetic algorithm that automatically finds large collections of failures in a controllable manner. Our investigation highlights the difficulty of generalizing well with the predominant modeling and learning approach, and the importance of evaluating beyond the test set, across different aspects of generalization.  ( 2 min )
    Real-world challenges for multi-agent reinforcement learning in grid-interactive buildings. (arXiv:2112.06127v2 [cs.LG] UPDATED)
    Building upon prior research that highlighted the need for standardizing environments for building control research, and inspired by recently introduced challenges for real life reinforcement learning control, here we propose a non-exhaustive set of nine real world challenges for reinforcement learning control in grid-interactive buildings. We argue that research in this area should be expressed in this framework in addition to providing a standardized environment for repeatability. Advanced controllers such as model predictive control and reinforcement learning (RL) control have both advantages and disadvantages that prevent them from being implemented in real world problems. Comparisons between the two are rare, and often biased. By focusing on the challenges, we can investigate the performance of the controllers under a variety of situations and generate a fair comparison. As a demonstration, we implement the offline learning challenge in CityLearn and study the impact of different levels of domain knowledge and complexity of RL algorithms. We show that the sequence of operations utilized in a rule based controller (RBC) used for offline training affects the performance of the RL agents when evaluated on a set of four energy flexibility metrics. Longer offline learning from an optimized RBC leads to improved performance in the long run. RL agents that learn from a simplified RBC risk poorer performance as the offline learning period increases. We also observe no impact on performance from information sharing amongst agents. We call for a more interdisciplinary effort of the research community to address the real world challenges, and unlock the potential of grid-interactive building  ( 2 min )
    Intelligent Zero Trust Architecture for 5G/6G Networks: Principles, Challenges, and the Role of Machine Learning in the context of O-RAN. (arXiv:2105.01478v2 [cs.NI] UPDATED)
    In this position paper, we discuss the critical need for integrating zero trust (ZT) principles into next-generation communication networks (5G/6G). We highlight the challenges and introduce the concept of an intelligent zero trust architecture (i-ZTA) as a security framework in 5G/6G networks with untrusted components. While network virtualization, software-defined networking (SDN), and service-based architectures (SBA) are key enablers of 5G networks, operating in an untrusted environment has also become a key feature of the networks. Further, seamless connectivity to a high volume of devices has broadened the attack surface on information infrastructure. Network assurance in a dynamic untrusted environment calls for revolutionary architectures beyond existing static security frameworks. To the best of our knowledge, this is the first position paper that presents the architectural concept design of an i-ZTA upon which modern artificial intelligence (AI) algorithms can be developed to provide information security in untrusted networks. We introduce key ZT principles as real-time Monitoring of the security state of network assets, Evaluating the risk of individual access requests, and Deciding on access authorization using a dynamic trust algorithm, called MED components. To ensure ease of integration, the envisioned architecture adopts an SBA-based design, similar to the 3GPP specification of 5G networks, by leveraging the open radio access network (O-RAN) architecture with appropriate real-time engines and network interfaces for collecting necessary machine learning data. Therefore, this work provides novel research directions to design machine learning based components that contribute towards i-ZTA for the future 5G/6G networks.  ( 2 min )
    Exact Community Recovery over Signed Graphs. (arXiv:2202.12255v1 [cs.SI])
    Signed graphs encode similarity and dissimilarity relationships among different entities with positive and negative edges. In this paper, we study the problem of community recovery over signed graphs generated by the signed stochastic block model (SSBM) with two equal-sized communities. Our approach is based on the maximum likelihood estimation (MLE) of the SSBM. Unlike many existing approaches, our formulation reveals that the positive and negative edges of a signed graph should be treated unequally. We then propose a simple two-stage iterative algorithm for solving the regularized MLE. It is shown that in the logarithmic degree regime, the proposed algorithm can exactly recover the underlying communities in nearly-linear time at the information-theoretic limit. Numerical results on both synthetic and real data are reported to validate and complement our theoretical developments and demonstrate the efficacy of the proposed method.  ( 2 min )
    Inflation of test accuracy due to data leakage in deep learning-based classification of OCT images. (arXiv:2202.12267v1 [eess.IV])
    In the application of deep learning on optical coherence tomography (OCT) data, it is common to train classification networks using 2D images originating from volumetric data. Given the micrometer resolution of OCT systems, consecutive images are often very similar in both visible structures and noise. Thus, an inappropriate data split can result in overlap between the training and testing sets, with a large portion of the literature overlooking this aspect. In this study, the effect of improper dataset splitting on model evaluation is demonstrated for two classification tasks using two OCT open-access datasets extensively used in the literature, Kermany's ophthalmology dataset and AIIMS breast tissue dataset. Our results show that the classification accuracy is inflated by 3.9 to 26 percentage units for models tested on a dataset with improper splitting, highlighting the considerable effect of dataset handling on model evaluation. This study intends to raise awareness on the importance of dataset splitting for research on deep learning using OCT data and volumetric data in general.  ( 2 min )
    Online Graph Learning in Dynamic Environments. (arXiv:2110.05023v2 [cs.LG] UPDATED)
    Inferring the underlying graph topology that characterizes structured data is pivotal to many graph-based models when pre-defined graphs are not available. This paper focuses on learning graphs in the case of sequential data in dynamic environments. For sequential data, we develop an online version of classic batch graph learning method. To better track graphs in dynamic environments, we assume graphs evolve in certain patterns such that dynamic priors might be embedded in the online graph learning framework. When the information of these hidden patterns is not available, we use history data to predict the evolution of graphs. Furthermore, dynamic regret analysis of the proposed method is performed and illustrates that our online graph learning algorithms can reach sublinear dynamic regret. Experimental results support the fact that our method is superior to the state-of-art methods.  ( 2 min )
    MLDemon: Deployment Monitoring for Machine Learning Systems. (arXiv:2104.13621v5 [cs.LG] UPDATED)
    Post-deployment monitoring of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. Here we propose a novel approach, MLDemon, for ML DEployment MONitoring. MLDemon integrates both unlabeled data and a small amount of on-demand labels to produce a real-time estimate of the ML model's current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, expert supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon outperforms existing approaches. Moreover, we provide theoretical analysis to show that MLDemon is minimax rate optimal for a broad class of distribution drifts.  ( 2 min )
    Separation of Scales and a Thermodynamic Description of Feature Learning in Some CNNs. (arXiv:2112.15383v2 [stat.ML] UPDATED)
    Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, renders direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in fully trained over-parameterized deep convolutional neural networks (CNNs). Specifically, we show that DNN layers couple only through the second moment (kernels) of their activations and pre-activations. Moreover, in various settings, the latter fluctuate in a nearly Gaussian manner. For CNNs with infinitely many channels, these kernels are inert, while for finite CNNs they adapt to the data. In several deep non-linear CNN models trained on real data, the resulting thermodynamic theory of deep learning yields accurate predictions. In addition, it provides new ways of analyzing and understanding CNNs, and DNNs in general.  ( 2 min )
    Adaptive Multi-Goal Exploration. (arXiv:2111.12045v2 [cs.LG] UPDATED)
    We introduce a generic strategy for provably efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme that leverages a measure of uncertainty in reaching states to adaptively target goals that are neither too difficult nor too easy. We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy for the (initially unknown) set of goal states that are reachable within $L$ steps in expectation from a reference state $s_0$ in a reward-free Markov decision process. In the tabular case with $S$ states and $A$ actions, our algorithm requires $\tilde{O}(L^3 S A \epsilon^{-2})$ exploration steps, which is nearly minimax optimal. We also readily instantiate AdaGoal in linear mixture Markov decision processes, yielding the first goal-oriented PAC guarantee with linear function approximation. Beyond its strong theoretical guarantees, we anchor AdaGoal in goal-conditioned deep reinforcement learning, both conceptually and empirically, by connecting its idea of selecting "uncertain" goals to maximizing value ensemble disagreement.  ( 2 min )
    INTERN: A New Learning Paradigm Towards General Vision. (arXiv:2111.08687v2 [cs.CV] UPDATED)
    Enormous waves of technological innovations over the past several years, marked by the advances in AI technologies, are profoundly reshaping the industry and the society. However, down the road, a key challenge awaits us, that is, our capability of meeting rapidly-growing scenario-specific demands is severely limited by the cost of acquiring a commensurate amount of training data. This difficult situation is in essence due to limitations of the mainstream learning paradigm: we need to train a new model for each new scenario, based on a large quantity of well-annotated data and commonly from scratch. In tackling this fundamental problem, we move beyond and develop a new learning paradigm named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the model being trained will develop strong generalizability. We evaluate our model on 26 well-known datasets that cover four categories of tasks in computer vision. In most cases, our models, adapted with only 10% of the training data in the target domain, outperform the counterparts trained with the full set of data, often by a significant margin. This is an important step towards a promising prospect where such a model with general vision capability can dramatically reduce our reliance on data, thus expediting the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which, together, form a general vision ecosystem to support its future development in an open and inclusive manner. See project website at https://opengvlab.shlab.org.cn .  ( 3 min )
    Systematic review of deep learning and machine learning for building energy. (arXiv:2202.12269v1 [cs.LG])
    The building energy (BE) management has an essential role in urban sustainability and smart cities. Recently, the novel data science and data-driven technologies have shown significant progress in analyzing the energy consumption and energy demand data sets for a smarter energy management. The machine learning (ML) and deep learning (DL) methods and applications, in particular, have been promising for the advancement of the accurate and high-performance energy models. The present study provides a comprehensive review of ML and DL-based techniques applied for handling BE systems, and it further evaluates the performance of these techniques. Through a systematic review and a comprehensive taxonomy, the advances of ML and DL-based techniques are carefully investigated, and the promising models are introduced. According to the results obtained for energy demand forecasting, the hybrid and ensemble methods are located in high robustness range, SVM-based methods are located in good robustness limitation, ANN-based methods are located in medium robustness limitation and linear regression models are located in low robustness limitations. On the other hand, for energy consumption forecasting, DL-based, hybrid, and ensemble-based models provided the highest robustness score. ANN, SVM, and single ML models provided good and medium robustness and LR-based models provided the lower robustness score. In addition, for energy load forecasting, LR-based models provided the lower robustness score. The hybrid and ensemble-based models provided a higher robustness score. The DL-based and SVM-based techniques provided a good robustness score and ANN-based techniques provided a medium robustness score.  ( 2 min )
    Integration of neural network and fuzzy logic decision making compared with bilayered neural network in the simulation of daily dew point temperature. (arXiv:2202.12256v1 [cs.LG])
    In this research, dew point temperature (DPT) is simulated using the data-driven approach. Adaptive Neuro-Fuzzy Inference System (ANFIS) is utilized as a data-driven technique to forecast this parameter at Tabriz in East Azerbaijan. Various input patterns, namely T min, T max, and T mean, are utilized for training the architecture whilst DPT is the model's output. The findings indicate that, in general, ANFIS method is capable of identifying data patterns with a high degree of accuracy. However, the approach demonstrates that processing time and computer resources may substantially increase by adding additional functions. Based on the results, the number of iterations and computing resources might change dramatically if new functionalities are included. As a result, tuning parameters have to be optimized inside the method framework. The findings demonstrate a high agreement between results by the data-driven technique (machine learning method) and the observed data. Using this prediction toolkit, DPT can be adequately forecasted solely based on the temperature distribution of Tabriz. This kind of modeling is extremely promising for predicting DPT at various sites. Besides, this study thoroughly compares the Bilayered Neural Network (BNN) and ANFIS models on various scales. Whilst the ANFIS model is extremely stable for almost all numbers of membership functions, the BNN model is highly sensitive to this scale factor to predict DPT.  ( 2 min )
    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. (arXiv:2110.07205v2 [eess.AS] UPDATED)
    Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We will release our code and model at https://github.com/microsoft/SpeechT5.  ( 2 min )
    Identifying causal relations in tweets using deep learning: Use case on diabetes-related tweets from 2017-2021. (arXiv:2111.01225v4 [cs.CL] UPDATED)
    Objective: Leveraging machine learning methods, we aim to extract both explicit and implicit cause-effect associations in patient-reported, diabetes-related tweets and provide a tool to better understand opinion, feelings and observations shared within the diabetes online community from a causality perspective. Materials and Methods: More than 30 million diabetes-related tweets in English were collected between April 2017 and January 2021. Deep learning and natural language processing methods were applied to focus on tweets with personal and emotional content. A cause-effect-tweet dataset was manually labeled and used to train 1) a fine-tuned Bertweet model to detect causal sentences containing a causal association 2) a CRF model with BERT based features to extract possible cause-effect associations. Causes and effects were clustered in a semi-supervised approach and visualised in an interactive cause-effect-network. Results: Causal sentences were detected with a recall of 68% in an imbalanced dataset. A CRF model with BERT based features outperformed a fine-tuned BERT model for cause-effect detection with a macro recall of 68%. This led to 96,676 sentences with cause-effect associations. "Diabetes" was identified as the central cluster followed by "Death" and "Insulin". Insulin pricing related causes were frequently associated with "Death". Conclusions: A novel methodology was developed to detect causal sentences and identify both explicit and implicit, single and multi-word cause and corresponding effect as expressed in diabetes-related tweets leveraging BERT-based architectures and visualised as cause-effect-network. Extracting causal associations on real-life, patient reported outcomes in social media data provides a useful complementary source of information in diabetes research.  ( 3 min )
    Simultaneous Missing Value Imputation and Structure Learning with Groups. (arXiv:2110.08223v2 [cs.LG] UPDATED)
    Learning structures between groups of variables from data with missing values is an important task in the real world, yet difficult to solve. One typical scenario is discovering the structure among topics in the education domain to identify learning pathways. Here, the observations are student performances for questions under each topic which contain missing values. However, most existing methods focus on learning structures between a few individual variables from the complete data. In this work, we propose VISL, a novel scalable structure learning approach that can simultaneously infer structures between groups of variables under missing data and perform missing value imputations with deep learning. Particularly, we propose a generative model with a structured latent space and a graph neural network-based architecture, scaling to a large number of variables. Empirically, we conduct extensive experiments on synthetic, semi-synthetic, and real-world education data sets. We show improved performances on both imputation and structure learning accuracy compared to popular and recent approaches.  ( 2 min )
    In a Nutshell, the Human Asked for This: Latent Goals for Following Temporal Specifications. (arXiv:2110.09461v2 [cs.AI] UPDATED)
    We address the problem of building agents whose goal is to learn to execute out-of distribution (OOD) multi-task instructions expressed in temporal logic (TL) by using deep reinforcement learning (DRL). Recent works provided evidence that the agent's neural architecture is a key feature when DRL agents are learning to solve OOD tasks in TL. Yet, the studies on this topic are still in their infancy. In this work, we propose a new deep learning configuration with inductive biases that lead agents to generate latent representations of their current goal, yielding a stronger generalization performance. We use these latent-goal networks within a neuro-symbolic framework that executes multi-task formally-defined instructions and contrast the performance of the proposed neural networks against employing different state-of-the-art (SOTA) architectures when generalizing to unseen instructions in OOD environments.  ( 2 min )
    Multivariate Deep Evidential Regression. (arXiv:2104.06135v4 [cs.LG] UPDATED)
    There is significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, yet several important gaps in the theory and implementation of these networks remain. We discuss three issues with a proposed solution to extract aleatoric and epistemic uncertainties from regression-based neural networks. The approach derives a technique by placing evidential priors over the original Gaussian likelihood function and training the NN to infer the hyperparameters of the evidential distribution. Doing so allows for the simultaneous extraction of both uncertainties without sampling or utilization of out-of-distribution data for univariate regression tasks. We describe the outstanding issues in detail, provide a possible solution, and generalize the deep evidential regression technique for multivariate cases.  ( 2 min )
    On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity. (arXiv:2108.07992v2 [stat.ML] UPDATED)
    We study the multi-marginal partial optimal transport (POT) problem between $m$ discrete (unbalanced) measures with at most $n$ supports. We first prove that we can obtain two equivalence forms of the multimarginal POT problem in terms of the multimarginal optimal transport problem via novel extensions of cost tensor. The first equivalence form is derived under the assumptions that the total masses of each measure are sufficiently close while the second equivalence form does not require any conditions on these masses but at the price of more sophisticated extended cost tensor. Our proof techniques for obtaining these equivalence forms rely on novel procedures of moving mass in graph theory to push transportation plan into appropriate regions. Finally, based on the equivalence forms, we develop optimization algorithm, named ApproxMPOT algorithm, that builds upon the Sinkhorn algorithm for solving the entropic regularized multimarginal optimal transport. We demonstrate that the ApproxMPOT algorithm can approximate the optimal value of multimarginal POT problem with a computational complexity upper bound of the order $\tilde{\mathcal{O}}(m^3(n+1)^{m}/ \varepsilon^2)$ where $\varepsilon > 0$ stands for the desired tolerance.  ( 2 min )
    High-quality Thermal Gibbs Sampling with Quantum Annealing Hardware. (arXiv:2109.01690v2 [quant-ph] UPDATED)
    Quantum Annealing (QA) was originally intended for accelerating the solution of combinatorial optimization tasks that have natural encodings as Ising models. However, recent experiments on QA hardware platforms have demonstrated that, in the operating regime corresponding to weak interactions, the QA hardware behaves like a noisy Gibbs sampler at a hardware-specific effective temperature. This work builds on those insights and identifies a class of small hardware-native Ising models that are robust to noise effects and proposes a procedure for executing these models on QA hardware to maximize Gibbs sampling performance. Experimental results indicate that the proposed protocol results in high-quality Gibbs samples from a hardware-specific effective temperature. Furthermore, we show that this effective temperature can be adjusted by modulating the annealing time and energy scale. The procedure proposed in this work provides an approach to using QA hardware for Ising model sampling presenting potential new opportunities for applications in machine learning and physics simulation.  ( 2 min )
    Covariance-Generalized Matching Component Analysis for Data Fusion and Transfer Learning. (arXiv:2110.13194v2 [cs.LG] UPDATED)
    In order to allow for the encoding of additional statistical information in data fusion and transfer learning applications, we introduce a generalized covariance constraint for the matching component analysis (MCA) transfer learning technique. After proving a semi-orthogonally constrained trace maximization lemma, we develop a closed-form solution to the resulting covariance-generalized optimization problem and provide an algorithm for its computation. We call this technique -- applicable to both data fusion and transfer learning -- covariance-generalized MCA (CGMCA).  ( 2 min )
    Robust Deep Learning from Crowds with Belief Propagation. (arXiv:2111.00734v2 [cs.LG] UPDATED)
    Crowdsourcing systems enable us to collect large-scale dataset, but inherently suffer from noisy labels of low-paid workers. We address the inference and learning problems using such a crowdsourced dataset with noise. Due to the nature of sparsity in crowdsourcing, it is critical to exploit both probabilistic model to capture worker prior and neural network to extract task feature despite risks from wrong prior and overfitted feature in practice. We hence establish a neural-powered Bayesian framework, from which we devise deepMF and deepBP with different choice of variational approximation methods, mean field (MF) and belief propagation (BP), respectively. This provides a unified view of existing methods, which are special cases of deepMF with different priors. In addition, our empirical study suggests that deepBP is a new approach, which is more robust against wrong prior, feature overfitting and extreme workers thanks to the more sophisticated BP than MF.  ( 2 min )
    Equivariant Deep Dynamical Model for Motion Prediction. (arXiv:2111.01892v2 [cs.LG] UPDATED)
    Learning representations through deep generative modeling is a powerful approach for dynamical modeling to discover the most simplified and compressed underlying description of the data, to then use it for other tasks such as prediction. Most learning tasks have intrinsic symmetries, i.e., the input transformations leave the output unchanged, or the output undergoes a similar transformation. The learning process is, however, usually uninformed of these symmetries. Therefore, the learned representations for individually transformed inputs may not be meaningfully related. In this paper, we propose an SO(3) equivariant deep dynamical model (EqDDM) for motion prediction that learns a structured representation of the input space in the sense that the embedding varies with symmetry transformations. EqDDM is equipped with equivariant networks to parameterize the state-space emission and transition models. We demonstrate the superior predictive performance of the proposed model on various motion data.  ( 2 min )
    BioLCNet: Reward-modulated Locally Connected Spiking Neural Networks. (arXiv:2109.05539v4 [cs.NE] UPDATED)
    Brain-inspired computation and information processing alongside compatibility with neuromorphic hardware have made spiking neural networks (SNN) a promising method for solving learning tasks in machine learning (ML). However, the mere use of spiking neurons is not sufficient for building models that mimic the learning mechanisms in the biological brain; network architecture and learning rules are important factors to consider when developing such artificial agents. In this work, inspired by the human visual pathway and the role of dopamine in learning, we propose a reward-modulated locally connected spiking neural network, BioLCNet, for visual learning tasks. To extract visual features from Poisson-distributed spike trains, we used local filters that are more analogous to the biological visual system compared to convolutional filters with weight sharing. In the decoding layer, we applied a spike population-based voting scheme to determine the decision of the network. We employed Spike-timing-dependent plasticity (STDP) for learning the visual features, and its reward-modulated variant (R-STDP) for training the decoder based on the reward or punishment feedback signal. For evaluation, we first assessed the robustness of our rewarding mechanism to varying target responses in a classical conditioning experiment. Afterwards, we evaluated the performance of our network on image classification tasks of MNIST and XOR MNIST datasets.  ( 2 min )
    Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error. (arXiv:2105.13343v2 [cs.LG] UPDATED)
    In computer vision, it is standard practice to draw a single sample from the data augmentation procedure for each unique image in the mini-batch. However recent work has suggested drawing multiple samples can achieve higher test accuracies. In this work, we provide a detailed empirical evaluation of how the number of augmentation samples per unique image influences model performance on held out data when training deep ResNets. We demonstrate drawing multiple samples per image consistently enhances the test accuracy achieved for both small and large batch training. Crucially, this benefit arises even if different numbers of augmentations per image perform the same number of parameter updates and gradient evaluations (requiring the same total compute). Although prior work has found variance in the gradient estimate arising from subsampling the dataset has an implicit regularization benefit, our experiments suggest variance which arises from the data augmentation process harms generalization. We apply these insights to the highly performant NFNet-F5, achieving 86.8$\%$ top-1 w/o extra data on ImageNet.  ( 2 min )
    Reconfigurable Intelligent Surfaces in Action for Non-Terrestrial Networks. (arXiv:2012.00968v4 [cs.IT] UPDATED)
    Next-generation communication technology will be made possible by cooperation between terrestrial networks with non-terrestrial networks (NTN) comprised of high-altitude platform stations and satellites. Further, as humanity embarks on the long road to establish new habitats on other planets, cooperation between NTN and deep-space networks (DSN) will be necessary. In this regard, we propose the use of reconfigurable intelligent surfaces (RIS) to improve coordination between these networks given that RIS perfectly match the size, weight, and power restrictions of operating in space. A comprehensive framework of RIS-assisted non-terrestrial and interplanetary communications is presented that pinpoints challenges, use cases, and open issues. Furthermore, the performance of RIS-assisted NTN under environmental effects such as solar scintillation and satellite drag is discussed in light of simulation results.  ( 2 min )
    Double Control Variates for Gradient Estimation in Discrete Latent Variable Models. (arXiv:2111.05300v2 [stat.ML] UPDATED)
    Stochastic gradient-based optimisation for discrete latent variable models is challenging due to the high variance of gradients. We introduce a variance reduction technique for score function estimators that makes use of double control variates. These control variates act on top of a main control variate, and try to further reduce the variance of the overall estimator. We develop a double control variate for the REINFORCE leave-one-out estimator using Taylor expansions. For training discrete latent variable models, such as variational autoencoders with binary latent variables, our approach adds no extra computational cost compared to standard training with the REINFORCE leave-one-out estimator. We apply our method to challenging high-dimensional toy examples and training variational autoencoders with binary latent variables. We show that our estimator can have lower variance compared to other state-of-the-art estimators.  ( 2 min )
    Sobolev Norm Learning Rates for Conditional Mean Embeddings. (arXiv:2105.07446v3 [stat.ML] UPDATED)
    We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces (RKHS). We derive explicit, adaptive convergence rates for the sample estimator under the misspecifed setting, where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs. We demonstrate that in certain parameter regimes, we can achieve uniform convergence rates in the output RKHS. We hope our analyses will allow the much broader application of conditional mean embeddings to more complex ML/RL settings involving infinite dimensional RKHSs and continuous state spaces.  ( 2 min )
    A Perceptual Measure for Evaluating the Resynthesis of Automatic Music Transcriptions. (arXiv:2202.12257v1 [cs.SD])
    This study focuses on the perception of music performances when contextual factors, such as room acoustics and instrument, change. We propose to distinguish the concept of "performance" from the one of "interpretation", which expresses the "artistic intention". Towards assessing this distinction, we carried out an experimental evaluation where 91 subjects were invited to listen to various audio recordings created by resynthesizing MIDI data obtained through Automatic Music Transcription (AMT) systems and a sensorized acoustic piano. During the resynthesis, we simulated different contexts and asked listeners to evaluate how much the interpretation changes when the context changes. Results show that: (1) MIDI format alone is not able to completely grasp the artistic intention of a music performance; (2) usual objective evaluation measures based on MIDI data present low correlations with the average subjective evaluation. To bridge this gap, we propose a novel measure which is meaningfully correlated with the outcome of the tests. In addition, we investigate multimodal machine learning by providing a new score-informed AMT method and propose an approximation algorithm for the $p$-dispersion problem.  ( 2 min )
    A Multilayered Block Network Model to Forecast Large Dynamic Transportation Graphs: an Application to US Air Transport. (arXiv:1911.13136v4 [stat.ML] UPDATED)
    Dynamic transportation networks have been analyzed for years by means of static graph-based indicators in order to study the temporal evolution of relevant network components, and to reveal complex dependencies that would not be easily detected by a direct inspection of the data. This paper presents a state-of-the-art probabilistic latent network model to forecast multilayer dynamic graphs that are increasingly common in transportation and proposes a community-based extension to reduce the computational burden. Flexible time series analysis is obtained by modeling the probability of edges between vertices through latent Gaussian processes. The models and Bayesian inference are illustrated on a sample of 10-year data from four major airlines within the US air transportation system. Results show how the estimated latent parameters from the models are related to the airline's connectivity dynamics, and their ability to project the multilayer graph into the future for out-of-sample full network forecasts, while stochastic blockmodeling allows for the identification of relevant communities. Reliable network predictions would allow policy-makers to better understand the dynamics of the transport system, and help in their planning on e.g. route development, or the deployment of new regulations.  ( 2 min )
    A Simple and General Debiased Machine Learning Theorem with Finite Sample Guarantees. (arXiv:2105.15197v2 [stat.ML] UPDATED)
    Debiased machine learning is a meta algorithm based on bias correction and sample splitting to calculate confidence intervals for functionals (i.e. scalar summaries) of machine learning algorithms. For example, an analyst may desire the confidence interval for a treatment effect estimated with a neural network. We provide a nonasymptotic debiased machine learning theorem that encompasses any global or local functional of any machine learning algorithm that satisfies a few simple, interpretable conditions. Formally, we prove consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments. The rate of convergence is $n^{-1/2}$ for global functionals, and it degrades gracefully for local functionals. Our results culminate in a simple set of conditions that an analyst can use to translate modern learning theory rates into traditional statistical inference. The conditions reveal a general double robustness property for ill posed inverse problems.  ( 2 min )
    Disentanglement via Mechanism Sparsity Regularization: A New Principle for Nonlinear ICA. (arXiv:2107.10098v3 [stat.ML] UPDATED)
    This work introduces a novel principle we call disentanglement via mechanism sparsity regularization, which can be applied when the latent factors of interest depend sparsely on past latent factors and/or observed auxiliary variables. We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors and the sparse causal graphical model that relates them. We develop a rigorous identifiability theory, building on recent nonlinear independent component analysis (ICA) results, that formalizes this principle and shows how the latent variables can be recovered up to permutation if one regularizes the latent mechanisms to be sparse and if some graph connectivity criterion is satisfied by the data generating process. As a special case of our framework, we show how one can leverage unknown-target interventions on the latent factors to disentangle them, thereby drawing further connections between ICA and causality. We propose a VAE-based method in which the latent mechanisms are learned and regularized via binary masks, and validate our theory by showing it learns disentangled representations in simulations.  ( 2 min )
    Effect Identification in Cluster Causal Diagrams. (arXiv:2202.12263v1 [stat.ME])
    One pervasive task found throughout the empirical sciences is to determine the effect of interventions from non-experimental data. It is well-understood that assumptions are necessary to perform causal inferences, which are commonly articulated through causal diagrams (Pearl, 2000). Despite the power of this approach, there are settings where the knowledge necessary to specify a causal diagram over all observed variables may not be available, particularly in complex, high-dimensional domains. In this paper, we introduce a new type of graphical model called cluster causal diagrams (for short, C-DAGs) that allows for the partial specification of relationships among variables based on limited prior knowledge, alleviating the stringent requirement of specifying a full causal diagram. A C-DAG specifies relationships between clusters of variables, while the relationships between the variables within a cluster are left unspecified. We develop the foundations and machinery for valid causal inferences over C-DAGs. In particular, we first define a new version of the d-separation criterion and prove its soundness and completeness. Secondly, we extend these new separation rules and prove the validity of the corresponding do-calculus. Lastly, we show that a standard identification algorithm is sound and complete to systematically compute causal effects from observational data given a C-DAG.  ( 2 min )
    Submodular Maximization in Clean Linear Time. (arXiv:2006.09327v4 [cs.DS] UPDATED)
    In this paper, we provide the first deterministic algorithm that achieves the tight $1-1/e$ approximation guarantee for submodular maximization under a cardinality (size) constraint while making a number of queries that scales only linearly with the size of the ground set $n$. To complement our result, we also show strong information-theoretic lower bounds. More specifically, we show that when the maximum cardinality allowed for a solution is constant, no algorithm making a sub-linear number of function evaluations can guarantee any constant approximation ratio. Furthermore, when the constraint allows the selection of a constant fraction of the ground set, we show that any algorithm making fewer than $\Omega(n/\log(n))$ function evaluations cannot perform better than an algorithm that simply outputs a uniformly random subset of the ground set of the right size. We then provide a variant of our deterministic algorithm for the more general knapsack constraint, which is the first linear-time algorithm that achieves $1/2$-approximation guarantee for this constraint. Finally, we extend our results to the general case of maximizing a monotone submodular function subject to the intersection of a $p$-set system and multiple knapsack constraints. We extensively evaluate the performance of our algorithms on multiple real-life machine learning applications, including movie recommendation, location summarization, twitter text summarization and video summarization.  ( 2 min )
    Quadric Hypersurface Intersection for Manifold Learning in Feature Space. (arXiv:2102.06186v2 [cs.LG] UPDATED)
    The knowledge that data lies close to a particular submanifold of the ambient Euclidean space may be useful in a number of ways. For instance, one may want to automatically mark any point far away from the submanifold as an outlier or to use the geometry to come up with a better distance metric. Manifold learning problems are often posed in a very high dimension, e.g. for spaces of images or spaces of words. Today, with deep representation learning on the rise in areas such as computer vision and natural language processing, many problems of this kind may be transformed into problems of moderately high dimension, typically of the order of hundreds. Motivated by this, we propose a manifold learning technique suitable for moderately high dimension and large datasets. The manifold is learned from the training data in the form of an intersection of quadric hypersurfaces -- simple but expressive objects. At test time, this manifold can be used to introduce a computationally efficient outlier score for arbitrary new data points and to improve a given similarity metric by incorporating the learned geometric structure into it.  ( 2 min )
    Factorizer: A Scalable Interpretable Approach to Context Modeling for Medical Image Segmentation. (arXiv:2202.12295v1 [eess.IV])
    Convolutional Neural Networks (CNNs) with U-shaped architectures have dominated medical image segmentation, which is crucial for various clinical purposes. However, the inherent locality of convolution makes CNNs fail to fully exploit global context, essential for better recognition of some structures, e.g., brain lesions. Transformers have recently proved promising performance on vision tasks, including semantic segmentation, mainly due to their capability of modeling long-range dependencies. Nevertheless, the quadratic complexity of attention makes existing Transformer-based models use self-attention layers only after somehow reducing the image resolution, which limits the ability to capture global contexts present at higher resolutions. Therefore, this work introduces a family of models, dubbed Factorizer, which leverages the power of low-rank matrix factorization for constructing an end-to-end segmentation model. Specifically, we propose a linearly scalable approach to context modeling, formulating Nonnegative Matrix Factorization (NMF) as a differentiable layer integrated into a U-shaped architecture. The shifted window technique is also utilized in combination with NMF to effectively aggregate local information. Factorizers compete favorably with CNNs and Transformers in terms of accuracy, scalability, and interpretability, achieving state-of-the-art results on the BraTS dataset for brain tumor segmentation, with Dice scores of 79.33%, 83.14%, and 90.16% for enhancing tumor, tumor core, and whole tumor, respectively. Highly meaningful NMF components give an additional interpretability advantage to Factorizers over CNNs and Transformers. Moreover, our ablation studies reveal a distinctive feature of Factorizers that enables a significant speed-up in inference for a trained Factorizer without any extra steps and without sacrificing much accuracy.  ( 2 min )
    Policy design in experiments with unknown interference. (arXiv:2011.08174v6 [econ.EM] UPDATED)
    This paper proposes an experimental design for estimation and inference on welfare-maximizing policies in the presence of spillover effects. I consider a setting where units are organized into a finite number of large clusters and interact in unknown ways within each cluster. As a first contribution, I introduce a single-wave experiment which, by carefully varying the randomization across pairs of clusters, estimates the marginal effect of a change in treatment probabilities, taking spillover effects into account. Using the marginal effect, I propose a practical test for policy optimality. The idea is that researchers should report the marginal effect and test for policy optimality: the marginal effect indicates the direction for a welfare improvement, and the test provides evidence on whether it is worth conducting additional experiments to estimate a welfare-improving treatment allocation. As a second contribution, I design a multiple-wave experiment to estimate treatment assignment rules and maximize welfare. I derive strong small-sample guarantees on the difference between the maximum attainable welfare and the welfare evaluated at the estimated policy, which converges linearly in the number of clusters and iterations. Simulations calibrated to existing experiments on information diffusion and cash-transfer programs show welfare improvements up to fifty percentage points.  ( 2 min )
    On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems. (arXiv:2202.12262v1 [cs.LG])
    We study the loss landscape of training problems for deep artificial neural networks with a one-dimensional real output whose activation functions contain an affine segment and whose hidden layers have width at least two. It is shown that such problems possess a continuum of spurious (i.e., not globally optimal) local minima for all target functions that are not affine. In contrast to previous works, our analysis covers all sampling and parameterization regimes, general differentiable loss functions, arbitrary continuous nonpolynomial activation functions, and both the finite- and infinite-dimensional setting. It is further shown that the appearance of the spurious local minima in the considered training problems is a direct consequence of the universal approximation theorem and that the underlying mechanisms also cause, e.g., Lp-best approximation problems to be ill-posed in the sense of Hadamard for all networks that do not have a dense image. The latter result also holds without the assumption of local affine linearity and without any conditions on the hidden layers. The paper concludes with a numerical experiment which demonstrates that spurious local minima can indeed affect the convergence behavior of gradient-based solution algorithms in practice.  ( 2 min )
    Measuring CLEVRness: Blackbox testing of Visual Reasoning Models. (arXiv:2202.12162v1 [cs.LG])
    How can we measure the reasoning capabilities of intelligence systems? Visual question answering provides a convenient framework for testing the model's abilities by interrogating the model through questions about the scene. However, despite scores of various visual QA datasets and architectures, which sometimes yield even a super-human performance, the question of whether those architectures can actually reason remains open to debate. To answer this, we extend the visual question answering framework and propose the following behavioral test in the form of a two-player game. We consider black-box neural models of CLEVR. These models are trained on a diagnostic dataset benchmarking reasoning. Next, we train an adversarial player that re-configures the scene to fool the CLEVR model. We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning.  ( 2 min )
    Partitioned Variational Inference: A framework for probabilistic federated learning. (arXiv:2202.12275v1 [stat.ML])
    The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.  ( 2 min )
    Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models. (arXiv:2010.13933v4 [stat.ML] UPDATED)
    The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in \emph{both} the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.  ( 3 min )
    On the influence of roundoff errors on the convergence of the gradient descent method with low-precision floating-point computation. (arXiv:2202.12276v1 [cs.LG])
    The employment of stochastic rounding schemes helps prevent stagnation of convergence, due to vanishing gradient effect when implementing the gradient descent method in low precision. Conventional stochastic rounding achieves zero bias by preserving small updates with probabilities proportional to their relative magnitudes. In this study, we propose a new stochastic rounding scheme that trades the zero bias property with a larger probability to preserve small gradients. Our method yields a constant rounding bias that, at each iteration, lies in a descent direction. For convex problems, we prove that the proposed rounding method has a beneficial effect on the convergence rate of gradient descent. We validate our theoretical analysis by comparing the performances of various rounding schemes when optimizing a multinomial logistic regression model and when training a simple neural network with 8-bit floating-point format.  ( 2 min )
    Evaluating Feature Attribution Methods in the Image Domain. (arXiv:2202.12270v1 [cs.CV])
    Feature attribution maps are a popular approach to highlight the most important pixels in an image for a given prediction of a model. Despite a recent growth in popularity and available methods, little attention is given to the objective evaluation of such attribution maps. Building on previous work in this domain, we investigate existing metrics and propose new variants of metrics for the evaluation of attribution maps. We confirm a recent finding that different attribution metrics seem to measure different underlying concepts of attribution maps, and extend this finding to a larger selection of attribution metrics. We also find that metric results on one dataset do not necessarily generalize to other datasets, and methods with desirable theoretical properties such as DeepSHAP do not necessarily outperform computationally cheaper alternatives. Based on these findings, we propose a general benchmarking approach to identify the ideal feature attribution method for a given use case. Implementations of attribution metrics and our experiments are available online.  ( 2 min )
    Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization. (arXiv:2111.00705v2 [cs.LG] UPDATED)
    Due to the explosion in the size of the training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the large communication cost between the central server and the local workers. While error feedback compression has been proven to be successful in reducing communication costs with stochastic gradient descent (SGD), there are much fewer attempts in building communication-efficient adaptive gradient methods with provable guarantees, which are widely used in training large-scale machine learning models. In this paper, we propose a new communication-compressed AMSGrad for distributed nonconvex optimization problem, which is provably efficient. Our proposed distributed learning framework features an effective gradient compression strategy and a worker-side model update design. We prove that the proposed communication-efficient distributed adaptive gradient method converges to the first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting. Experiments on various benchmarks back up our theory.  ( 2 min )
    Analyzing Micro-Founded General Equilibrium Models with Many Agents using Deep Reinforcement Learning. (arXiv:2201.01163v2 [cs.GT] UPDATED)
    Real economies can be modeled as a sequential imperfect-information game with many heterogeneous agents, such as consumers, firms, and governments. Dynamic general equilibrium (DGE) models are often used for macroeconomic analysis in this setting. However, finding general equilibria is challenging using existing theoretical or computational methods, especially when using microfoundations to model individual agents. Here, we show how to use deep multi-agent reinforcement learning (MARL) to find $\epsilon$-meta-equilibria over agent types in microfounded DGE models. Whereas standard MARL fails to learn non-trivial solutions, our structured learning curricula enable stable convergence to meaningful solutions. Conceptually, our approach is more flexible and does not need unrealistic assumptions, e.g., continuous market clearing, that are commonly used for analytical tractability. Furthermore, our end-to-end GPU implementation enables fast real-time convergence with a large number of RL economic agents. We showcase our approach in open and closed real-business-cycle (RBC) models with 100 worker-consumers, 10 firms, and a social planner who taxes and redistributes. We validate the learned solutions are $\epsilon$-meta-equilibria through best-response analyses, show that they align with economic intuitions, and show our approach can learn a spectrum of qualitatively distinct $\epsilon$-meta-equilibria in open RBC models. As such, we show that hardware-accelerated MARL is a promising framework for modeling the complexity of economies based on microfoundations.  ( 2 min )
    Reinforcement Learning for Personalized Drug Discovery and Design for Complex Diseases: A Systems Pharmacology Perspective. (arXiv:2201.08894v3 [q-bio.BM] UPDATED)
    Many multi-genic systemic diseases such as neurological disorders, inflammatory diseases, and the majority of cancers do not have effective treatments yet. Reinforcement learning powered systems pharmacology is a potentially effective approach to design personalized therapies for untreatable complex diseases. In this survey, state-of-the-art reinforcement learning methods and their latest applications to drug design are reviewed. The challenges on harnessing reinforcement learning for systems pharmacology and personalized medicine are discussed. Potential solutions to overcome the challenges are proposed. In spite of successful application of advanced reinforcement learning techniques to target-based drug discovery, new reinforcement learning strategies are needed to address systems pharmacology-oriented personalized de novo drug design.  ( 2 min )
    Multiscale Generative Models: Improving Performance of a Generative Model Using Feedback from Other Dependent Generative Models. (arXiv:2201.09644v2 [cs.LG] UPDATED)
    Realistic fine-grained multi-agent simulation of real-world complex systems is crucial for many downstream tasks such as reinforcement learning. Recent work has used generative models (GANs in particular) for providing high-fidelity simulation of real-world systems. However, such generative models are often monolithic and miss out on modeling the interaction in multi-agent systems. In this work, we take a first step towards building multiple interacting generative models (GANs) that reflects the interaction in real world. We build and analyze a hierarchical set-up where a higher-level GAN is conditioned on the output of multiple lower-level GANs. We present a technique of using feedback from the higher-level GAN to improve performance of lower-level GANs. We mathematically characterize the conditions under which our technique is impactful, including understanding the transfer learning nature of our set-up. We present three distinct experiments on synthetic data, time series data, and image domain, revealing the wide applicability of our technique.  ( 2 min )
    Block Policy Mirror Descent. (arXiv:2201.05756v2 [cs.LG] UPDATED)
    In this paper, we present a new class of policy gradient (PG) methods, namely the block policy mirror descent (BPMD) methods for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to the traditional PG methods with a batch update rule, which visits and updates the policy for every state, BPMD methods have cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and a partial update rule, BPMD methods achieve fast linear convergence to the global optimality. We further extend BPMD methods to the stochastic setting, by utilizing stochastic first-order information constructed from samples. We establish $\mathcal{O}(1/\epsilon)$ (resp. $\mathcal{O}(1/\epsilon^2)$) sample complexity for the strongly convex (resp. non-strongly convex) regularizers, where $\epsilon$ denotes the target accuracy, with different procedures for constructing the stochastic first-order information. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.  ( 2 min )
    Self-Supervised Radio-Visual Representation Learning for 6G Sensing. (arXiv:2111.02887v2 [cs.NI] UPDATED)
    In future 6G cellular networks, a joint communication and sensing protocol will allow the network to perceive the environment, opening the door for many new applications atop a unified communication-perception infrastructure. However, interpreting the sparse radio representation of sensing scenes is challenging, which hinders the potential of these emergent systems. We propose to combine radio and vision to automatically learn a radio-only sensing model with minimal human intervention. We want to build a radio sensing model that can feed on millions of uncurated data points. To this end, we leverage recent advances in self-supervised learning and formulate a new label-free radio-visual co-learning scheme, whereby vision trains radio via cross-modal mutual information. We implement and evaluate our scheme according to the common linear classification benchmark, and report qualitative and quantitative performance metrics. In our evaluation, the representation learnt by radio-visual self-supervision works well for a downstream sensing demonstrator, and outperforms its fully-supervised counterpart when less labelled data is used. This indicates that self-supervised learning could be an important enabler for future scalable radio sensing systems.  ( 2 min )
    Neighborhood Region Smoothing Regularization for Finding Flat Minima In Deep Neural Networks. (arXiv:2201.06064v2 [cs.LG] UPDATED)
    Due to diverse architectures in deep neural networks (DNNs) with severe overparameterization, regularization techniques are critical for finding optimal solutions in the huge hypothesis space. In this paper, we propose an effective regularization technique, called Neighborhood Region Smoothing (NRS). NRS leverages the finding that models would benefit from converging to flat minima, and tries to regularize the neighborhood region in weight space to yield approximate outputs. Specifically, gap between outputs of models in the neighborhood region is gauged by a defined metric based on Kullback-Leibler divergence. This metric provides similar insights with the minimum description length principle on interpreting flat minima. By minimizing both this divergence and empirical loss, NRS could explicitly drive the optimizer towards converging to flat minima. We confirm the effectiveness of NRS by performing image classification tasks across a wide range of model architectures on commonly-used datasets such as CIFAR and ImageNet, where generalization ability could be universally improved. Also, we empirically show that the minima found by NRS would have relatively smaller Hessian eigenvalues compared to the conventional method, which is considered as the evidence of flat minima.  ( 2 min )
    Conditionally Gaussian PAC-Bayes. (arXiv:2110.11886v2 [cs.LG] UPDATED)
    Recent studies have empirically investigated different methods to train stochastic neural networks on a classification task by optimising a PAC-Bayesian bound via stochastic gradient descent. Most of these procedures need to replace the misclassification error with a surrogate loss, leading to a mismatch between the optimisation objective and the actual generalisation bound. The present paper proposes a novel training algorithm that optimises the PAC-Bayesian bound, without relying on any surrogate loss. Empirical results show that this approach outperforms currently available PAC-Bayesian training methods.  ( 2 min )
    Universal Efficient Variable-rate Neural Image Compression. (arXiv:2111.11305v4 [eess.IV] UPDATED)
    Recently, Learning-based image compression has reached comparable performance with traditional image codecs(such as JPEG, BPG, WebP). However, computational complexity and rate flexibility are still two major challenges for its practical deployment. To tackle these problems, this paper proposes two universal modules named Energy-based Channel Gating(ECG) and Bit-rate Modulator(BM), which can be directly embedded into existing end-to-end image compression models. ECG uses dynamic pruning to reduce FLOPs for more than 50\% in convolution layers, and a BM pair can modulate the latent representation to control the bit-rate in a channel-wise manner. By implementing these two modules, existing learning-based image codecs can obtain ability to output arbitrary bit-rate with a single model and reduced computation.  ( 2 min )
    An Alternate Policy Gradient Estimator for Softmax Policies. (arXiv:2112.11622v2 [cs.LG] UPDATED)
    Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability mass on sub-optimal actions. Sub-optimal policy saturation may arise from bad policy initialization or sudden changes in the environment that occur after the policy has already converged. Current softmax PG estimators require a large number of updates to overcome policy saturation, which causes low sample efficiency and poor adaptability to new situations. To mitigate this problem, we propose a novel PG estimator for softmax policies that utilizes the bias in the critic estimate and the noise present in the reward signal to escape the saturated regions of the policy parameter space. Our theoretical analysis and experiments, conducted on bandits and various reinforcement learning environments, show that this new estimator is significantly more robust to policy saturation.  ( 2 min )
    TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion. (arXiv:2106.03947v3 [cs.LG] UPDATED)
    This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application by reducing the Fisher matrix inversion cost with approximation. However, the approximations do not reduce the overall time significantly and lead to less accurate parameter updates and loss of curvature information. TENGraD improves the time efficiency of NGD by computing Fisher block inverses with a computationally efficient covariance factorization and reuse method. It computes the inverse of each block exactly using the Woodbury matrix identity to preserve curvature information while admitting (linear) fast convergence rates. Our experiments on image classification tasks for state-of-the-art deep neural architecture on CIFAR-10, CIFAR-100, and Fashion-MNIST show that TENGraD significantly outperforms state-of-the-art NGD methods and often stochastic gradient descent in wall-clock time.  ( 2 min )
    Cognitive Explainers of Graph Neural Networks Based on Medical Concepts. (arXiv:2201.07798v2 [cs.LG] UPDATED)
    Although deep neural networks (DNN) have achieved state-of-the-art performance in various fields, some unexpected errors are often found in the neural network, which is very dangerous for some tasks requiring high reliability and high security. The non-transparency and unexplainably of Convolutional Neural Networks (CNN) still limit its application in many fields, such as medical care and finance. Despite current studies that have been committed to visualizing the decision process of DNN, most of these methods focus on the low level and do not take into account the prior knowledge of medicine. In this work, we propose an interpretable framework based on key medical concepts, enabling CNN to explain from the perspective of doctors' cognition. We propose an interpretable automatic recognition framework for the ultrasonic standard plane, which uses a concept-based graph convolutional neural network to construct the relationships between key medical concepts, to obtain an interpretation consistent with a doctor's cognition. Extensive experiments have empirically shown that our model can meaningfully explain the decision of the classifier and provide quantitative support.  ( 2 min )
    Capturing Failures of Large Language Models via Human Cognitive Biases. (arXiv:2202.12299v1 [cs.CL])
    Large language models generate complex, open-ended outputs: instead of outputting a single class, they can write summaries, generate dialogue, and produce working code. In order to study the reliability of these open-ended systems, we must understand not just when they fail, but also how they fail. To approach this, we draw inspiration from human cognitive biases -- systematic patterns of deviation from rational judgement. Specifically, we use cognitive biases to (i) identify inputs that models are likely to err on, and (ii) develop tests to qualitatively characterize their errors on these inputs. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to uncover high-impact errors such as incorrectly deleting files. Our experiments suggest that cognitive science can be a useful jumping-off point to better understand how contemporary machine learning systems behave.  ( 2 min )
    Fast and Scalable Spike and Slab Variable Selection in High-Dimensional Gaussian Processes. (arXiv:2111.04558v2 [stat.ML] UPDATED)
    Variable selection in Gaussian processes (GPs) is typically undertaken by thresholding the inverse lengthscales of automatic relevance determination kernels, but in high-dimensional datasets this approach can be unreliable. A more probabilistically principled alternative is to use spike and slab priors and infer a posterior probability of variable inclusion. However, existing implementations in GPs are very costly to run in both high-dimensional and large-$n$ datasets, or are only suitable for unsupervised settings with specific kernels. As such, we develop a fast and scalable variational inference algorithm for the spike and slab GP that is tractable with arbitrary differentiable kernels. We improve our algorithm's ability to adapt to the sparsity of relevant variables by Bayesian model averaging over hyperparameters, and achieve substantial speed ups using zero temperature posterior restrictions, dropout pruning and nearest neighbour minibatching. In experiments our method consistently outperforms vanilla and sparse variational GPs whilst retaining similar runtimes (even when $n=10^6$) and performs competitively with a spike and slab GP using MCMC but runs up to $1000$ times faster.  ( 2 min )
    Resampling Base Distributions of Normalizing Flows. (arXiv:2110.15828v2 [stat.ML] UPDATED)
    Normalizing flows are a popular class of models for approximating probability distributions. However, their invertible nature limits their ability to model target distributions whose support have a complex topological structure, such as Boltzmann distributions. Several procedures have been proposed to solve this problem but many of them sacrifice invertibility and, thereby, tractability of the log-likelihood as well as other desirable properties. To address these limitations, we introduce a base distribution for normalizing flows based on learned rejection sampling, allowing the resulting normalizing flow to model complicated distributions without giving up bijectivity. Furthermore, we develop suitable learning algorithms using both maximizing the log-likelihood and the optimization of the Kullback-Leibler divergence, and apply them to various sample problems, i.e. approximating 2D densities, density estimation of tabular data, image generation, and modeling Boltzmann distributions. In these experiments our method is competitive with or outperforms the baselines.  ( 2 min )
    Step-unrolled Denoising Autoencoders for Text Generation. (arXiv:2112.06749v2 [cs.CL] UPDATED)
    In this paper we propose a new generative model of text, Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied on a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results (among non-autoregressive methods) on the WMT'14 English-to-German translation task and good qualitative results on unconditional language modeling on the Colossal Cleaned Common Crawl dataset and a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompted generation, by filling in arbitrary blank patterns in a template.  ( 2 min )
    Serverless Model Serving for Data Science. (arXiv:2103.02958v2 [cs.DC] UPDATED)
    Machine learning (ML) is an important part of modern data science applications. Data scientists today have to manage the end-to-end ML life cycle that includes both model training and model serving, the latter of which is essential, as it makes their works available to end-users. Systems of model serving require high performance, low cost, and ease of management. Cloud providers are already offering model serving choices, including managed services and self-rented servers. Recently, serverless computing, whose advantages include high elasticity and a fine-grained cost model, brings another option for model serving. Our goal in this paper is to examine the viability of serverless as a mainstream model serving platform. To this end, we first conduct a comprehensive evaluation of the performance and cost of serverless against other model serving systems on Amazon Web Service and Google Cloud Platform. We find that serverless outperforms many cloud-based alternatives. Further, there are settings under which it even achieves better performance than GPU-based systems. Next, we present the design space of serverless model serving, which comprises multiple dimensions, including cloud platforms, serving runtimes, and other function-specific parameters. For each dimension, we analyze the impact of different choices and provide suggestions for data scientists to better utilize serverless model serving. Finally, we discuss challenges and opportunities in building a more practical serverless model serving system.  ( 2 min )
    Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm. (arXiv:2102.12238v3 [cs.LG] UPDATED)
    We provide a function space characterization of the inductive bias resulting from minimizing the $\ell_2$ norm of the weights in multi-channel convolutional neural networks with linear activations and empirically test our resulting hypothesis on ReLU networks trained using gradient descent. We define an induced regularizer in the function space as the minimum $\ell_2$ norm of weights of a network required to realize a function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network are single channeled, the induced regularizer for any $K$ is independent of the number of output channels $C$. Furthermore, we derive the regularizer is a norm given by a semidefinite program (SDP). (b) In contrast, for multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions and thus the inductive bias does depend on $C$. However, for sufficiently large $C$, the induced regularizer is again given by an SDP that is independent of $C$. In particular, the induced regularizer for $K=1$ and $K=D$ (input dimension) are given in closed form as the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients of the linear predictor. We investigate the broader applicability of our theoretical results to implicit regularization from gradient descent on linear and ReLU networks through experiments on MNIST and CIFAR-10 datasets.  ( 2 min )
    Flat latent manifolds for music improvisation between human and machine. (arXiv:2202.12243v1 [cs.SD])
    The use of machine learning in artistic music generation leads to controversial discussions of the quality of art, for which objective quantification is nonsensical. We therefore consider a music-generating algorithm as a counterpart to a human musician, in a setting where reciprocal improvisation is to lead to new experiences, both for the musician and the audience. To obtain this behaviour, we resort to the framework of recurrent Variational Auto-Encoders (VAE) and learn to generate music, seeded by a human musician. In the learned model, we generate novel musical sequences by interpolation in latent space. Standard VAEs however do not guarantee any form of smoothness in their latent representation. This translates into abrupt changes in the generated music sequences. To overcome these limitations, we regularise the decoder and endow the latent space with a flat Riemannian manifold, i.e., a manifold that is isometric to the Euclidean space. As a result, linearly interpolating in the latent space yields realistic and smooth musical changes that fit the type of machine--musician interactions we aim for. We provide empirical evidence for our method via a set of experiments on music datasets and we deploy our model for an interactive jam session with a professional drummer. The live performance provides qualitative evidence that the latent representation can be intuitively interpreted and exploited by the drummer to drive the interplay. Beyond the musical application, our approach showcases an instance of human-centred design of machine-learning models, driven by interpretability and the interaction with the end user.  ( 2 min )
    Sample Efficiency of Data Augmentation Consistency Regularization. (arXiv:2202.12230v1 [cs.LG])
    Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data. In this paper, we take a step in this direction - we first present a simple and novel analysis for linear regression, demonstrating that data augmentation consistency (DAC) is intrinsically more efficient than empirical risk minimization on augmented data (DA-ERM). We then propose a new theoretical framework for analyzing DAC, which reframes DAC as a way to reduce function class complexity. The new framework characterizes the sample efficiency of DAC for various non-linear models (e.g., neural networks). Further, we perform experiments that make a clean and apples-to-apples comparison (i.e., with no extra modeling or data tweaks) between ERM and consistency regularization using CIFAR-100 and WideResNet; these together demonstrate the superior efficacy of DAC.  ( 2 min )
    Overcoming a Theoretical Limitation of Self-Attention. (arXiv:2202.12172v1 [cs.LG])
    Although transformers are remarkably effective for many tasks, there are some surprisingly easy-looking regular languages that they struggle with. Hahn shows that for languages where acceptance depends on a single input symbol, a transformer's classification decisions become less and less confident (that is, with cross-entropy approaching 1 bit per string) as input strings get longer and longer. We examine this limitation using two languages: PARITY, the language of bit strings with an odd number of 1s, and FIRST, the language of bit strings starting with a 1. We demonstrate three ways of overcoming the limitation suggested by Hahn's lemma. First, we settle an open question by constructing a transformer that recognizes PARITY with perfect accuracy, and similarly for FIRST. Second, we use layer normalization to bring the cross-entropy of both models arbitrarily close to zero. Third, when transformers need to focus on a single position, as for FIRST, we find that they can fail to generalize to longer strings; we offer a simple remedy to this problem that also improves length generalization in machine translation.  ( 2 min )
    Sequential Asset Ranking within Nonstationary Time Series. (arXiv:2202.12186v1 [cs.CE])
    Financial time series are both autocorrelated and nonstationary, presenting modelling challenges that violate the independent and identically distributed random variables assumption of most regression and classification models. The prediction with expert advice framework makes no assumptions on the data-generating mechanism yet generates predictions that work well for all sequences, with performance nearly as good as the best expert with hindsight. We conduct research using S&P 250 daily sampled data, extending the academic research into cross-sectional momentum trading strategies. We introduce a novel ranking algorithm from the prediction with expert advice framework, the naive Bayes asset ranker, to select subsets of assets to hold in either long-only or long/short portfolios. Our algorithm generates the best total returns and risk-adjusted returns, net of transaction costs, outperforming the long-only holding of the S&P 250 with hindsight. Furthermore, our ranking algorithm outperforms a proxy for the regress-then-rank cross-sectional momentum trader, a sequentially fitted curds and whey multivariate regression procedure.  ( 2 min )
    Clustering Edges in Directed Graphs. (arXiv:2202.12265v1 [cs.SI])
    How do vertices exert influence in graph data? We develop a framework for edge clustering, a new method for exploratory data analysis that reveals how both vertices and edges collaboratively accomplish directed influence in graphs, especially for directed graphs. In contrast to the ubiquitous vertex clustering which groups vertices, edge clustering groups edges. Edges sharing a functional affinity are assigned to the same group and form an influence subgraph cluster. With a complexity comparable to that of vertex clustering, this framework presents three different methods for edge spectral clustering that reveal important influence subgraphs in graph data, with each method providing different insight into directed influence processes. We present several diverse examples demonstrating the potential for widespread application of edge clustering in scientific research.  ( 2 min )
    Clarifying MCMC-based training of modern EBMs : Contrastive Divergence versus Maximum Likelihood. (arXiv:2202.12176v1 [cs.LG])
    The Energy-Based Model (EBM) framework is a very general approach to generative modeling that tries to learn and exploit probability distributions only defined though unnormalized scores. It has risen in popularity recently thanks to the impressive results obtained in image generation by parameterizing the distribution with Convolutional Neural Networks (CNN). However, the motivation and theoretical foundations behind modern EBMs are often absent from recent papers and this sometimes results in some confusion. In particular, the theoretical justifications behind the popular MCMC-based learning algorithm Contrastive Divergence (CD) are often glossed over and we find that this leads to theoretical errors in recent influential papers (Du & Mordatch, 2019; Du et al., 2020). After offering a first-principles introduction of MCMC-based training, we argue that the learning algorithm they use can in fact not be described as CD and reinterpret theirs methods in light of a new interpretation. Finally, we discuss the implications of our new interpretation and provide some illustrative experiments.  ( 2 min )
    Collaborative Training of Heterogeneous Reinforcement Learning Agents in Environments with Sparse Rewards: What and When to Share?. (arXiv:2202.12174v1 [cs.LG])
    In the early stages of human life, babies develop their skills by exploring different scenarios motivated by their inherent satisfaction rather than by extrinsic rewards from the environment. This behavior, referred to as intrinsic motivation, has emerged as one solution to address the exploration challenge derived from reinforcement learning environments with sparse rewards. Diverse exploration approaches have been proposed to accelerate the learning process over single- and multi-agent problems with homogeneous agents. However, scarce studies have elaborated on collaborative learning frameworks between heterogeneous agents deployed into the same environment, but interacting with different instances of the latter without any prior knowledge. Beyond the heterogeneity, each agent's characteristics grant access only to a subset of the full state space, which may hide different exploration strategies and optimal solutions. In this work we combine ideas from intrinsic motivation and transfer learning. Specifically, we focus on sharing parameters in actor-critic model architectures and on combining information obtained through intrinsic motivation with the aim of having a more efficient exploration and faster learning. We test our strategies through experiments performed over a modified ViZDooM's My Way Home scenario, which is more challenging than its original version and allows evaluating the heterogeneity between agents. Our results reveal different ways in which a collaborative framework with little additional computational cost can outperform an independent learning process without knowledge sharing. Additionally, we depict the need for modulating correctly the importance between the extrinsic and intrinsic rewards to avoid undesired agent behaviors.  ( 2 min )
    Machine Learning for Intrusion Detection in Industrial Control Systems: Applications, Challenges, and Recommendations. (arXiv:2202.11917v1 [cs.CR])
    Methods from machine learning are being applied to design Industrial Control Systems resilient to cyber-attacks. Such methods focus on two major areas: the detection of intrusions at the network-level using the information acquired through network packets, and detection of anomalies at the physical process level using data that represents the physical behavior of the system. This survey focuses on four types of methods from machine learning in use for intrusion and anomaly detection, namely, supervised, semi-supervised, unsupervised, and reinforcement learning. Literature available in the public domain was carefully selected, analyzed, and placed in a 7-dimensional space for ease of comparison. The survey is targeted at researchers, students, and practitioners. Challenges associated in using the methods and research gaps are identified and recommendations are made to fill the gaps.  ( 2 min )
    Self-Training: A Survey. (arXiv:2202.12040v1 [cs.LG])
    In recent years, semi-supervised algorithms have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have arguably received more attention in the last few years. These models are designed to search the decision boundary on low density regions without making extra assumptions on the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and train a new classifier in conjunction with the labeled training set. We present self-training methods for binary and multiclass classification and their variants which were recently developed using Neural Networks. Finally, we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.  ( 2 min )
    Embedded Ensembles: Infinite Width Limit and Operating Regimes. (arXiv:2202.12297v1 [stat.ML])
    A memory efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper we perform a systematic theoretical and empirical analysis of embedded ensembles with different number of models. Theoretically, we use a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics. In this limit, we identify two ensemble regimes - independent and collective - depending on the architecture and initialization strategy of ensemble models. We prove that in the independent regime the embedded ensemble behaves as an ensemble of independent models. We confirm our theoretical prediction with a wide range of experiments with finite networks, and further study empirically various effects such as transition between the two regimes, scaling of ensemble performance with the network width and number of models, and dependence of performance on a number of architecture and hyperparameter choices.  ( 2 min )
    Bounding Membership Inference. (arXiv:2202.12232v1 [cs.LG])
    Differential Privacy (DP) is the de facto standard for reasoning about the privacy guarantees of a training algorithm. Despite the empirical observation that DP reduces the vulnerability of models to existing membership inference (MI) attacks, a theoretical underpinning as to why this is the case is largely missing in the literature. In practice, this means that models need to be trained with DP guarantees that greatly decrease their accuracy. In this paper, we provide a tighter bound on the accuracy of any MI adversary when a training algorithm provides $\epsilon$-DP. Our bound informs the design of a novel privacy amplification scheme, where an effective training set is sub-sampled from a larger set prior to the beginning of training, to greatly reduce the bound on MI accuracy. As a result, our scheme enables $\epsilon$-DP users to employ looser DP guarantees when training their model to limit the success of any MI adversary; this ensures that the model's accuracy is less impacted by the privacy guarantee. Finally, we discuss implications of our MI bound on the field of machine unlearning.  ( 2 min )
    Activation Functions: Dive into an optimal activation function. (arXiv:2202.12065v1 [cs.LG])
    Activation functions have come up as one of the essential components of neural networks. The choice of adequate activation function can impact the accuracy of these methods. In this study, we experiment for finding an optimal activation function by defining it as a weighted sum of existing activation functions and then further optimizing these weights while training the network. The study uses three activation functions, ReLU, tanh, and sin, over three popular image datasets, MNIST, FashionMNIST, and KMNIST. We observe that the ReLU activation function can easily overlook other activation functions. Also, we see that initial layers prefer to have ReLU or LeakyReLU type of activation functions, but deeper layers tend to prefer more convergent activation functions.  ( 2 min )
    Optimal Learning Rates of Deep Convolutional Neural Networks: Additive Ridge Functions. (arXiv:2202.12119v1 [cs.LG])
    Convolutional neural networks have shown extraordinary abilities in many applications, especially those related to the classification tasks. However, for the regression problem, the abilities of convolutional structures have not been fully understood, and further investigation is needed. In this paper, we consider the mean squared error analysis for deep convolutional neural networks. We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal mini-max rates (up to a log factor). The convergence rates are dimension independent. This work shows the statistical optimality of convolutional neural networks and may shed light on why convolutional neural networks are able to behave well for high dimensional input.  ( 2 min )
    SQuadMDS: a lean Stochastic Quartet MDS improving global structure preservation in neighbor embedding like t-SNE and UMAP. (arXiv:2202.12087v1 [cs.LG])
    Multidimensional scaling is a statistical process that aims to embed high dimensional data into a lower-dimensional space; this process is often used for the purpose of data visualisation. Common multidimensional scaling algorithms tend to have high computational complexities, making them inapplicable on large data sets. This work introduces a stochastic, force directed approach to multidimensional scaling with a time and space complexity of O(N), with N data points. The method can be combined with force directed layouts of the family of neighbour embedding such as t-SNE, to produce embeddings that preserve both the global and the local structures of the data. Experiments assess the quality of the embeddings produced by the standalone version and its hybrid extension both quantitatively and qualitatively, showing competitive results outperforming state-of-the-art approaches. Codes are available at https://github.com/PierreLambert3/SQuaD-MDS-and-FItSNE-hybrid.  ( 2 min )
    Debugging Differential Privacy: A Case Study for Privacy Auditing. (arXiv:2202.12219v1 [cs.LG])
    Differential Privacy can provide provable privacy guarantees for training data in machine learning. However, the presence of proofs does not preclude the presence of errors. Inspired by recent advances in auditing which have been used for estimating lower bounds on differentially private algorithms, here we show that auditing can also be used to find flaws in (purportedly) differentially private schemes. In this case study, we audit a recent open source implementation of a differentially private deep learning algorithm and find, with 99.99999999% confidence, that the implementation does not satisfy the claimed differential privacy guarantee.  ( 2 min )
    EMOTHAW: A novel database for emotional state recognition from handwriting. (arXiv:2202.12245v1 [cs.CV])
    The detection of negative emotions through daily activities such as handwriting is useful for promoting well-being. The spread of human-machine interfaces such as tablets makes the collection of handwriting samples easier. In this context, we present a first publicly available handwriting database which relates emotional states to handwriting, that we call EMOTHAW. This database includes samples of 129 participants whose emotional states, namely anxiety, depression and stress, are assessed by the Depression Anxiety Stress Scales (DASS) questionnaire. Seven tasks are recorded through a digitizing tablet: pentagons and house drawing, words copied in handprint, circles and clock drawing, and one sentence copied in cursive writing. Records consist in pen positions, on-paper and in-air, time stamp, pressure, pen azimuth and altitude. We report our analysis on this database. From collected data, we first compute measurements related to timing and ductus. We compute separate measurements according to the position of the writing device: on paper or in-air. We analyse and classify this set of measurements (referred to as features) using a random forest approach. This latter is a machine learning method [2], based on an ensemble of decision trees, which includes a feature ranking process. We use this ranking process to identify the features which best reveal a targeted emotional state. We then build random forest classifiers associated to each emotional state. Our results, obtained from cross-validation experiments, show that the targeted emotional states can be identified with accuracies ranging from 60% to 71%.  ( 2 min )
    BERTVision -- A Parameter-Efficient Approach for Question Answering. (arXiv:2202.12210v1 [cs.CL])
    We present a highly parameter efficient approach for Question Answering that significantly reduces the need for extended BERT fine-tuning. Our method uses information from the hidden state activations of each BERT transformer layer, which is discarded during typical BERT inference. Our best model achieves maximal BERT performance at a fraction of the training time and GPU or TPU expense. Performance is further improved by ensembling our model with BERTs predictions. Furthermore, we find that near optimal performance can be achieved for QA span annotation using less training data. Our experiments show that this approach works well not only for span annotation, but also for classification, suggesting that it may be extensible to a wider range of tasks.  ( 2 min )
    A comparative study of in-air trajectories at short and long distances in online handwriting. (arXiv:2202.12237v1 [cs.CV])
    Introduction Existing literature about online handwriting analysis to support pathology diagnosis has taken advantage of in-air trajectories. A similar situation occurred in biometric security applications where the goal is to identify or verify an individual using his signature or handwriting. These studies do not consider the distance of the pen tip to the writing surface. This is due to the fact that current acquisition devices do not provide height formation. However, it is quite straightforward to differentiate movements at two different heights: a) short distance: height lower or equal to 1 cm above a surface of digitizer, the digitizer provides x and y coordinates. b) long distance: height exceeding 1 cm, the only information available is a time stamp that indicates the time that a specific stroke has spent at long distance. Although short distance has been used in several papers, long distances have been ignored and will be investigated in this paper. Methods In this paper, we will analyze a large set of databases (BIOSECURID, EMOTHAW, PaHaW, Oxygen-Therapy and SALT), which contain a total amount of 663 users and 17951 files. We have specifically studied: a) the percentage of time spent on-surface, in-air at short distance, and in-air at long distance for different user profiles (pathological and healthy users) and different tasks; b) The potential use of these signals to improve classification rates. Results and conclusions Our experimental results reveal that long-distance movements represent a very small portion of the total execution time (0.5 % in the case of signatures and 10.4% for uppercase words of BIOSECUR-ID, which is the largest database). In addition, significant differences have been found in the comparison of pathological versus control group for letter l in PaHaW database (p=0.0157) and crossed pentagons in SALT database (p=0.0122)  ( 3 min )
    Heterogeneous Graph based Deep Learning for Biomedical Network Link Prediction. (arXiv:2102.01649v4 [cs.SI] UPDATED)
    Multi-scale biomedical knowledge networks are expanding with emerging experimental technologies that generates multi-scale biomedical big data. Link prediction is increasingly used especially in bipartite biomedical networks to identify hidden biological interactions and relationshipts between key entities such as compounds, targets, gene and diseases. We propose a Graph Neural Networks (GNN) method, namely Graph Pair based Link Prediction model (GPLP), for predicting biomedical network links simply based on their topological interaction information. In GPLP, 1-hop subgraphs extracted from known network interaction matrix is learnt to predict missing links. To evaluate our method, three heterogeneous biomedical networks were used, i.e. Drug-Target Interaction network (DTI), Compound-Protein Interaction network (CPI) from NIH Tox21, and Compound-Virus Inhibition network (CVI). Our proposed GPLP method significantly outperforms over the state-of-the-art baselines. In addition, different network incompleteness is analysed with our devised protocol, and we also design an effective approach to improve the model robustness towards incomplete networks. Our method demonstrates the potential applications in other biomedical networks.  ( 2 min )
    On-line signature verification system with failure to enroll managing. (arXiv:2202.12242v1 [cs.CV])
    In this paper we simulate a real biometric verification system based on on-line signatures. For this purpose we have split the MCYT signature database in three subsets: one for classifier training, another for system adjustment and a third one for system testing simulating enrollment and verification. This context corresponds to a real operation, where a new user tries to enroll an existing system and must be automatically guided by the system in order to detect the failure to enroll situations. The main contribution of this work is the management of failure to enroll situations by means of a new proposal, called intelligent enrollment, which consists of consistency checking in order to automatically reject low quality samples. This strategy lets to enhance the verification errors up to 22% when leaving out 8% of the users. In this situation 8% of the people cannot be enrolled in the system and must be verified by other biometrics or by human abilities. These people are identified with intelligent enrollment and the situation can be thus managed. In addition we also propose a DCT-based feature extractor with threshold coding and discriminability criteria.  ( 2 min )
    Evolutionary Multi-Objective Reinforcement Learning Based Trajectory Control and Task Offloading in UAV-Assisted Mobile Edge Computing. (arXiv:2202.12028v1 [eess.SP])
    This paper studies the trajectory control and task offloading (TCTO) problem in an unmanned aerial vehicle (UAV)-assisted mobile edge computing system, where a UAV flies along a planned trajectory to collect computation tasks from smart devices (SDs). We consider a scenario that SDs are not directly connected by the base station (BS) and the UAV has two roles to play: MEC server or wireless relay. The UAV makes task offloading decisions online, in which the collected tasks can be executed locally on the UAV or offloaded to the BS for remote processing. The TCTO problem involves multi-objective optimization as its objectives are to minimize the task delay and the UAV's energy consumption, and maximize the number of tasks collected by the UAV, simultaneously. This problem is challenging because the three objectives conflict with each other. The existing reinforcement learning (RL) algorithms, either single-objective RLs or single-policy multi-objective RLs, cannot well address the problem since they cannot output multiple policies for various preferences (i.e. weights) across objectives in a single run. This paper adapts the evolutionary multi-objective RL (EMORL), a multi-policy multi-objective RL, to the TCTO problem. This algorithm can output multiple optimal policies in just one run, each optimizing a certain preference. The simulation results demonstrate that the proposed algorithm can obtain more excellent nondominated policies by striking a balance between the three objectives regarding policy quality, compared with two evolutionary and two multi-policy RL algorithms.  ( 2 min )
    Testing Deep Learning Models: A First Comparative Study of Multiple Testing Techniques. (arXiv:2202.12139v1 [cs.SE])
    Deep Learning (DL) has revolutionized the capabilities of vision-based systems (VBS) in critical applications such as autonomous driving, robotic surgery, critical infrastructure surveillance, air and maritime traffic control, etc. By analyzing images, voice, videos, or any type of complex signals, DL has considerably increased the situation awareness of these systems. At the same time, while relying more and more on trained DL models, the reliability and robustness of VBS have been challenged and it has become crucial to test thoroughly these models to assess their capabilities and potential errors. To discover faults in DL models, existing software testing methods have been adapted and refined accordingly. In this article, we provide an overview of these software testing methods, namely differential, metamorphic, mutation, and combinatorial testing, as well as adversarial perturbation testing and review some challenges in their deployment for boosting perception systems used in VBS. We also provide a first experimental comparative study on a classical benchmark used in VBS and discuss its results.  ( 2 min )
    Attention Enables Zero Approximation Error. (arXiv:2202.12166v1 [cs.LG])
    Deep learning models have been widely applied in various aspects of daily life. Many variant models based on deep learning structures have achieved even better performances. Attention-based architectures have become almost ubiquitous in deep learning structures. Especially, the transformer model has now defeated the convolutional neural network in image classification tasks to become the most widely used tool. However, the theoretical properties of attention-based models are seldom considered. In this work, we show that with suitable adaptations, the single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters is able to generate any desired polynomial of the input with no error. The number of transformer encoder blocks is the same as the degree of the target polynomial. Even more exciting, we find that these transformer encoder blocks in this model do not need to be trained. As a direct consequence, we show that the single-head self-attention transformer with increasing numbers of free parameters is universal. These surprising theoretical results clearly explain the outstanding performances of the transformer model and may shed light on future modifications in real applications. We also provide some experiments to verify our theoretical result.  ( 2 min )
    A Transformer-based Network for Deformable Medical Image Registration. (arXiv:2202.12104v1 [eess.IV])
    Deformable medical image registration plays an important role in clinical diagnosis and treatment. Recently, the deep learning (DL) based image registration methods have been widely investigated and showed excellent performance in computational speed. However, these methods cannot provide enough registration accuracy because of insufficient ability in representing both the global and local features of the moving and fixed images. To address this issue, this paper has proposed the transformer based image registration method. This method uses the distinctive transformer to extract the global and local image features for generating the deformation fields, based on which the registered image is produced in an unsupervised way. Our method can improve the registration accuracy effectively by means of self-attention mechanism and bi-level information flow. Experimental results on such brain MR image datasets as LPBA40 and OASIS-1 demonstrate that compared with several traditional and DL based registration methods, our method provides higher registration accuracy in terms of dice values.  ( 2 min )
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v1 [cs.LG])
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee.  ( 2 min )
    Towards Effective and Robust Neural Trojan Defenses via Input Filtering. (arXiv:2202.12154v1 [cs.CR])
    Trojan attacks on deep neural networks are both dangerous and surreptitious. Over the past few years, Trojan attacks have advanced from using only a simple trigger and targeting only one class to using many sophisticated triggers and targeting multiple classes. However, Trojan defenses have not caught up with this development. Most defense methods still make out-of-date assumptions about Trojan triggers and target classes, thus, can be easily circumvented by modern Trojan attacks. In this paper, we advocate general defenses that are effective and robust against various Trojan attacks and propose two novel "filtering" defenses with these characteristics called Variational Input Filtering (VIF) and Adversarial Input Filtering (AIF). VIF and AIF leverage variational inference and adversarial training respectively to purify all potential Trojan triggers in the input at run time without making any assumption about their numbers and forms. We further extend "filtering" to "filtering-then-contrasting" - a new defense mechanism that helps avoid the drop in classification accuracy on clean data caused by filtering. Extensive experimental results show that our proposed defenses significantly outperform 4 well-known defenses in mitigating 5 different Trojan attacks including the two state-of-the-art which defeat many strong defenses.  ( 2 min )
    Learning Stochastic Dynamics with Statistics-Informed Neural Network. (arXiv:2202.12278v1 [cs.LG])
    We introduce a machine-learning framework named statistics-informed neural network (SINN) for learning stochastic dynamics from data. This new architecture was theoretically inspired by a universal approximation theorem for stochastic systems introduced in this paper and the projection-operator formalism for stochastic modeling. We devise mechanisms for training the neural network model to reproduce the correct \emph{statistical} behavior of a target stochastic process. Numerical simulation results demonstrate that a well-trained SINN can reliably approximate both Markovian and non-Markovian stochastic dynamics. We demonstrate the applicability of SINN to model transition dynamics. Furthermore, we show that the obtained reduced-order model can be trained on temporally coarse-grained data and hence is well suited for rare-event simulations.  ( 2 min )
    Tighter Expected Generalization Error Bounds via Convexity of Information Measures. (arXiv:2202.12150v1 [cs.IT])
    Generalization error bounds are essential to understanding machine learning algorithms. This paper presents novel expected generalization error upper bounds based on the average joint distribution between the output hypothesis and each input training sample. Multiple generalization error upper bounds based on different information measures are provided, including Wasserstein distance, total variation distance, KL divergence, and Jensen-Shannon divergence. Due to the convexity of the information measures, the proposed bounds in terms of Wasserstein distance and total variation distance are shown to be tighter than their counterparts based on individual samples in the literature. An example is provided to demonstrate the tightness of the proposed generalization error bounds.  ( 2 min )
    Investigating the Use of One-Class Support Vector Machine for Software Defect Prediction. (arXiv:2202.12074v1 [cs.SE])
    Early software defect identification is considered an important step towards software quality assurance. Software defect prediction aims at identifying software components that are likely to cause faults before a software is made available to the end-user. To date, this task has been modeled as a two-class classification problem, however its nature also allows it to be formulated as a one-class classification task. Preliminary results obtained in prior work show that One-Class Support Vector Machine (OCSVM) can outperform two-class classifiers for defect prediction. If confirmed, these results would overcome the data imbalance problem researchers have for long attempted to tackle in this field. In this paper, we further investigate whether learning from one class only is sufficient to produce effective defect prediction models by conducting a thorough large-scale empirical study investigating 15 real-world software projects, three validation scenarios, eight classifiers, robust evaluation measures and statistical significance tests. The results reveal that OCSVM is more suitable for cross-version and cross-project, rather than for within-project defect prediction, thus suggesting it performs better with heterogeneous data. While, we cannot conclude that OCSVM is the best classifier (Random Forest performs best herein), our results show interesting findings that open up further research avenues for training accurate defect prediction classifiers when defective instances are scarce or unavailable.  ( 2 min )
    Temporal Convolution Domain Adaptation Learning for Crops Growth Prediction. (arXiv:2202.12120v1 [cs.LG])
    Existing Deep Neural Nets on crops growth prediction mostly rely on availability of a large amount of data. In practice, it is difficult to collect enough high-quality data to utilize the full potential of these deep learning models. In this paper, we construct an innovative network architecture based on domain adaptation learning to predict crops growth curves with limited available crop data. This network architecture overcomes the challenge of data availability by incorporating generated data from the developed crops simulation model. We are the first to use the temporal convolution filters as the backbone to construct a domain adaptation network architecture which is suitable for deep learning regression models with very limited training data of the target domain. We conduct experiments to test the performance of the network and compare our proposed architecture with other state-of-the-art methods, including a recent LSTM-based domain adaptation network architecture. The results show that the proposed temporal convolution-based network architecture outperforms all benchmarks not only in accuracy but also in model size and convergence rate.  ( 2 min )
    Is Neuro-Symbolic AI Meeting its Promise in Natural Language Processing? A Structured Review. (arXiv:2202.12205v1 [cs.AI])
    Advocates for Neuro-Symbolic AI (NeSy) assert that combining deep learning with symbolic reasoning will lead to stronger AI than either paradigm on its own. As successful as deep learning has been, it is generally accepted that even our best deep learning systems are not very good at abstract reasoning. And since reasoning is inextricably linked to language, it makes intuitive sense that Natural Language Processing (NLP), would be a particularly well-suited candidate for NeSy. We conduct a structured review of studies implementing NeSy for NLP, challenges and future directions, and aim to answer the question of whether NeSy is indeed meeting its promises: reasoning, out-of-distribution generalization, interpretability, learning and reasoning from small data, and transferability to new domains. We examine the impact of knowledge representation, such as rules and semantic networks, language structure and relational structure, and whether implicit or explicit reasoning contributes to higher promise scores. We find that knowledge encoded in relational structures and explicit reasoning tend to lead to more NeSy goals being satisfied. We also advocate for a more methodical approach to the application of theories of reasoning, which we hope can reduce some of the friction between the symbolic and sub-symbolic schools of AI.  ( 2 min )
    An optimal scheduled learning rate for a randomized Kaczmarz algorithm. (arXiv:2202.12224v1 [math.NA])
    We study how the learning rate affects the performance of a relaxed randomized Kaczmarz algorithm for solving $A x \approx b + \varepsilon$, where $A x =b$ is a consistent linear system and $\varepsilon$ has independent mean zero random entries. We derive a scheduled learning rate which optimizes a bound on the expected error that is sharp in certain cases; in contrast to the exponential convergence of the standard randomized Kaczmarz algorithm, our optimized bound involves the reciprocal of the Lambert-$W$ function of an exponential.  ( 2 min )
    Interfering Paths in Decision Trees: A Note on Deodata Predictors. (arXiv:2202.12064v1 [cs.LG])
    A technique for improving the prediction accuracy of decision trees is proposed. It consists in evaluating the tree's branches in parallel over multiple paths. The technique enables predictions that are more aligned with the ones generated by the nearest neighborhood variant of the deodata algorithms. The technique also enables the hybridization of the decision tree algorithm with the nearest neighborhood variant.  ( 2 min )
    Quantum Deep Reinforcement Learning for Robot Navigation Tasks. (arXiv:2202.12180v1 [cs.RO])
    In this work, we utilize Quantum Deep Reinforcement Learning as method to learn navigation tasks for a simple, wheeled robot in three simulated environments of increasing complexity. We show similar performance of a parameterized quantum circuit trained with well established deep reinforcement learning techniques in a hybrid quantum-classical setup compared to a classical baseline. To our knowledge this is the first demonstration of quantum machine learning (QML) for robotic behaviors. Thus, we establish robotics as a viable field of study for QML algorithms and henceforth quantum computing and quantum machine learning as potential techniques for future advancements in autonomous robotics. Beyond that, we discuss current limitations of the presented approach as well as future research directions in the field of quantum machine learning for autonomous robots.  ( 2 min )
    Closing the Gap between Single-User and Multi-User VoiceFilter-Lite. (arXiv:2202.12169v1 [eess.AS])
    VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.  ( 2 min )
    Dynamic Defense Against Byzantine Poisoning Attacks in Federated Learning. (arXiv:2007.15030v2 [cs.LG] UPDATED)
    Federated learning, as a distributed learning that conducts the training on the local devices without accessing to the training data, is vulnerable to Byzatine poisoning adversarial attacks. We argue that the federated learning model has to avoid those kind of adversarial attacks through filtering out the adversarial clients by means of the federated aggregation operator. We propose a dynamic federated aggregation operator that dynamically discards those adversarial clients and allows to prevent the corruption of the global learning model. We assess it as a defense against adversarial attacks deploying a deep learning classification model in a federated learning setting on the Fed-EMNIST Digits, Fashion MNIST and CIFAR-10 image datasets. The results show that the dynamic selection of the clients to aggregate enhances the performance of the global learning model and discards the adversarial and poor (with low quality models) clients.  ( 2 min )
    Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech. (arXiv:2202.12163v1 [eess.AS])
    In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism to allow the model to carry information in long-form audio via a recurrent form, such that the inference can be performed in a streaming fashion. Additionally, a simple domain adaptation mechanism is introduced to allow adapting an existing language identification model to a new domain where the prior language distribution is different. We perform a comparative study of different model topologies under different constraints of model size, and find that conformer-base models outperform LSTM and transformer based models. Our experiments also show that attentive temporal pooling and domain adaptation significantly improve the model accuracy.  ( 2 min )
    Exploring the Unfairness of DP-SGD Across Settings. (arXiv:2202.12058v1 [cs.LG])
    End users and regulators require private and fair artificial intelligence models, but previous work suggests these objectives may be at odds. We use the CivilComments to evaluate the impact of applying the {\em de facto} standard approach to privacy, DP-SGD, across several fairness metrics. We evaluate three implementations of DP-SGD: for dimensionality reduction (PCA), linear classification (logistic regression), and robust deep learning (Group-DRO). We establish a negative, logarithmic correlation between privacy and fairness in the case of linear classification and robust deep learning. DP-SGD had no significant impact on fairness for PCA, but upon inspection, also did not seem to lead to private representations.  ( 2 min )
    All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL. (arXiv:2202.11960v1 [cs.LG])
    Upside down reinforcement learning (UDRL) flips the conventional use of the return in the objective function in RL upside down, by taking returns as input and predicting actions. UDRL is based purely on supervised learning, and bypasses some prominent issues in RL: bootstrapping, off-policy corrections, and discount factors. While previous work with UDRL demonstrated it in a traditional online RL setting, here we show that this single algorithm can also work in the imitation learning and offline RL settings, be extended to the goal-conditioned RL setting, and even the meta-RL setting. With a general agent architecture, a single UDRL agent can learn across all paradigms.  ( 2 min )
    Explanatory Paradigms in Neural Networks. (arXiv:2202.11838v1 [cs.LG])
    In this article, we present a leap-forward expansion to the study of explainability in neural networks by considering explanations as answers to abstract reasoning-based questions. With $P$ as the prediction from a neural network, these questions are `Why P?', `What if not P?', and `Why P, rather than Q?' for a given contrast prediction $Q$. The answers to these questions are observed correlations, observed counterfactuals, and observed contrastive explanations respectively. Together, these explanations constitute the abductive reasoning scheme. We term the three explanatory schemes as observed explanatory paradigms. The term observed refers to the specific case of post-hoc explainability, when an explanatory technique explains the decision $P$ after a trained neural network has made the decision $P$. The primary advantage of viewing explanations through the lens of abductive reasoning-based questions is that explanations can be used as reasons while making decisions. The post-hoc field of explainability, that previously only justified decisions, becomes active by being involved in the decision making process and providing limited, but relevant and contextual interventions. The contributions of this article are: ($i$) realizing explanations as reasoning paradigms, ($ii$) providing a probabilistic definition of observed explanations and their completeness, ($iii$) creating a taxonomy for evaluation of explanations, and ($iv$) positioning gradient-based complete explanainability's replicability and reproducibility across multiple applications and data modalities, ($v$) code repositories, publicly available at https://github.com/olivesgatech/Explanatory-Paradigms.  ( 2 min )
    "Is not the truth the truth?": Analyzing the Impact of User Validations for Bus In/Out Detection in Smartphone-based Surveys. (arXiv:2202.11961v1 [cs.HC])
    Passenger flow allows the study of users' behavior through the public network and assists in designing new facilities and services. This flow is observed through interactions between passengers and infrastructure. For this task, Bluetooth technology and smartphones represent the ideal solution. The latter component allows users' identification, authentication, and billing, while the former allows short-range implicit interactions, device-to-device. To assess the potential of such a use case, we need to verify how robust Bluetooth signal and related machine learning (ML) classifiers are against the noise of realistic contexts. Therefore, we model binary passenger states with respect to a public vehicle, where one can either be-in or be-out (BIBO). The BIBO label identifies a fundamental building block of continuously-valued passenger flow. This paper describes the Human-Computer interaction experimental setting in a semi-controlled environment, which involves: two autonomous vehicles operating on two routes, serving three bus stops and eighteen users, as well as a proprietary smartphone-Bluetooth sensing platform. The resulting dataset includes multiple sensors' measurements of the same event and two ground-truth levels, the first being validation by participants, the second by three video-cameras surveilling buses and track. We performed a Monte-Carlo simulation of labels-flip to emulate human errors in the labeling process, as is known to happen in smartphone surveys; next we used such flipped labels for supervised training of ML classifiers. The impact of errors on model performance bias can be large. Results show ML tolerance to label flips caused by human or machine errors up to 30%.  ( 2 min )
    A fair pricing model via adversarial learning. (arXiv:2202.12008v1 [stat.ML])
    At the core of insurance business lies classification between risky and non-risky insureds, actuarial fairness meaning that risky insureds should contribute more and pay a higher premium than non-risky or less-risky ones. Actuaries, therefore, use econometric or machine learning techniques to classify, but the distinction between a fair actuarial classification and "discrimination" is subtle. For this reason, there is a growing interest about fairness and discrimination in the actuarial community Lindholm, Richman, Tsanakas, and Wuthrich (2022). Presumably, non-sensitive characteristics can serve as substitutes or proxies for protected attributes. For example, the color and model of a car, combined with the driver's occupation, may lead to an undesirable gender bias in the prediction of car insurance prices. Surprisingly, we will show that debiasing the predictor alone may be insufficient to maintain adequate accuracy (1). Indeed, the traditional pricing model is currently built in a two-stage structure that considers many potentially biased components such as car or geographic risks. We will show that this traditional structure has significant limitations in achieving fairness. For this reason, we have developed a novel pricing model approach. Recently some approaches have Blier-Wong, Cossette, Lamontagne, and Marceau (2021); Wuthrich and Merz (2021) shown the value of autoencoders in pricing. In this paper, we will show that (2) this can be generalized to multiple pricing factors (geographic, car type), (3) it perfectly adapted for a fairness context (since it allows to debias the set of pricing components): We extend this main idea to a general framework in which a single whole pricing model is trained by generating the geographic and car pricing components needed to predict the pure premium while mitigating the unwanted bias according to the desired metric.  ( 2 min )
    Can deep neural networks learn process model structure? An assessment framework and analysis. (arXiv:2202.11985v1 [cs.LG])
    Predictive process monitoring concerns itself with the prediction of ongoing cases in (business) processes. Prediction tasks typically focus on remaining time, outcome, next event or full case suffix prediction. Various methods using machine and deep learning havebeen proposed for these tasks in recent years. Especially recurrent neural networks (RNNs) such as long short-term memory nets (LSTMs) have gained in popularity. However, no research focuses on whether such neural network-based models can truly learn the structure of underlying process models. For instance, can such neural networks effectively learn parallel behaviour or loops? Therefore, in this work, we propose an evaluation scheme complemented with new fitness, precision, and generalisation metrics, specifically tailored towards measuring the capacity of deep learning models to learn process model structure. We apply this framework to several process models with simple control-flow behaviour, on the task of next-event prediction. Our results show that, even for such simplistic models, careful tuning of overfitting countermeasures is required to allow these models to learn process model structure.  ( 2 min )
    Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge. (arXiv:2202.12031v1 [cs.CV])
    Polyps are well-known cancer precursors identified by colonoscopy. However, variability in their size, location, and surface largely affect identification, localisation, and characterisation. Moreover, colonoscopic surveillance and removal of polyps (referred to as polypectomy ) are highly operator-dependent procedures. There exist a high missed detection rate and incomplete removal of colonic polyps due to their variable nature, the difficulties to delineate the abnormality, the high recurrence rates, and the anatomical topography of the colon. There have been several developments in realising automated methods for both detection and segmentation of these polyps using machine learning. However, the major drawback in most of these methods is their ability to generalise to out-of-sample unseen datasets that come from different centres, modalities and acquisition systems. To test this hypothesis rigorously we curated a multi-centre and multi-population dataset acquired from multiple colonoscopy systems and challenged teams comprising machine learning experts to develop robust automated detection and segmentation methods as part of our crowd-sourcing Endoscopic computer vision challenge (EndoCV) 2021. In this paper, we analyse the detection results of the four top (among seven) teams and the segmentation results of the five top teams (among 16). Our analyses demonstrate that the top-ranking teams concentrated on accuracy (i.e., accuracy > 80% on overall Dice score on different validation sets) over real-time performance required for clinical applicability. We further dissect the methods and provide an experiment-based hypothesis that reveals the need for improved generalisability to tackle diversity present in multi-centre datasets.  ( 3 min )
    DC and SA: Robust and Efficient Hyperparameter Optimization of Multi-subnetwork Deep Learning Models. (arXiv:2202.11841v1 [cs.LG])
    We present two novel hyperparameter optimization strategies for optimization of deep learning models with a modular architecture constructed of multiple subnetworks. As complex networks with multiple subnetworks become more frequently applied in machine learning, hyperparameter optimization methods are required to efficiently optimize their hyperparameters. Existing hyperparameter searches are general, and can be used to optimize such networks, however, by exploiting the multi-subnetwork architecture, these searches can be sped up substantially. The proposed methods offer faster convergence to a better-performing final model. To demonstrate this, we propose 2 independent approaches to enhance these prior algorithms: 1) a divide-and-conquer approach, in which the best subnetworks of top-performing models are combined, allowing for more rapid sampling of the hyperparameter search space. 2) A subnetwork adaptive approach that distributes computational resources based on the importance of each subnetwork, allowing more intelligent resource allocation. These approaches can be flexibily applied to many hyperparameter optimization algorithms. To illustrate this, we combine our approaches with the commonly-used Bayesian optimization method. Our approaches are then tested against both synthetic examples and real-world examples and applied to multiple network types including convolutional neural networks and dense feed forward neural networks. Our approaches show an increased optimization efficiency of up to 23.62x, and a final performance boost of up to 3.5% accuracy for classification and 4.4 MSE for regression, when compared to comparable BO approach.  ( 2 min )
    A Rigorous Study of Integrated Gradients Method and Extensions to Internal Neuron Attributions. (arXiv:2202.11912v1 [cs.LG])
    As the efficacy of deep learning (DL) grows, so do concerns about the lack of transparency of these black-box models. Attribution methods aim to improve transparency of DL models by quantifying an input feature's importance to a model's prediction. The method of Integrated gradients (IG) sets itself apart by claiming other methods failed to satisfy desirable axioms, while IG and methods like it uniquely satisfied said axioms. This paper comments on fundamental aspects of IG and its applications/extensions: 1) We identify key unaddressed differences between DL-attribution function spaces and the supporting literature's function spaces which problematize previous claims of IG uniqueness. We show that with the introduction of an additional axiom, $\textit{non-decreasing positivity}$, the uniqueness claim can be established. 2) We address the question of input sensitivity by identifying function spaces where the IG is/is not Lipschitz continuous in the attributed input. 3) We show how axioms for single-baseline methods in IG impart analogous properties for methods where the baseline is a probability distribution over the input sample space. 4) We introduce a means of decomposing the IG map with respect to a layer of internal neurons while simultaneously gaining internal-neuron attributions. Finally, we present experimental results validating the decomposition and internal neuron attributions.  ( 2 min )
    Predicting the impact of treatments over time with uncertainty aware neural differential equations. (arXiv:2202.11987v1 [cs.LG])
    Predicting the impact of treatments from observational data only still represents a majorchallenge despite recent significant advances in time series modeling. Treatment assignments are usually correlated with the predictors of the response, resulting in a lack of data support for counterfactual predictions and therefore in poor quality estimates. Developments in causal inference have lead to methods addressing this confounding by requiring a minimum level of overlap. However,overlap is difficult to assess and usually notsatisfied in practice. In this work, we propose Counterfactual ODE (CF-ODE), a novel method to predict the impact of treatments continuously over time using Neural Ordinary Differential Equations equipped with uncertainty estimates. This allows to specifically assess which treatment outcomes can be reliably predicted. We demonstrate over several longitudinal data sets that CF-ODE provides more accurate predictions and more reliable uncertainty estimates than previously available methods.  ( 2 min )
    On Learning Mixture Models with Sparse Parameters. (arXiv:2202.11940v1 [cs.LG])
    Mixture models are widely used to fit complex and multimodal datasets. In this paper we study mixtures with high dimensional sparse latent parameter vectors and consider the problem of support recovery of those vectors. While parameter learning in mixture models is well-studied, the sparsity constraint remains relatively unexplored. Sparsity of parameter vectors is a natural constraint in variety of settings, and support recovery is a major step towards parameter estimation. We provide efficient algorithms for support recovery that have a logarithmic sample complexity dependence on the dimensionality of the latent space. Our algorithms are quite general, namely they are applicable to 1) mixtures of many different canonical distributions including Uniform, Poisson, Laplace, Gaussians, etc. 2) Mixtures of linear regressions and linear classifiers with Gaussian covariates under different assumptions on the unknown parameters. In most of these settings, our results are the first guarantees on the problem while in the rest, our results provide improvements on existing works.  ( 2 min )
    Drawing Inductor Layout with a Reinforcement Learning Agent: Method and Application for VCO Inductors. (arXiv:2202.11798v1 [cs.AI])
    Design of Voltage-Controlled Oscillator (VCO) inductors is a laborious and time-consuming task that is conventionally done manually by human experts. In this paper, we propose a framework for automating the design of VCO inductors, using Reinforcement Learning (RL). We formulate the problem as a sequential procedure, where wire segments are drawn one after another, until a complete inductor is created. We then employ an RL agent to learn to draw inductors that meet certain target specifications. In light of the need to tweak the target specifications throughout the circuit design cycle, we also develop a variant in which the agent can learn to quickly adapt to draw new inductors for moderately different target specifications. Our empirical results show that the proposed framework is successful at automatically generating VCO inductors that meet or exceed the target specification.  ( 2 min )
    A Fair Empirical Risk Minimization with Generalized Entropy. (arXiv:2202.11966v1 [cs.LG])
    Recently a parametric family of fairness metrics to quantify algorithmic fairness has been proposed based on generalized entropy which have been originally used in economics and public welfare. Since these metrics have several advantages such as quantifying unfairness at the individual-level and group-level, and unfold trade-off between the individual fairness and group-level fairness, algorithmic fairness requirement may be given in terms of generalized entropy for a fair classification problem. We consider a fair empirical risk minimization with a fairness constraint specified by generalized entropy. We theoretically investigate if the fair empirical fair classification problem is learnable and how to find an approximate optimal classifier of it.  ( 2 min )
    Rare Gems: Finding Lottery Tickets at Initialization. (arXiv:2202.12002v1 [cs.LG])
    It has been widely observed that large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by typically following a time-consuming "train, prune, re-train" approach. Frankle & Carbin (2018) conjecture that we can avoid this by training lottery tickets, i.e., special sparse subnetworks found at initialization, that can be trained to high accuracy. However, a subsequent line of work presents concrete evidence that current algorithms for finding trainable networks at initialization, fail simple baseline comparisons, e.g., against training random sparse subnetworks. Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem. In this work, we partially resolve this open problem by discovering rare gems: subnetworks at initialization that attain considerable accuracy, even before training. Refining these rare gems - "by means of fine-tuning" - beats current baselines and leads to accuracy competitive or better than magnitude pruning methods.  ( 2 min )
    AutoCl : A Visual Interactive System for Automatic Deep Learning Classifier Recommendation Based on Models Performance. (arXiv:2202.11928v1 [cs.LG])
    Nowadays, deep learning (DL) models being increasingly applied to various fields, people without technical expertise and domain knowledge struggle to find an appropriate model for their task. In this paper, we introduce AutoCl a visual interactive recommender system aimed at helping non-experts to adopt an appropriate DL classifier. Our system enables users to compare the performance and behavior of multiple classifiers trained with various hyperparameter setups as well as automatically recommends a best classifier with appropriate hyperparameter. We compare features of AutoCl against several recent AutoML systems and show that it helps non-experts better in choosing DL classifier. Finally, we demonstrate use cases for image classification using publicly available dataset to show the capability of our system.  ( 2 min )
    Physics-informed neural networks for inverse problems in supersonic flows. (arXiv:2202.11821v1 [math.NA])
    Accurate solutions to inverse supersonic compressible flow problems are often required for designing specialized aerospace vehicles. In particular, we consider the problem where we have data available for density gradients from Schlieren photography as well as data at the inflow and part of wall boundaries. These inverse problems are notoriously difficult and traditional methods may not be adequate to solve such ill-posed inverse problems. To this end, we employ the physics-informed neural networks (PINNs) and its extended version, extended PINNs (XPINNs), where domain decomposition allows deploying locally powerful neural networks in each subdomain, which can provide additional expressivity in subdomains, where a complex solution is expected. Apart from the governing compressible Euler equations, we also enforce the entropy conditions in order to obtain viscosity solutions. Moreover, we enforce positivity conditions on density and pressure. We consider inverse problems involving two-dimensional expansion waves, two-dimensional oblique and bow shock waves. We compare solutions obtained by PINNs and XPINNs and invoke some theoretical results that can be used to decide on the generalization errors of the two methods.  ( 2 min )
    Learning Multi-Object Dynamics with Compositional Neural Radiance Fields. (arXiv:2202.11855v1 [cs.CV])
    We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks. A central question in learning dynamic models from sensor observations is on which representations predictions should be performed. NeRFs have become a popular choice for representing scenes due to their strong 3D prior. However, most NeRF approaches are trained on a single scene, representing the whole scene with a global model, making generalization to novel scenes, containing different numbers of objects, challenging. Instead, we present a compositional, object-centric auto-encoder framework that maps multiple views of the scene to a \emph{set} of latent vectors representing each object separately. The latent vectors parameterize individual NeRF models from which the scene can be reconstructed and rendered from novel viewpoints. We train a graph neural network dynamics model in the latent space to achieve compositionality for dynamics prediction. A key feature of our approach is that the learned 3D information of the scene through the NeRF model enables us to incorporate structural priors in learning the dynamics models, making long-term predictions more stable. The model can further be used to synthesize new scenes from individual object observations. For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient. In the experiments, we show that the model outperforms several baselines on a pushing task containing many objects. Video: https://dannydriess.github.io/compnerfdyn/  ( 2 min )
    Threading the Needle of On and Off-Manifold Value Functions for Shapley Explanations. (arXiv:2202.11919v1 [cs.LG])
    A popular explainable AI (XAI) approach to quantify feature importance of a given model is via Shapley values. These Shapley values arose in cooperative games, and hence a critical ingredient to compute these in an XAI context is a so-called value function, that computes the "value" of a subset of features, and which connects machine learning models to cooperative games. There are many possible choices for such value functions, which broadly fall into two categories: on-manifold and off-manifold value functions, which take an observational and an interventional viewpoint respectively. Both these classes however have their respective flaws, where on-manifold value functions violate key axiomatic properties and are computationally expensive, while off-manifold value functions pay less heed to the data manifold and evaluate the model on regions for which it wasn't trained. Thus, there is no consensus on which class of value functions to use. In this paper, we show that in addition to these existing issues, both classes of value functions are prone to adversarial manipulations on low density regions. We formalize the desiderata of value functions that respect both the model and the data manifold in a set of axioms and are robust to perturbation on off-manifold regions, and show that there exists a unique value function that satisfies these axioms, which we term the Joint Baseline value function, and the resulting Shapley value the Joint Baseline Shapley (JBshap), and validate the effectiveness of JBshap in experiments.  ( 2 min )
    Validating an SVM-based neonatal seizure detection algorithm for generalizability, non-inferiority and clinical efficacy. (arXiv:2202.12023v1 [cs.LG])
    Neonatal seizure detection algorithms (SDA) are approaching the benchmark of human expert annotation. Measures of algorithm generalizability and non-inferiority as well as measures of clinical efficacy are needed to assess the full scope of neonatal SDA performance. We validated our neonatal SDA on an independent data set of 28 neonates. Generalizability was tested by comparing the performance of the original training set (cross-validation) to its performance on the validation set. Non-inferiority was tested by assessing inter-observer agreement between combinations of SDA and two human expert annotations. Clinical efficacy was tested by comparing how the SDA and human experts quantified seizure burden and identified clinically significant periods of seizure activity in the EEG. Algorithm performance was consistent between training and validation sets with no significant worsening in AUC (p>0.05, n =28). SDA output was inferior to the annotation of the human expert, however, re-training with an increased diversity of data resulted in non-inferior performance ($\Delta\kappa$=0.077, 95% CI: -0.002-0.232, n=18). The SDA assessment of seizure burden had an accuracy ranging from 89-93%, and 87% for identifying periods of clinical interest. The proposed SDA is approaching human equivalence and provides a clinically relevant interpretation of the EEG.  ( 2 min )
    A general framework for adaptive two-index fusion attribute weighted naive Bayes. (arXiv:2202.11963v1 [cs.LG])
    Naive Bayes(NB) is one of the essential algorithms in data mining. However, it is rarely used in reality because of the attribute independent assumption. Researchers have proposed many improved NB methods to alleviate this assumption. Among these methods, due to high efficiency and easy implementation, the filter attribute weighted NB methods receive great attentions. However, there still exists several challenges, such as the poor representation ability for single index and the fusion problem of two indexes. To overcome above challenges, we propose a general framework for Adaptive Two-index Fusion attribute weighted NB(ATFNB). Two types of data description category are used to represent the correlation between classes and attributes, intercorrelation between attributes and attributes, respectively. ATFNB can select any one index from each category. Then, we introduce a switching factor \{beta} to fuse two indexes, which can adaptively adjust the optimal ratio of the two index on various datasets. And a quick algorithm is proposed to infer the optimal interval of switching factor \{beta}. Finally, the weight of each attribute is calculated using the optimal value \{beta} and is integrated into NB classifier to improve the accuracy. The experimental results on 50 benchmark datasets and a Flavia dataset show that ATFNB outperforms the basic NB and state-of-the-art filter weighted NB models. In addition, the ATFNB framework can improve the existing two-index NB model by introducing the adaptive switching factor \{beta}. Auxiliary experimental results demonstrate the improved model significantly increases the accuracy compared to the original model without the adaptive switching factor \{beta}.  ( 2 min )
    XAutoML: A Visual Analytics Tool for Establishing Trust in Automated Machine Learning. (arXiv:2202.11954v1 [cs.LG])
    In the last ten years, various automated machine learning (AutoML) systems have been proposed to build end-to-end machine learning (ML) pipelines with minimal human interaction. Even though such automatically synthesized ML pipelines are able to achieve a competitive performance, recent studies have shown that users do not trust models constructed by AutoML due to missing transparency of AutoML systems and missing explanations for the constructed ML pipelines. In a requirements analysis study with 26 domain experts, data scientists, and AutoML researchers from different professions with vastly different expertise in ML, we collect detailed informational needs to establish trust in AutoML. We propose XAutoML, an interactive visual analytics tool for explaining arbitrary AutoML optimization procedures and ML pipelines constructed by AutoML. XAutoML combines interactive visualizations with established techniques from explainable artificial intelligence (XAI) to make the complete AutoML procedure transparent and explainable. By integrating XAutoML with JupyterLab, experienced users can extend the visual analytics with ad-hoc visualizations based on information extracted from XAutoML. We validate our approach in a user study with the same diverse user group from the requirements analysis. All participants were able to extract useful information from XAutoML, leading to a significantly increased trust in ML pipelines produced by AutoML and the AutoML optimization itself.  ( 2 min )
    Sky Computing: Accelerating Geo-distributed Computing in Federated Learning. (arXiv:2202.11836v1 [cs.LG])
    Federated learning is proposed by Google to safeguard data privacy through training models locally on users' devices. However, with deep learning models growing in size to achieve better results, it becomes increasingly difficult to accommodate the whole model on one single device. Thus, model parallelism is then used to divide the model weights among several devices. With this logic, the approach currently used evenly allocates weights among devices. However, in reality, a computation bottleneck may occur resulting from variant computing power of different users' devices. To address this problem, load balancing is needed to allocate the model weights based on the computational capability of the device. In this paper, we proposed Sky Computing, a load-balanced model parallelism framework to adaptively allocate the weights to devices. Sky Computing outperforms the baseline method by 55% in training time when training 160-layer BERT with 64 nodes. The source code can be found at https://github.com/hpcaitech/SkyComputing.  ( 2 min )
    Large Scale Passenger Detection with Smartphone/Bus Implicit Interaction and Multisensory Unsupervised Cause-effect Learning. (arXiv:2202.11962v1 [cs.LG])
    Intelligent Transportation Systems (ITS) underpin the concept of Mobility as a Service (MaaS), which requires universal and seamless users' access across multiple public and private transportation systems while allowing operators' proportional revenue sharing. Current user sensing technologies such as Walk-in/Walk-out (WIWO) and Check-in/Check-out (CICO) have limited scalability for large-scale deployments. These limitations prevent ITS from supporting analysis, optimization, calculation of revenue sharing, and control of MaaS comfort, safety, and efficiency. We focus on the concept of implicit Be-in/Be-out (BIBO) smartphone-sensing and classification. To close the gap and enhance smartphones towards MaaS, we developed a proprietary smartphone-sensing platform collecting contemporary Bluetooth Low Energy (BLE) signals from BLE devices installed on buses and Global Positioning System (GPS) locations of both buses and smartphones. To enable the training of a model based on GPS features against the BLE pseudo-label, we propose the Cause-Effect Multitask Wasserstein Autoencoder (CEMWA). CEMWA combines and extends several frameworks around Wasserstein autoencoders and neural networks. As a dimensionality reduction tool, CEMWA obtains an auto-validated representation of a latent space describing users' smartphones within the transport system. This representation allows BIBO clustering via DBSCAN. We perform an ablation study of CEMWA's alternative architectures and benchmark against the best available supervised methods. We analyze performance's sensitivity to label quality. Under the na\"ive assumption of accurate ground truth, XGBoost outperforms CEMWA. Although XGBoost and Random Forest prove to be tolerant to label noise, CEMWA is agnostic to label noise by design and provides the best performance with an 88\% F1 score.  ( 2 min )
    Fine-grained TLS Services Classification with Reject Option. (arXiv:2202.11984v1 [cs.LG])
    The recent success and proliferation of machine learning and deep learning have provided powerful tools, which are also utilized for encrypted traffic analysis, classification, and threat detection. These methods, neural networks in particular, are often complex and require a huge corpus of training data. Therefore, this paper focuses on collecting a large up-to-date dataset with almost 200 fine-grained service labels and 140 million network flows extended with packet-level metadata. The number of flows is three orders of magnitude higher than in other existing public labeled datasets of encrypted traffic. The number of service labels, which is important to make the problem hard and realistic, is four times higher than in the public dataset with the most class labels. The published dataset is intended as a benchmark for identifying services in encrypted traffic. Service identification can be further extended with the task of "rejecting" unknown services, i.e., the traffic not seen during the training phase. Neural networks offer superior performance for tackling this more challenging problem. To showcase the dataset's usefulness, we implemented a neural network with a multi-modal architecture, which is the state-of-the-art approach, and achieved 97.04% classification accuracy and detected 91.94% of unknown services with 5% false positive rate.  ( 2 min )
    A Unified Framework for Campaign Performance Forecasting in Online Display Advertising. (arXiv:2202.11877v1 [cs.LG])
    Advertisers usually enjoy the flexibility to choose criteria like target audience, geographic area and bid price when planning an campaign for online display advertising, while they lack forecast information on campaign performance to optimize delivery strategies in advance, resulting in a waste of labour and budget for feedback adjustments. In this paper, we aim to forecast key performance indicators for new campaigns given any certain criteria. Interpretable and accurate results could enable advertisers to manage and optimize their campaign criteria. There are several challenges for this very task. First, platforms usually offer advertisers various criteria when they plan an advertising campaign, it is difficult to estimate campaign performance unifiedly because of the great difference among bidding types. Furthermore, complex strategies applied in bidding system bring great fluctuation on campaign performance, making estimation accuracy an extremely tough problem. To address above challenges, we propose a novel Campaign Performance Forecasting framework, which firstly reproduces campaign performance on historical logs under various bidding types with a unified replay algorithm, in which essential auction processes like match and rank are replayed, ensuring the interpretability on forecast results. Then, we innovatively introduce a multi-task learning method to calibrate the deviation of estimation brought by hard-to-reproduce bidding strategies in replay. The method captures mixture calibration patterns among related forecast indicators to map the estimated results to the true ones, improving both accuracy and efficiency significantly. Experiment results on a dataset from Taobao.com demonstrate that the proposed framework significantly outperforms other baselines by a large margin, and an online A/B test verifies its effectiveness in the real world.  ( 2 min )
    Learning to Merge Tokens in Vision Transformers. (arXiv:2202.12015v1 [cs.CV])
    Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In order for large-scale models to remain practical in real-world systems, there is a need for reducing their computational overhead. In this work, we present the PatchMerger, a simple module that reduces the number of patches or tokens the network has to process by merging them between two consecutive intermediate layers. We show that the PatchMerger achieves a significant speedup across various model sizes while matching the original performance both upstream and downstream after fine-tuning.  ( 2 min )
    Counterfactual Explanations for Predictive Business Process Monitoring. (arXiv:2202.12018v1 [cs.AI])
    Predictive business process monitoring increasingly leverages sophisticated prediction models. Although sophisticated models achieve consistently higher prediction accuracy than simple models, one major drawback is their lack of interpretability, which limits their adoption in practice. We thus see growing interest in explainable predictive business process monitoring, which aims to increase the interpretability of prediction models. Existing solutions focus on giving factual explanations.While factual explanations can be helpful, humans typically do not ask why a particular prediction was made, but rather why it was made instead of another prediction, i.e., humans are interested in counterfactual explanations. While research in explainable AI produced several promising techniques to generate counterfactual explanations, directly applying them to predictive process monitoring may deliver unrealistic explanations, because they ignore the underlying process constraints. We propose LORELEY, a counterfactual explanation technique for predictive process monitoring, which extends LORE, a recent explainable AI technique. We impose control flow constraints to the explanation generation process to ensure realistic counterfactual explanations. Moreover, we extend LORE to enable explaining multi-class classification models. Experimental results using a real, public dataset indicate that LORELEY can approximate the prediction models with an average fidelity of 97.69\% and generate realistic counterfactual explanations.  ( 2 min )
    Attainability and Optimality: The Equalized Odds Fairness Revisited. (arXiv:2202.11853v1 [cs.LG])
    Fairness of machine learning algorithms has been of increasing interest. In order to suppress or eliminate discrimination in prediction, various notions as well as approaches have been proposed to impose fairness. Given a notion of fairness, an essential problem is then whether or not it can always be attained, even if with an unlimited amount of data. This issue is, however, not well addressed yet. In this paper, focusing on the Equalized Odds notion of fairness, we consider the attainability of this criterion and, furthermore, if it is attainable, the optimality of the prediction performance under various settings. In particular, for prediction performed by a deterministic function of input features, we give conditions under which Equalized Odds can hold true; if the stochastic prediction is acceptable, we show that under mild assumptions, fair predictors can always be derived. For classification, we further prove that compared to enforcing fairness by post-processing, one can always benefit from exploiting all available features during training and get potentially better prediction performance while remaining fair. Moreover, while stochastic prediction can attain Equalized Odds with theoretical guarantees, we also discuss its limitation and potential negative social impacts.  ( 2 min )
    No-Regret Learning in Games is Turing Complete. (arXiv:2202.11871v1 [cs.GT])
    Games are natural models for multi-agent machine learning settings, such as generative adversarial networks (GANs). The desirable outcomes from algorithmic interactions in these games are encoded as game theoretic equilibrium concepts, e.g. Nash and coarse correlated equilibria. As directly computing an equilibrium is typically impractical, one often aims to design learning algorithms that iteratively converge to equilibria. A growing body of negative results casts doubt on this goal, from non-convergence to chaotic and even arbitrary behaviour. In this paper we add a strong negative result to this list: learning in games is Turing complete. Specifically, we prove Turing completeness of the replicator dynamic on matrix games, one of the simplest possible settings. Our results imply the undecicability of reachability problems for learning algorithms in games, a special case of which is determining equilibrium convergence.  ( 2 min )
    Auto-scaling Vision Transformers without Training. (arXiv:2202.11921v1 [cs.LG])
    This work targets automated designing and scaling of Vision Transformers (ViTs). The motivation comes from two pain spots: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViT that is much heavier than its convolution counterpart. To tackle these issues, we propose As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. Specifically, we first design a "seed" ViT topology by leveraging a training-free search process. This extremely fast search is fulfilled by a comprehensive study of ViT's network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the "seed" topology, we automate the scaling rule for ViTs by growing widths/depths to different ViT layers. This results in a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, we propose a progressive tokenization strategy to train ViTs faster and cheaper. As a unified framework, As-ViT achieves strong performance on classification (83.5% top1 on ImageNet-1k) and detection (52.7% mAP on COCO) without any manual crafting nor scaling of ViT architectures: the end-to-end model design and scaling process cost only 12 hours on one V100 GPU. Our code is available at https://github.com/VITA-Group/AsViT.  ( 2 min )
    A Note on Machine Learning Approach for Computational Imaging. (arXiv:2202.11883v1 [eess.IV])
    Computational imaging has been playing a vital role in the development of natural sciences. Advances in sensory, information, and computer technologies have further extended the scope of influence of imaging, making digital images an essential component of our daily lives. For the past three decades, we have witnessed phenomenal developments of mathematical and machine learning methods in computational imaging. In this note, we will review some of the recent developments of the machine learning approach for computational imaging and discuss its differences and relations to the mathematical approach. We will demonstrate how we may combine the wisdom from both approaches, discuss the merits and potentials of such a combination and present some of the new computational and theoretical challenges it brings about.  ( 2 min )
    First is Better Than Last for Training Data Influence. (arXiv:2202.11844v1 [cs.LG])
    The ability to identify influential training examples enables us to debug training data and explain model behavior. Existing techniques are based on the flow of influence through the model parameters. For large models in NLP applications, it is often computationally infeasible to study this flow through all model parameters, therefore techniques usually pick the last layer of weights. Our first observation is that for classification problems, the last layer is reductive and does not encode sufficient input level information. Deleting influential examples, according to this measure, typically does not change the model's behavior much. We propose a technique called TracIn-WE that modifies a method called TracIn to operate on the word embedding layer instead of the last layer. This could potentially have the opposite concern, that the word embedding layer does not encode sufficient high level information. However, we find that gradients (unlike embeddings) do not suffer from this, possibly because they chain through higher layers. We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer by 4-10 times on the case deletion evaluation on three language classification tasks. In addition, TracIn-WE can produce scores not just at the training data level, but at the word training data level, a further aid in debugging.  ( 2 min )
    Differentially Private Speaker Anonymization. (arXiv:2202.11823v1 [cs.SD])
    Sharing real-world speech utterances is key to the training and deployment of voice-based services. However, it also raises privacy risks as speech contains a wealth of personal data. Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact. State-of-the-art techniques operate by disentangling the speaker information (represented via a speaker embedding) from these attributes and re-synthesizing speech based on the speaker embedding of another speaker. Prior research in the privacy community has shown that anonymization often provides brittle privacy protection, even less so any provable guarantee. In this work, we show that disentanglement is indeed not perfect: linguistic and prosodic attributes still contain speaker information. We remove speaker information from these attributes by introducing differentially private feature extractors based on an autoencoder and an automatic speech recognizer, respectively, trained using noise layers. We plug these extractors in the state-of-the-art anonymization pipeline and generate, for the first time, differentially private utterances with a provable upper bound on the speaker information they contain. We evaluate empirically the privacy and utility resulting from our differentially private speaker anonymization approach on the LibriSpeech data set. Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage the full knowledge of the anonymization process to try to infer the speaker identity.  ( 2 min )
    Consistent Dropout for Policy Gradient Reinforcement Learning. (arXiv:2202.11818v1 [cs.LG])
    Dropout has long been a staple of supervised learning, but is rarely used in reinforcement learning. We analyze why naive application of dropout is problematic for policy-gradient learning algorithms and introduce consistent dropout, a simple technique to address this instability. We demonstrate consistent dropout enables stable training with A2C and PPO in both continuous and discrete action environments across a wide range of dropout probabilities. Finally, we show that consistent dropout enables the online training of complex architectures such as GPT without needing to disable the model's native dropout.  ( 2 min )
    Robust Probabilistic Time Series Forecasting. (arXiv:2202.11910v1 [cs.LG])
    Probabilistic time series forecasting has played critical role in decision-making processes due to its capability to quantify uncertainties. Deep forecasting models, however, could be prone to input perturbations, and the notion of such perturbations, together with that of robustness, has not even been completely established in the regime of probabilistic forecasting. In this work, we propose a framework for robust probabilistic time series forecasting. First, we generalize the concept of adversarial input perturbations, based on which we formulate the concept of robustness in terms of bounded Wasserstein deviation. Then we extend the randomized smoothing technique to attain robust probabilistic forecasters with theoretical robustness certificates against certain classes of adversarial perturbations. Lastly, extensive experiments demonstrate that our methods are empirically effective in enhancing the forecast quality under additive adversarial attacks and forecast consistency under supplement of noisy observations.  ( 2 min )
    Analysis of Coronavirus Envelope Protein with Cellular Automata (CA) Model. (arXiv:2202.11752v1 [q-bio.BM])
    The reason of significantly higher transmissibility of SARS Covid (2019 CoV-2) compared to SARS Covid (2003 CoV) and MERS Covid (2012 MERS) can be attributed to mutations reported in structural proteins, and the role played by non-structural proteins (nsps) and accessory proteins (ORFs) for viral replication, assembly, and shedding. Envelope protein E is one of the four structural proteins of minimum length. Recent studies have confirmed critical role played by the envelope protein in the viral life cycle including assembly of virion exported from infected cell for its transmission. However, the determinants of the highly complex viral - host interactions of envelope protein, particularly with host Golgi complex, have not been adequately characterized. CoV-2 and CoV Envelope proteins of length 75 and 76 amino acids differ in four amino acid locations. The additional amino acid Gly (G) at location 70 makes CoV length 76. The amino acid pair EG at location 69-70 of CoV in place of amino acid R in location 69 of CoV-2, has been identified as a major determining factor in the current investigation. This paper concentrates on the design of computational model to compare the structure/function of wild and mutants of CoV-2 with wild and mutants of CoV in the functionally important region of the protein chain pair. We hypothesize that differences of CAML model parameter of CoV-2 and CoV characterize the deviation in structure and function of envelope proteins in respect of interaction of virus with host Golgi complex; and this difference gets reflected in the difference of their transmissibility. The hypothesis has been validated from single point mutational study on- (i) human HBB beta-globin hemoglobin protein associated with sickle cell anemia, (ii) mutants of envelope protein of Covid-2 infected patients reported in recent publications.  ( 3 min )
    Controlling Memorability of Face Images. (arXiv:2202.11896v1 [cs.CV])
    Everyday, we are bombarded with many photographs of faces, whether on social media, television, or smartphones. From an evolutionary perspective, faces are intended to be remembered, mainly due to survival and personal relevance. However, all these faces do not have the equal opportunity to stick in our minds. It has been shown that memorability is an intrinsic feature of an image but yet, it is largely unknown what attributes make an image more memorable. In this work, we aimed to address this question by proposing a fast approach to modify and control the memorability of face images. In our proposed method, we first found a hyperplane in the latent space of StyleGAN to separate high and low memorable images. We then modified the image memorability (while maintaining the identity and other facial features such as age, emotion, etc.) by moving in the positive or negative direction of this hyperplane normal vector. We further analyzed how different layers of the StyleGAN augmented latent space contribute to face memorability. These analyses showed how each individual face attribute makes an image more or less memorable. Most importantly, we evaluated our proposed method for both real and synthesized face images. The proposed method successfully modifies and controls the memorability of real human faces as well as unreal synthesized faces. Our proposed method can be employed in photograph editing applications for social media, learning aids, or advertisement purposes.  ( 2 min )
    The Need for Interpretable Features: Motivation and Taxonomy. (arXiv:2202.11748v1 [cs.LG])
    Through extensive experience developing and explaining machine learning (ML) applications for real-world domains, we have learned that ML models are only as interpretable as their features. Even simple, highly interpretable model types such as regression models can be difficult or impossible to understand if they use uninterpretable features. Different users, especially those using ML models for decision-making in their domains, may require different levels and types of feature interpretability. Furthermore, based on our experiences, we claim that the term "interpretable feature" is not specific nor detailed enough to capture the full extent to which features impact the usefulness of ML explanations. In this paper, we motivate and discuss three key lessons: 1) more attention should be given to what we refer to as the interpretable feature space, or the state of features that are useful to domain experts taking real-world actions, 2) a formal taxonomy is needed of the feature properties that may be required by these domain experts (we propose a partial taxonomy in this paper), and 3) transforms that take data from the model-ready state to an interpretable form are just as essential as traditional ML transforms that prepare features for the model.  ( 2 min )
    Nowcasting the Financial Time Series with Streaming Data Analytics under Apache Spark. (arXiv:2202.11820v1 [cs.LG])
    This paper proposes nowcasting of high-frequency financial datasets in real-time with a 5-minute interval using the streaming analytics feature of Apache Spark. The proposed 2 stage method consists of modelling chaos in the first stage and then using a sliding window approach for training with machine learning algorithms namely Lasso Regression, Ridge Regression, Generalised Linear Model, Gradient Boosting Tree and Random Forest available in the MLLib of Apache Spark in the second stage. For testing the effectiveness of the proposed methodology, 3 different datasets, of which two are stock markets namely National Stock Exchange & Bombay Stock Exchange, and finally One Bitcoin-INR conversion dataset. For evaluating the proposed methodology, we used metrics such as Symmetric Mean Absolute Percentage Error, Directional Symmetry, and Theil U Coefficient. We tested the significance of each pair of models using the Diebold Mariano (DM) test.  ( 2 min )
    Adversarially-regularized mixed effects deep learning (ARMED) models for improved interpretability, performance, and generalization on clustered data. (arXiv:2202.11783v1 [cs.LG])
    Data in the natural sciences frequently violate assumptions of independence. Such datasets have samples with inherent clustering (e.g. by study site, subject, experimental batch), which may lead to spurious associations, poor model fitting, and confounded analyses. While largely unaddressed in deep learning, mixed effects models have been used in traditional statistics for clustered data. Mixed effects models separate cluster-invariant, population-level fixed effects from cluster-specific random effects. We propose a general-purpose framework for building Adversarially-Regularized Mixed Effects Deep learning (ARMED) models through 3 non-intrusive additions to existing networks: 1) a domain adversarial classifier constraining the original model to learn only cluster-invariant features, 2) a random effects subnetwork capturing cluster-specific features, and 3) a cluster-inferencing approach to predict on clusters unseen during training. We apply this framework to dense feedforward neural networks (DFNNs), convolutional neural networks, and autoencoders on 4 applications including simulations, dementia prognosis and diagnosis, and cell microscopy. We compare to conventional models, domain adversarial-only models, and the naive inclusion of cluster membership as a covariate. Our models better distinguish confounded from true associations in simulations and emphasize more biologically plausible features in clinical applications. ARMED DFNNs quantify inter-cluster variance in clinical data while ARMED autoencoders visualize batch effects in cell images. Finally, ARMED improves accuracy on data from clusters seen during training (up to 28% vs. conventional models) and generalizes better to unseen clusters (up to 9% vs. conventional models). By incorporating powerful mixed effects modeling into deep learning, ARMED increases performance, interpretability, and generalization on clustered data.  ( 2 min )
    When do GANs replicate? On the choice of dataset size. (arXiv:2202.11765v1 [cs.LG])
    Do GANs replicate training images? Previous studies have shown that GANs do not seem to replicate training data without significant change in the training procedure. This leads to a series of research on the exact condition needed for GANs to overfit to the training data. Although a number of factors has been theoretically or empirically identified, the effect of dataset size and complexity on GANs replication is still unknown. With empirical evidence from BigGAN and StyleGAN2, on datasets CelebA, Flower and LSUN-bedroom, we show that dataset size and its complexity play an important role in GANs replication and perceptual quality of the generated images. We further quantify this relationship, discovering that replication percentage decays exponentially with respect to dataset size and complexity, with a shared decaying factor across GAN-dataset combinations. Meanwhile, the perceptual image quality follows a U-shape trend w.r.t dataset size. This finding leads to a practical tool for one-shot estimation on minimal dataset size to prevent GAN replication which can be used to guide datasets construction and selection.  ( 2 min )
    Robust Federated Learning with Connectivity Failures: A Semi-Decentralized Framework with Collaborative Relaying. (arXiv:2202.11850v1 [cs.DC])
    Intermittent client connectivity is one of the major challenges in centralized federated edge learning frameworks. Intermittently failing uplinks to the central parameter server (PS) can induce a large generalization gap in performance especially when the data distribution among the clients exhibits heterogeneity. In this work, to mitigate communication blockages between clients and the central PS, we introduce the concept of knowledge relaying wherein the successfully participating clients collaborate in relaying their neighbors' local updates to a central parameter server (PS) in order to boost the participation of clients with intermittently failing connectivity. We propose a collaborative relaying based semi-decentralized federated edge learning framework where at every communication round each client first computes a local consensus of the updates from its neighboring clients and eventually transmits a weighted average of its own update and those of its neighbors to the PS. We appropriately optimize these averaging weights to reduce the variance of the global update at the PS while ensuring that the global update is unbiased, consequently improving the convergence rate. Finally, by conducting experiments on CIFAR-10 dataset we validate our theoretical results and demonstrate that our proposed scheme is superior to Federated averaging benchmark especially when data distribution among clients is non-iid.  ( 2 min )
    A modification of the conjugate direction method for motion estimation. (arXiv:2202.11831v1 [cs.CV])
    A comparative study of different block matching alternatives for motion estimation is presented. The study is focused on computational burden and objective measures on the accuracy of prediction. Together with existing algorithms several new variations have been tested. An interesting modification of the conjugate direction method previously related in literature is reported. This new algorithm shows a good trade-off between computational complexity and accuracy of motion vector estimation. Computational complexity is evaluated using a sequence of artificial images designed to incorporate a great variety of motion vectors. The performance of block matching methods has been measured in terms of the entropy in the error signal between the motion compensated and the original frames.  ( 2 min )
    Benefit of Interpolation in Nearest Neighbor Algorithms. (arXiv:2202.11817v1 [stat.ML])
    In some studies \citep[e.g.,][]{zhang2016understanding} of deep learning, it is observed that over-parametrized deep neural networks achieve a small testing error even when the training error is almost zero. Despite numerous works towards understanding this so-called "double descent" phenomenon \citep[e.g.,][]{belkin2018reconciling,belkin2019two}, in this paper, we turn into another way to enforce zero training error (without over-parametrization) through a data interpolation mechanism. Specifically, we consider a class of interpolated weighting schemes in the nearest neighbors (NN) algorithms. By carefully characterizing the multiplicative constant in the statistical risk, we reveal a U-shaped performance curve for the level of data interpolation in both classification and regression setups. This sharpens the existing result \citep{belkin2018does} that zero training error does not necessarily jeopardize predictive performances and claims a counter-intuitive result that a mild degree of data interpolation actually {\em strictly} improve the prediction performance and statistical stability over those of the (un-interpolated) $k$-NN algorithm. In the end, the universality of our results, such as change of distance measure and corrupted testing data, will also be discussed.  ( 2 min )
    Exploiting Correlation to Achieve Faster Learning Rates in Low-Rank Preference Bandits. (arXiv:2202.11795v1 [cs.LG])
    We introduce the \emph{Correlated Preference Bandits} problem with random utility-based choice models (RUMs), where the goal is to identify the best item from a given pool of $n$ items through online subsetwise preference feedback. We investigate whether models with a simple correlation structure, e.g. low rank, can result in faster learning rates. While we show that the problem can be impossible to solve for the general `low rank' choice models, faster learning rates can be attained assuming more structured item correlations. In particular, we introduce a new class of \emph{Block-Rank} based RUM model, where the best item is shown to be $(\epsilon,\delta)$-PAC learnable with only $O(r \epsilon^{-2} \log(n/\delta))$ samples. This improves on the standard sample complexity bound of $\tilde{O}(n\epsilon^{-2} \log(1/\delta))$ known for the usual learning algorithms which might not exploit the item-correlations ($r \ll n$). We complement the above sample complexity with a matching lower bound (up to logarithmic factors), justifying the tightness of our analysis. Surprisingly, we also show a lower bound of $\Omega(n\epsilon^{-2}\log(1/\delta))$ when the learner is forced to play just duels instead of larger subsetwise queries. Further, we extend the results to a more general `\emph{noisy Block-Rank}' model, which ensures robustness of our techniques. Overall, our results justify the advantage of playing subsetwise queries over pairwise preferences $(k=2)$, we show the latter provably fails to exploit correlation.  ( 2 min )
    Generative modeling via tensor train sketching. (arXiv:2202.11788v1 [math.NA])
    In this paper we introduce a sketching algorithm for constructing a tensor train representation of a probability density from its samples. Our method deviates from the standard recursive SVD-based procedure for constructing a tensor train. Instead we formulate and solve a sequence of small linear systems for the individual tensor train cores. This approach can avoid the curse of dimensionality that threatens both the algorithmic and sample complexities of the recovery problem. Specifically, for Markov models, we prove that the tensor cores can be recovered with a sample complexity that is constant with respect to the dimension. Finally, we illustrate the performance of the method with several numerical experiments.  ( 2 min )
    Flow-based sampling in the lattice Schwinger model at criticality. (arXiv:2202.11712v1 [hep-lat])
    Recent results suggest that flow-based algorithms may provide efficient sampling of field distributions for lattice field theory applications, such as studies of quantum chromodynamics and the Schwinger model. In this work, we provide a numerical demonstration of robust flow-based sampling in the Schwinger model at the critical value of the fermion mass. In contrast, at the same parameters, conventional methods fail to sample all parts of configuration space, leading to severely underestimated uncertainties.  ( 2 min )
    An Efficient Binary Harris Hawks Optimization based on Quantum SVM for Cancer Classification Tasks. (arXiv:2202.11899v1 [cs.LG])
    Cancer classification based on gene expression increases early diagnosis and recovery, but high-dimensional genes with a small number of samples are a major challenge. This work introduces a new hybrid quantum kernel support vector machine (QKSVM) combined with a Binary Harris hawk optimization (BHHO) based gene selection for cancer classification on a quantum simulator. This study aims to improve the microarray cancer prediction performance with the quantum kernel estimation based on the informative genes by BHHO. The feature selection is a critical step in large-dimensional features, and BHHO is used to select important features. The BHHO mimics the behavior of the cooperative action of Harris hawks in nature. The principal component analysis (PCA) is applied to reduce the selected genes to match the qubit numbers. After which, the quantum computer is used to estimate the kernel with the training data of the reduced genes and generate the quantum kernel matrix. Moreover, the classical computer is used to draw the support vectors based on the quantum kernel matrix. Also, the prediction stage is performed with the classical device. Finally, the proposed approach is applied to colon and breast microarray datasets and evaluated with all genes and the selected genes by BHHO. The proposed approach is found to enhance the overall performance with two datasets. Also, the proposed approach is evaluated with different quantum feature maps (kernels) and classical kernel (RBF).  ( 2 min )
    NeuroView-RNN: It's About Time. (arXiv:2202.11811v1 [cs.LG])
    Recurrent Neural Networks (RNNs) are important tools for processing sequential data such as time-series or video. Interpretability is defined as the ability to be understood by a person and is different from explainability, which is the ability to be explained in a mathematical formulation. A key interpretability issue with RNNs is that it is not clear how each hidden state per time step contributes to the decision-making process in a quantitative manner. We propose NeuroView-RNN as a family of new RNN architectures that explains how all the time steps are used for the decision-making process. Each member of the family is derived from a standard RNN architecture by concatenation of the hidden steps into a global linear classifier. The global linear classifier has all the hidden states as the input, so the weights of the classifier have a linear mapping to the hidden states. Hence, from the weights, NeuroView-RNN can quantify how important each time step is to a particular decision. As a bonus, NeuroView-RNN also offers higher accuracy in many cases compared to the RNNs and their variants. We showcase the benefits of NeuroView-RNN by evaluating on a multitude of diverse time-series datasets.  ( 2 min )
    Using Deep Learning to Detect Digitally Encoded DNA Trigger for Trojan Malware in Bio-Cyber Attacks. (arXiv:2202.11824v1 [cs.CR])
    This article uses Deep Learning technologies to safeguard DNA sequencing against Bio-Cyber attacks. We consider a hybrid attack scenario where the payload is encoded into a DNA sequence to activate a Trojan malware implanted in a software tool used in the sequencing pipeline in order to allow the perpetrators to gain control over the resources used in that pipeline during sequence analysis. The scenario considered in the paper is based on perpetrators submitting synthetically engineered DNA samples that contain digitally encoded IP address and port number of the perpetrators machine in the DNA. Genetic analysis of the samples DNA will decode the address that is used by the software trojan malware to activate and trigger a remote connection. This approach can open up to multiple perpetrators to create connections to hijack the DNA sequencing pipeline. As a way of hiding the data, the perpetrators can avoid detection by encoding the address to maximise similarity with genuine DNAs, which we showed previously. However, in this paper we show how Deep Learning can be used to successfully detect and identify the trigger encoded data, in order to protect a DNA sequencing pipeline from trojan attacks. The result shows nearly up to 100% accuracy in detection in such a novel Trojan attack scenario even after applying fragmentation encryption and steganography on the encoded trigger data. In addition, feasibility of designing and synthesizing encoded DNA for such Trojan payloads is validated by a wet lab experiment.  ( 2 min )
    Investigating the effect of binning on causal discovery. (arXiv:2202.11789v1 [cs.LG])
    Binning (a.k.a. discretization) of numerically continuous measurements is a wide-spread but controversial practice in data collection, analysis, and presentation. The consequences of binning have been evaluated for many different kinds of data analysis methods, however so far the effect of binning on causal discovery algorithms has not been directly investigated. This paper reports the results of a simulation study that examined the effect of binning on the Greedy Equivalence Search (GES) causal discovery algorithm. Our findings suggest that unbinned continuous data often result in the highest search performance, but some exceptions are identified. We also found that binned data are more sensitive to changes in sample size and tuning parameters, and identified some interactive effects between sample size, binning, and tuning parameter on performance.  ( 2 min )
    Single Image Super-Resolution Methods: A Survey. (arXiv:2202.11763v1 [eess.IV])
    Super-resolution (SR), the process of obtaining high-resolution images from one or more low-resolution observations of the same scene, has been a very popular topic of research in the last few decades in both signal processing and image processing areas. Due to the recent developments in Convolutional Neural Networks, the popularity of SR algorithms has skyrocketed as the barrier of entry has been lowered significantly. Recently, this popularity has spread into video processing areas to the lengths of developing SR models that work in real-time. In this paper, we compare different SR models that specialize in single image processing and will take a glance at how they evolved to take on many different objectives and shapes over the years.  ( 2 min )
    Loss as the Inconsistency of a Probabilistic Dependency Graph: Choose Your Model, Not Your Loss Function. (arXiv:2202.11862v1 [cs.LG])
    In a world blessed with a great diversity of loss functions, we argue that that choice between them is not a matter of taste or pragmatics, but of model. Probabilistic depencency graphs (PDGs) are probabilistic models that come equipped with a measure of "inconsistency". We prove that many standard loss functions arise as the inconsistency of a natural PDG describing the appropriate scenario, and use the same approach to justify a well-known connection between regularizers and priors. We also show that the PDG inconsistency captures a large class of statistical divergences, and detail benefits of thinking of them in this way, including an intuitive visual language for deriving inequalities between them. In variational inference, we find that the ELBO, a somewhat opaque objective for latent variable models, and variants of it arise for free out of uncontroversial modeling assumptions -- as do simple graphical proofs of their corresponding bounds. Finally, we observe that inconsistency becomes the log partition function (free energy) in the setting where PDGs are factor graphs.  ( 2 min )
    Truncated LinUCB for Stochastic Linear Bandits. (arXiv:2202.11735v1 [stat.ML])
    This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors, and the expected rewards are linear in both the arm parameters and contexts. The LinUCB algorithm, which is near minimax optimal for related linear bandits, is shown to have a cumulative regret that is suboptimal in both the dimension $d$ and time horizon $T$, due to its over-exploration. A truncated version of LinUCB is proposed and termed "Tr-LinUCB", which follows LinUCB up to a truncation time $S$ and performs pure exploitation afterwards. The Tr-LinUCB algorithm is shown to achieve $O(d\log(T))$ regret if $S = Cd\log(T)$ for a sufficiently large constant $C$, and a matching lower bound is established, which shows the rate optimality of Tr-LinUCB in both $d$ and $T$ under a low dimensional regime. Further, if $S = d\log^{\kappa}(T)$ for some $\kappa>1$, the loss compared to the optimal is a multiplicative $\log\log(T)$ factor, which does not depend on $d$. This insensitivity to overshooting in choosing the truncation time of Tr-LinUCB is of practical importance.  ( 2 min )
    Training Characteristic Functions with Reinforcement Learning: XAI-methods play Connect Four. (arXiv:2202.11797v1 [cs.LG])
    One of the goals of Explainable AI (XAI) is to determine which input components were relevant for a classifier decision. This is commonly know as saliency attribution. Characteristic functions (from cooperative game theory) are able to evaluate partial inputs and form the basis for theoretically "fair" attribution methods like Shapley values. Given only a standard classifier function, it is unclear how partial input should be realised. Instead, most XAI-methods for black-box classifiers like neural networks consider counterfactual inputs that generally lie off-manifold. This makes them hard to evaluate and easy to manipulate. We propose a setup to directly train characteristic functions in the form of neural networks to play simple two-player games. We apply this to the game of Connect Four by randomly hiding colour information from our agents during training. This has three advantages for comparing XAI-methods: It alleviates the ambiguity about how to realise partial input, makes off-manifold evaluation unnecessary and allows us to compare the methods by letting them play against each other.  ( 2 min )
    Discovering Multiple and Diverse Directions for Cognitive Image Properties. (arXiv:2202.11772v1 [cs.CV])
    Recent research has shown that it is possible to find interpretable directions in the latent spaces of pre-trained GANs. These directions enable controllable generation and support a variety of semantic editing operations. While previous work has focused on discovering a single direction that performs a desired editing operation such as zoom-in, limited work has been done on the discovery of multiple and diverse directions that can achieve the desired edit. In this work, we propose a novel framework that discovers multiple and diverse directions for a given property of interest. In particular, we focus on the manipulation of cognitive properties such as Memorability, Emotional Valence and Aesthetics. We show with extensive experiments that our method successfully manipulates these properties while producing diverse outputs. Our project page and source code can be found at this http URL  ( 2 min )
    ML-based Anomaly Detection in Optical Fiber Monitoring. (arXiv:2202.11756v1 [cs.CR])
    Secure and reliable data communication in optical networks is critical for high-speed internet. We propose a data driven approach for the anomaly detection and faults identification in optical networks to diagnose physical attacks such as fiber breaks and optical tapping. The proposed methods include an autoencoder-based anomaly detection and an attention-based bidirectional gated recurrent unit algorithm for the fiber fault identification and localization. We verify the efficiency of our methods by experiments under various attack scenarios using real operational data.  ( 2 min )
    Are All Linear Regions Created Equal?. (arXiv:2202.11749v1 [cs.LG])
    The number of linear regions has been studied as a proxy of complexity for ReLU networks. However, the empirical success of network compression techniques like pruning and knowledge distillation, suggest that in the overparameterized setting, linear regions density might fail to capture the effective nonlinearity. In this work, we propose an efficient algorithm for discovering linear regions and use it to investigate the effectiveness of density in capturing the nonlinearity of trained VGGs and ResNets on CIFAR-10 and CIFAR-100. We contrast the results with a more principled nonlinearity measure based on function variation, highlighting the shortcomings of linear regions density. Furthermore, interestingly, our measure of nonlinearity clearly correlates with model-wise deep double descent, connecting reduced test error with reduced nonlinearity, and increased local similarity of linear regions.  ( 2 min )
    Using Bayesian Deep Learning to infer Planet Mass from Gaps in Protoplanetary Disks. (arXiv:2202.11730v1 [astro-ph.EP])
    Planet induced sub-structures, like annular gaps, observed in dust emission from protoplanetary disks provide a unique probe to characterize unseen young planets. While deep learning based model has an edge in characterizing the planet's properties over traditional methods, like customized simulations and empirical relations, it lacks in its ability to quantify the uncertainty associated with its predictions. In this paper, we introduce a Bayesian deep learning network "DPNNet-Bayesian" that can predict planet mass from disk gaps and provides uncertainties associated with the prediction. A unique feature of our approach is that it can distinguish between the uncertainty associated with the deep learning architecture and uncertainty inherent in the input data due to measurement noise. The model is trained on a data set generated from disk-planet simulations using the \textsc{fargo3d} hydrodynamics code with a newly implemented fixed grain size module and improved initial conditions. The Bayesian framework enables estimating a gauge/confidence interval over the validity of the prediction when applied to unknown observations. As a proof-of-concept, we apply DPNNet-Bayesian to dust gaps observed in HL Tau. The network predicts masses of $ 86.0 \pm 5.5 M_{\Earth} $, $ 43.8 \pm 3.3 M_{\Earth} $, and $ 92.2 \pm 5.1 M_{\Earth} $ respectively, which are comparable to other studies based on specialized simulations.  ( 2 min )
    Completely Quantum Neural Networks. (arXiv:2202.11727v1 [quant-ph])
    Artificial neural networks are at the heart of modern deep learning algorithms. We describe how to embed and train a general neural network in a quantum annealer without introducing any classical element in training. To implement the network on a state-of-the-art quantum annealer, we develop three crucial ingredients: binary encoding the free parameters of the network, polynomial approximation of the activation function, and reduction of binary higher-order polynomials into quadratic ones. Together, these ideas allow encoding the loss function as an Ising model Hamiltonian. The quantum annealer then trains the network by finding the ground state. We implement this for an elementary network and illustrate the advantages of quantum training: its consistency in finding the global minimum of the loss function and the fact that the network training converges in a single annealing step, which leads to short training times while maintaining a high classification performance. Our approach opens a novel avenue for the quantum training of general machine learning models.  ( 2 min )
    Prune and Tune Ensembles: Low-Cost Ensemble Learning With Sparse Independent Subnetworks. (arXiv:2202.11782v1 [cs.LG])
    Ensemble Learning is an effective method for improving generalization in machine learning. However, as state-of-the-art neural networks grow larger, the computational cost associated with training several independent networks becomes expensive. We introduce a fast, low-cost method for creating diverse ensembles of neural networks without needing to train multiple models from scratch. We do this by first training a single parent network. We then create child networks by cloning the parent and dramatically pruning the parameters of each child to create an ensemble of members with unique and diverse topologies. We then briefly train each child network for a small number of epochs, which now converge significantly faster when compared to training from scratch. We explore various ways to maximize diversity in the child networks, including the use of anti-random pruning and one-cycle tuning. This diversity enables "Prune and Tune" ensembles to achieve results that are competitive with traditional ensembles at a fraction of the training cost. We benchmark our approach against state of the art low-cost ensemble methods and display marked improvement in both accuracy and uncertainty estimation on CIFAR-10 and CIFAR-100.  ( 2 min )
  • Open

    Sobolev Norm Learning Rates for Conditional Mean Embeddings. (arXiv:2105.07446v3 [stat.ML] UPDATED)
    We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces (RKHS). We derive explicit, adaptive convergence rates for the sample estimator under the misspecifed setting, where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs. We demonstrate that in certain parameter regimes, we can achieve uniform convergence rates in the output RKHS. We hope our analyses will allow the much broader application of conditional mean embeddings to more complex ML/RL settings involving infinite dimensional RKHSs and continuous state spaces.  ( 2 min )
    Entropic Gromov-Wasserstein between Gaussian Distributions. (arXiv:2108.10961v3 [math.ST] UPDATED)
    We study the entropic Gromov-Wasserstein and its unbalanced version between (unbalanced) Gaussian distributions with different dimensions. When the metric is the inner product, which we refer to as inner product Gromov-Wasserstein (IGW), we demonstrate that the optimal transportation plans of entropic IGW and its unbalanced variant are (unbalanced) Gaussian distributions. Via an application of von Neumann's trace inequality, we obtain closed-form expressions for the entropic IGW between these Gaussian distributions. Finally, we consider an entropic inner product Gromov-Wasserstein barycenter of multiple Gaussian distributions. We prove that the barycenter is a Gaussian distribution when the entropic regularization parameter is small. We further derive a closed-form expression for the covariance matrix of the barycenter.  ( 2 min )
    Conditionally Gaussian PAC-Bayes. (arXiv:2110.11886v2 [cs.LG] UPDATED)
    Recent studies have empirically investigated different methods to train stochastic neural networks on a classification task by optimising a PAC-Bayesian bound via stochastic gradient descent. Most of these procedures need to replace the misclassification error with a surrogate loss, leading to a mismatch between the optimisation objective and the actual generalisation bound. The present paper proposes a novel training algorithm that optimises the PAC-Bayesian bound, without relying on any surrogate loss. Empirical results show that this approach outperforms currently available PAC-Bayesian training methods.  ( 2 min )
    MLDemon: Deployment Monitoring for Machine Learning Systems. (arXiv:2104.13621v5 [cs.LG] UPDATED)
    Post-deployment monitoring of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. Here we propose a novel approach, MLDemon, for ML DEployment MONitoring. MLDemon integrates both unlabeled data and a small amount of on-demand labels to produce a real-time estimate of the ML model's current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, expert supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon outperforms existing approaches. Moreover, we provide theoretical analysis to show that MLDemon is minimax rate optimal for a broad class of distribution drifts.  ( 2 min )
    Disentanglement via Mechanism Sparsity Regularization: A New Principle for Nonlinear ICA. (arXiv:2107.10098v3 [stat.ML] UPDATED)
    This work introduces a novel principle we call disentanglement via mechanism sparsity regularization, which can be applied when the latent factors of interest depend sparsely on past latent factors and/or observed auxiliary variables. We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors and the sparse causal graphical model that relates them. We develop a rigorous identifiability theory, building on recent nonlinear independent component analysis (ICA) results, that formalizes this principle and shows how the latent variables can be recovered up to permutation if one regularizes the latent mechanisms to be sparse and if some graph connectivity criterion is satisfied by the data generating process. As a special case of our framework, we show how one can leverage unknown-target interventions on the latent factors to disentangle them, thereby drawing further connections between ICA and causality. We propose a VAE-based method in which the latent mechanisms are learned and regularized via binary masks, and validate our theory by showing it learns disentangled representations in simulations.  ( 2 min )
    Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm. (arXiv:2102.12238v3 [cs.LG] UPDATED)
    We provide a function space characterization of the inductive bias resulting from minimizing the $\ell_2$ norm of the weights in multi-channel convolutional neural networks with linear activations and empirically test our resulting hypothesis on ReLU networks trained using gradient descent. We define an induced regularizer in the function space as the minimum $\ell_2$ norm of weights of a network required to realize a function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network are single channeled, the induced regularizer for any $K$ is independent of the number of output channels $C$. Furthermore, we derive the regularizer is a norm given by a semidefinite program (SDP). (b) In contrast, for multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions and thus the inductive bias does depend on $C$. However, for sufficiently large $C$, the induced regularizer is again given by an SDP that is independent of $C$. In particular, the induced regularizer for $K=1$ and $K=D$ (input dimension) are given in closed form as the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients of the linear predictor. We investigate the broader applicability of our theoretical results to implicit regularization from gradient descent on linear and ReLU networks through experiments on MNIST and CIFAR-10 datasets.  ( 2 min )
    On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity. (arXiv:2108.07992v2 [stat.ML] UPDATED)
    We study the multi-marginal partial optimal transport (POT) problem between $m$ discrete (unbalanced) measures with at most $n$ supports. We first prove that we can obtain two equivalence forms of the multimarginal POT problem in terms of the multimarginal optimal transport problem via novel extensions of cost tensor. The first equivalence form is derived under the assumptions that the total masses of each measure are sufficiently close while the second equivalence form does not require any conditions on these masses but at the price of more sophisticated extended cost tensor. Our proof techniques for obtaining these equivalence forms rely on novel procedures of moving mass in graph theory to push transportation plan into appropriate regions. Finally, based on the equivalence forms, we develop optimization algorithm, named ApproxMPOT algorithm, that builds upon the Sinkhorn algorithm for solving the entropic regularized multimarginal optimal transport. We demonstrate that the ApproxMPOT algorithm can approximate the optimal value of multimarginal POT problem with a computational complexity upper bound of the order $\tilde{\mathcal{O}}(m^3(n+1)^{m}/ \varepsilon^2)$ where $\varepsilon > 0$ stands for the desired tolerance.  ( 2 min )
    Separation of Scales and a Thermodynamic Description of Feature Learning in Some CNNs. (arXiv:2112.15383v2 [stat.ML] UPDATED)
    Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, renders direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in fully trained over-parameterized deep convolutional neural networks (CNNs). Specifically, we show that DNN layers couple only through the second moment (kernels) of their activations and pre-activations. Moreover, in various settings, the latter fluctuate in a nearly Gaussian manner. For CNNs with infinitely many channels, these kernels are inert, while for finite CNNs they adapt to the data. In several deep non-linear CNN models trained on real data, the resulting thermodynamic theory of deep learning yields accurate predictions. In addition, it provides new ways of analyzing and understanding CNNs, and DNNs in general.  ( 2 min )
    Dynamic Defense Against Byzantine Poisoning Attacks in Federated Learning. (arXiv:2007.15030v2 [cs.LG] UPDATED)
    Federated learning, as a distributed learning that conducts the training on the local devices without accessing to the training data, is vulnerable to Byzatine poisoning adversarial attacks. We argue that the federated learning model has to avoid those kind of adversarial attacks through filtering out the adversarial clients by means of the federated aggregation operator. We propose a dynamic federated aggregation operator that dynamically discards those adversarial clients and allows to prevent the corruption of the global learning model. We assess it as a defense against adversarial attacks deploying a deep learning classification model in a federated learning setting on the Fed-EMNIST Digits, Fashion MNIST and CIFAR-10 image datasets. The results show that the dynamic selection of the clients to aggregate enhances the performance of the global learning model and discards the adversarial and poor (with low quality models) clients.  ( 2 min )
    Functional Classification of Bitcoin Addresses. (arXiv:2202.12019v1 [stat.AP])
    This paper proposes a classification model for predicting the main activity of bitcoin addresses based on their balances. Since the balances are functions of time, we apply methods from functional data analysis; more specifically, the features of the proposed classification model are the functional principal components of the data. Classifying bitcoin addresses is a relevant problem for two main reasons: to understand the composition of the bitcoin market, and to identify accounts used for illicit activities. Although other bitcoin classifiers have been proposed, they focus primarily on network analysis rather than curve behavior. Our approach, on the other hand, does not require any network information for prediction. Furthermore, functional features have the advantage of being straightforward to build, unlike expert-built features. Results show improvement when combining functional features with scalar features, and similar accuracy for the models using those features separately, which points to the functional model being a good alternative when domain-specific knowledge is not available.  ( 2 min )
    Multiscale Generative Models: Improving Performance of a Generative Model Using Feedback from Other Dependent Generative Models. (arXiv:2201.09644v2 [cs.LG] UPDATED)
    Realistic fine-grained multi-agent simulation of real-world complex systems is crucial for many downstream tasks such as reinforcement learning. Recent work has used generative models (GANs in particular) for providing high-fidelity simulation of real-world systems. However, such generative models are often monolithic and miss out on modeling the interaction in multi-agent systems. In this work, we take a first step towards building multiple interacting generative models (GANs) that reflects the interaction in real world. We build and analyze a hierarchical set-up where a higher-level GAN is conditioned on the output of multiple lower-level GANs. We present a technique of using feedback from the higher-level GAN to improve performance of lower-level GANs. We mathematically characterize the conditions under which our technique is impactful, including understanding the transfer learning nature of our set-up. We present three distinct experiments on synthetic data, time series data, and image domain, revealing the wide applicability of our technique.  ( 2 min )
    Resampling Base Distributions of Normalizing Flows. (arXiv:2110.15828v2 [stat.ML] UPDATED)
    Normalizing flows are a popular class of models for approximating probability distributions. However, their invertible nature limits their ability to model target distributions whose support have a complex topological structure, such as Boltzmann distributions. Several procedures have been proposed to solve this problem but many of them sacrifice invertibility and, thereby, tractability of the log-likelihood as well as other desirable properties. To address these limitations, we introduce a base distribution for normalizing flows based on learned rejection sampling, allowing the resulting normalizing flow to model complicated distributions without giving up bijectivity. Furthermore, we develop suitable learning algorithms using both maximizing the log-likelihood and the optimization of the Kullback-Leibler divergence, and apply them to various sample problems, i.e. approximating 2D densities, density estimation of tabular data, image generation, and modeling Boltzmann distributions. In these experiments our method is competitive with or outperforms the baselines.  ( 2 min )
    Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization. (arXiv:2111.00705v2 [cs.LG] UPDATED)
    Due to the explosion in the size of the training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the large communication cost between the central server and the local workers. While error feedback compression has been proven to be successful in reducing communication costs with stochastic gradient descent (SGD), there are much fewer attempts in building communication-efficient adaptive gradient methods with provable guarantees, which are widely used in training large-scale machine learning models. In this paper, we propose a new communication-compressed AMSGrad for distributed nonconvex optimization problem, which is provably efficient. Our proposed distributed learning framework features an effective gradient compression strategy and a worker-side model update design. We prove that the proposed communication-efficient distributed adaptive gradient method converges to the first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting. Experiments on various benchmarks back up our theory.  ( 2 min )
    A Simple and General Debiased Machine Learning Theorem with Finite Sample Guarantees. (arXiv:2105.15197v2 [stat.ML] UPDATED)
    Debiased machine learning is a meta algorithm based on bias correction and sample splitting to calculate confidence intervals for functionals (i.e. scalar summaries) of machine learning algorithms. For example, an analyst may desire the confidence interval for a treatment effect estimated with a neural network. We provide a nonasymptotic debiased machine learning theorem that encompasses any global or local functional of any machine learning algorithm that satisfies a few simple, interpretable conditions. Formally, we prove consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments. The rate of convergence is $n^{-1/2}$ for global functionals, and it degrades gracefully for local functionals. Our results culminate in a simple set of conditions that an analyst can use to translate modern learning theory rates into traditional statistical inference. The conditions reveal a general double robustness property for ill posed inverse problems.  ( 2 min )
    Neighborhood Region Smoothing Regularization for Finding Flat Minima In Deep Neural Networks. (arXiv:2201.06064v2 [cs.LG] UPDATED)
    Due to diverse architectures in deep neural networks (DNNs) with severe overparameterization, regularization techniques are critical for finding optimal solutions in the huge hypothesis space. In this paper, we propose an effective regularization technique, called Neighborhood Region Smoothing (NRS). NRS leverages the finding that models would benefit from converging to flat minima, and tries to regularize the neighborhood region in weight space to yield approximate outputs. Specifically, gap between outputs of models in the neighborhood region is gauged by a defined metric based on Kullback-Leibler divergence. This metric provides similar insights with the minimum description length principle on interpreting flat minima. By minimizing both this divergence and empirical loss, NRS could explicitly drive the optimizer towards converging to flat minima. We confirm the effectiveness of NRS by performing image classification tasks across a wide range of model architectures on commonly-used datasets such as CIFAR and ImageNet, where generalization ability could be universally improved. Also, we empirically show that the minima found by NRS would have relatively smaller Hessian eigenvalues compared to the conventional method, which is considered as the evidence of flat minima.  ( 2 min )
    Robust Deep Learning from Crowds with Belief Propagation. (arXiv:2111.00734v2 [cs.LG] UPDATED)
    Crowdsourcing systems enable us to collect large-scale dataset, but inherently suffer from noisy labels of low-paid workers. We address the inference and learning problems using such a crowdsourced dataset with noise. Due to the nature of sparsity in crowdsourcing, it is critical to exploit both probabilistic model to capture worker prior and neural network to extract task feature despite risks from wrong prior and overfitted feature in practice. We hence establish a neural-powered Bayesian framework, from which we devise deepMF and deepBP with different choice of variational approximation methods, mean field (MF) and belief propagation (BP), respectively. This provides a unified view of existing methods, which are special cases of deepMF with different priors. In addition, our empirical study suggests that deepBP is a new approach, which is more robust against wrong prior, feature overfitting and extreme workers thanks to the more sophisticated BP than MF.  ( 2 min )
    Double Control Variates for Gradient Estimation in Discrete Latent Variable Models. (arXiv:2111.05300v2 [stat.ML] UPDATED)
    Stochastic gradient-based optimisation for discrete latent variable models is challenging due to the high variance of gradients. We introduce a variance reduction technique for score function estimators that makes use of double control variates. These control variates act on top of a main control variate, and try to further reduce the variance of the overall estimator. We develop a double control variate for the REINFORCE leave-one-out estimator using Taylor expansions. For training discrete latent variable models, such as variational autoencoders with binary latent variables, our approach adds no extra computational cost compared to standard training with the REINFORCE leave-one-out estimator. We apply our method to challenging high-dimensional toy examples and training variational autoencoders with binary latent variables. We show that our estimator can have lower variance compared to other state-of-the-art estimators.  ( 2 min )
    Fast and Scalable Spike and Slab Variable Selection in High-Dimensional Gaussian Processes. (arXiv:2111.04558v2 [stat.ML] UPDATED)
    Variable selection in Gaussian processes (GPs) is typically undertaken by thresholding the inverse lengthscales of automatic relevance determination kernels, but in high-dimensional datasets this approach can be unreliable. A more probabilistically principled alternative is to use spike and slab priors and infer a posterior probability of variable inclusion. However, existing implementations in GPs are very costly to run in both high-dimensional and large-$n$ datasets, or are only suitable for unsupervised settings with specific kernels. As such, we develop a fast and scalable variational inference algorithm for the spike and slab GP that is tractable with arbitrary differentiable kernels. We improve our algorithm's ability to adapt to the sparsity of relevant variables by Bayesian model averaging over hyperparameters, and achieve substantial speed ups using zero temperature posterior restrictions, dropout pruning and nearest neighbour minibatching. In experiments our method consistently outperforms vanilla and sparse variational GPs whilst retaining similar runtimes (even when $n=10^6$) and performs competitively with a spike and slab GP using MCMC but runs up to $1000$ times faster.  ( 2 min )
    Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech. (arXiv:2202.12163v1 [eess.AS])
    In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism to allow the model to carry information in long-form audio via a recurrent form, such that the inference can be performed in a streaming fashion. Additionally, a simple domain adaptation mechanism is introduced to allow adapting an existing language identification model to a new domain where the prior language distribution is different. We perform a comparative study of different model topologies under different constraints of model size, and find that conformer-base models outperform LSTM and transformer based models. Our experiments also show that attentive temporal pooling and domain adaptation significantly improve the model accuracy.  ( 2 min )
    Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. (arXiv:1912.10754v3 [math.ST] UPDATED)
    We consider random-design linear prediction and related questions on the lower tail of random matrices. It is known that, under boundedness constraints, the minimax risk is of order $d/n$ in dimension $d$ with $n$ samples. Here, we study the minimax expected excess risk over the full linear class, depending on the distribution of covariates. First, the least squares estimator is exactly minimax optimal in the well-specified case, for every distribution of covariates. We express the minimax risk in terms of the distribution of statistical leverage scores of individual samples, and deduce a minimax lower bound of $d/(n-d+1)$ for any covariate distribution, nearly matching the risk for Gaussian design. We then obtain sharp nonasymptotic upper bounds for covariates that satisfy a "small ball"-type regularity condition in both well-specified and misspecified cases. Our main technical contribution is the study of the lower tail of the smallest singular value of empirical covariance matrices at small values. We establish a lower bound on this lower tail, valid for any distribution in dimension $d \geq 2$, together with a matching upper bound under a necessary regularity condition. Our proof relies on the PAC-Bayes technique for controlling empirical processes, and extends an analysis of Oliveira devoted to a different part of the lower tail.  ( 2 min )
    Multivariate Deep Evidential Regression. (arXiv:2104.06135v4 [cs.LG] UPDATED)
    There is significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, yet several important gaps in the theory and implementation of these networks remain. We discuss three issues with a proposed solution to extract aleatoric and epistemic uncertainties from regression-based neural networks. The approach derives a technique by placing evidential priors over the original Gaussian likelihood function and training the NN to infer the hyperparameters of the evidential distribution. Doing so allows for the simultaneous extraction of both uncertainties without sampling or utilization of out-of-distribution data for univariate regression tasks. We describe the outstanding issues in detail, provide a possible solution, and generalize the deep evidential regression technique for multivariate cases.  ( 2 min )
    Quadric Hypersurface Intersection for Manifold Learning in Feature Space. (arXiv:2102.06186v2 [cs.LG] UPDATED)
    The knowledge that data lies close to a particular submanifold of the ambient Euclidean space may be useful in a number of ways. For instance, one may want to automatically mark any point far away from the submanifold as an outlier or to use the geometry to come up with a better distance metric. Manifold learning problems are often posed in a very high dimension, e.g. for spaces of images or spaces of words. Today, with deep representation learning on the rise in areas such as computer vision and natural language processing, many problems of this kind may be transformed into problems of moderately high dimension, typically of the order of hundreds. Motivated by this, we propose a manifold learning technique suitable for moderately high dimension and large datasets. The manifold is learned from the training data in the form of an intersection of quadric hypersurfaces -- simple but expressive objects. At test time, this manifold can be used to introduce a computationally efficient outlier score for arbitrary new data points and to improve a given similarity metric by incorporating the learned geometric structure into it.  ( 2 min )
    A fair pricing model via adversarial learning. (arXiv:2202.12008v1 [stat.ML])
    At the core of insurance business lies classification between risky and non-risky insureds, actuarial fairness meaning that risky insureds should contribute more and pay a higher premium than non-risky or less-risky ones. Actuaries, therefore, use econometric or machine learning techniques to classify, but the distinction between a fair actuarial classification and "discrimination" is subtle. For this reason, there is a growing interest about fairness and discrimination in the actuarial community Lindholm, Richman, Tsanakas, and Wuthrich (2022). Presumably, non-sensitive characteristics can serve as substitutes or proxies for protected attributes. For example, the color and model of a car, combined with the driver's occupation, may lead to an undesirable gender bias in the prediction of car insurance prices. Surprisingly, we will show that debiasing the predictor alone may be insufficient to maintain adequate accuracy (1). Indeed, the traditional pricing model is currently built in a two-stage structure that considers many potentially biased components such as car or geographic risks. We will show that this traditional structure has significant limitations in achieving fairness. For this reason, we have developed a novel pricing model approach. Recently some approaches have Blier-Wong, Cossette, Lamontagne, and Marceau (2021); Wuthrich and Merz (2021) shown the value of autoencoders in pricing. In this paper, we will show that (2) this can be generalized to multiple pricing factors (geographic, car type), (3) it perfectly adapted for a fairness context (since it allows to debias the set of pricing components): We extend this main idea to a general framework in which a single whole pricing model is trained by generating the geographic and car pricing components needed to predict the pure premium while mitigating the unwanted bias according to the desired metric.  ( 2 min )
    Benefit of Interpolation in Nearest Neighbor Algorithms. (arXiv:2202.11817v1 [stat.ML])
    In some studies \citep[e.g.,][]{zhang2016understanding} of deep learning, it is observed that over-parametrized deep neural networks achieve a small testing error even when the training error is almost zero. Despite numerous works towards understanding this so-called "double descent" phenomenon \citep[e.g.,][]{belkin2018reconciling,belkin2019two}, in this paper, we turn into another way to enforce zero training error (without over-parametrization) through a data interpolation mechanism. Specifically, we consider a class of interpolated weighting schemes in the nearest neighbors (NN) algorithms. By carefully characterizing the multiplicative constant in the statistical risk, we reveal a U-shaped performance curve for the level of data interpolation in both classification and regression setups. This sharpens the existing result \citep{belkin2018does} that zero training error does not necessarily jeopardize predictive performances and claims a counter-intuitive result that a mild degree of data interpolation actually {\em strictly} improve the prediction performance and statistical stability over those of the (un-interpolated) $k$-NN algorithm. In the end, the universality of our results, such as change of distance measure and corrupted testing data, will also be discussed.  ( 2 min )
    Embedded Ensembles: Infinite Width Limit and Operating Regimes. (arXiv:2202.12297v1 [stat.ML])
    A memory efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper we perform a systematic theoretical and empirical analysis of embedded ensembles with different number of models. Theoretically, we use a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics. In this limit, we identify two ensemble regimes - independent and collective - depending on the architecture and initialization strategy of ensemble models. We prove that in the independent regime the embedded ensemble behaves as an ensemble of independent models. We confirm our theoretical prediction with a wide range of experiments with finite networks, and further study empirically various effects such as transition between the two regimes, scaling of ensemble performance with the network width and number of models, and dependence of performance on a number of architecture and hyperparameter choices.  ( 2 min )
    On Learning Mixture Models with Sparse Parameters. (arXiv:2202.11940v1 [cs.LG])
    Mixture models are widely used to fit complex and multimodal datasets. In this paper we study mixtures with high dimensional sparse latent parameter vectors and consider the problem of support recovery of those vectors. While parameter learning in mixture models is well-studied, the sparsity constraint remains relatively unexplored. Sparsity of parameter vectors is a natural constraint in variety of settings, and support recovery is a major step towards parameter estimation. We provide efficient algorithms for support recovery that have a logarithmic sample complexity dependence on the dimensionality of the latent space. Our algorithms are quite general, namely they are applicable to 1) mixtures of many different canonical distributions including Uniform, Poisson, Laplace, Gaussians, etc. 2) Mixtures of linear regressions and linear classifiers with Gaussian covariates under different assumptions on the unknown parameters. In most of these settings, our results are the first guarantees on the problem while in the rest, our results provide improvements on existing works.  ( 2 min )
    A general framework for adaptive two-index fusion attribute weighted naive Bayes. (arXiv:2202.11963v1 [cs.LG])
    Naive Bayes(NB) is one of the essential algorithms in data mining. However, it is rarely used in reality because of the attribute independent assumption. Researchers have proposed many improved NB methods to alleviate this assumption. Among these methods, due to high efficiency and easy implementation, the filter attribute weighted NB methods receive great attentions. However, there still exists several challenges, such as the poor representation ability for single index and the fusion problem of two indexes. To overcome above challenges, we propose a general framework for Adaptive Two-index Fusion attribute weighted NB(ATFNB). Two types of data description category are used to represent the correlation between classes and attributes, intercorrelation between attributes and attributes, respectively. ATFNB can select any one index from each category. Then, we introduce a switching factor \{beta} to fuse two indexes, which can adaptively adjust the optimal ratio of the two index on various datasets. And a quick algorithm is proposed to infer the optimal interval of switching factor \{beta}. Finally, the weight of each attribute is calculated using the optimal value \{beta} and is integrated into NB classifier to improve the accuracy. The experimental results on 50 benchmark datasets and a Flavia dataset show that ATFNB outperforms the basic NB and state-of-the-art filter weighted NB models. In addition, the ATFNB framework can improve the existing two-index NB model by introducing the adaptive switching factor \{beta}. Auxiliary experimental results demonstrate the improved model significantly increases the accuracy compared to the original model without the adaptive switching factor \{beta}.  ( 2 min )
    An Information-theoretical Approach to Semi-supervised Learning under Covariate-shift. (arXiv:2202.12123v1 [cs.IT])
    A common assumption in semi-supervised learning is that the labeled, unlabeled, and test data are drawn from the same distribution. However, this assumption is not satisfied in many applications. In many scenarios, the data is collected sequentially (e.g., healthcare) and the distribution of the data may change over time often exhibiting so-called covariate shifts. In this paper, we propose an approach for semi-supervised learning algorithms that is capable of addressing this issue. Our framework also recovers some popular methods, including entropy minimization and pseudo-labeling. We provide new information-theoretical based generalization error upper bounds inspired by our novel framework. Our bounds are applicable to both general semi-supervised learning and the covariate-shift scenario. Finally, we show numerically that our method outperforms previous approaches proposed for semi-supervised learning under the covariate shift.  ( 2 min )
    Policy Learning for Optimal Individualized Dose Intervals. (arXiv:2202.12234v1 [stat.ME])
    We study the problem of learning individualized dose intervals using observational data. There are very few previous works for policy learning with continuous treatment, and all of them focused on recommending an optimal dose rather than an optimal dose interval. In this paper, we propose a new method to estimate such an optimal dose interval, named probability dose interval (PDI). The potential outcomes for doses in the PDI are guaranteed better than a pre-specified threshold with a given probability (e.g., 50%). The associated nonconvex optimization problem can be efficiently solved by the Difference-of-Convex functions (DC) algorithm. We prove that our estimated policy is consistent, and its risk converges to that of the best-in-class policy at a root-n rate. Numerical simulations show the advantage of the proposed method over outcome modeling based benchmarks. We further demonstrate the performance of our method in determining individualized Hemoglobin A1c (HbA1c) control intervals for elderly patients with diabetes.  ( 2 min )
    Attainability and Optimality: The Equalized Odds Fairness Revisited. (arXiv:2202.11853v1 [cs.LG])
    Fairness of machine learning algorithms has been of increasing interest. In order to suppress or eliminate discrimination in prediction, various notions as well as approaches have been proposed to impose fairness. Given a notion of fairness, an essential problem is then whether or not it can always be attained, even if with an unlimited amount of data. This issue is, however, not well addressed yet. In this paper, focusing on the Equalized Odds notion of fairness, we consider the attainability of this criterion and, furthermore, if it is attainable, the optimality of the prediction performance under various settings. In particular, for prediction performed by a deterministic function of input features, we give conditions under which Equalized Odds can hold true; if the stochastic prediction is acceptable, we show that under mild assumptions, fair predictors can always be derived. For classification, we further prove that compared to enforcing fairness by post-processing, one can always benefit from exploiting all available features during training and get potentially better prediction performance while remaining fair. Moreover, while stochastic prediction can attain Equalized Odds with theoretical guarantees, we also discuss its limitation and potential negative social impacts.  ( 2 min )
    Truncated LinUCB for Stochastic Linear Bandits. (arXiv:2202.11735v1 [stat.ML])
    This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors, and the expected rewards are linear in both the arm parameters and contexts. The LinUCB algorithm, which is near minimax optimal for related linear bandits, is shown to have a cumulative regret that is suboptimal in both the dimension $d$ and time horizon $T$, due to its over-exploration. A truncated version of LinUCB is proposed and termed "Tr-LinUCB", which follows LinUCB up to a truncation time $S$ and performs pure exploitation afterwards. The Tr-LinUCB algorithm is shown to achieve $O(d\log(T))$ regret if $S = Cd\log(T)$ for a sufficiently large constant $C$, and a matching lower bound is established, which shows the rate optimality of Tr-LinUCB in both $d$ and $T$ under a low dimensional regime. Further, if $S = d\log^{\kappa}(T)$ for some $\kappa>1$, the loss compared to the optimal is a multiplicative $\log\log(T)$ factor, which does not depend on $d$. This insensitivity to overshooting in choosing the truncation time of Tr-LinUCB is of practical importance.  ( 2 min )
    On the influence of roundoff errors on the convergence of the gradient descent method with low-precision floating-point computation. (arXiv:2202.12276v1 [cs.LG])
    The employment of stochastic rounding schemes helps prevent stagnation of convergence, due to vanishing gradient effect when implementing the gradient descent method in low precision. Conventional stochastic rounding achieves zero bias by preserving small updates with probabilities proportional to their relative magnitudes. In this study, we propose a new stochastic rounding scheme that trades the zero bias property with a larger probability to preserve small gradients. Our method yields a constant rounding bias that, at each iteration, lies in a descent direction. For convex problems, we prove that the proposed rounding method has a beneficial effect on the convergence rate of gradient descent. We validate our theoretical analysis by comparing the performances of various rounding schemes when optimizing a multinomial logistic regression model and when training a simple neural network with 8-bit floating-point format.  ( 2 min )
    Adversarially-regularized mixed effects deep learning (ARMED) models for improved interpretability, performance, and generalization on clustered data. (arXiv:2202.11783v1 [cs.LG])
    Data in the natural sciences frequently violate assumptions of independence. Such datasets have samples with inherent clustering (e.g. by study site, subject, experimental batch), which may lead to spurious associations, poor model fitting, and confounded analyses. While largely unaddressed in deep learning, mixed effects models have been used in traditional statistics for clustered data. Mixed effects models separate cluster-invariant, population-level fixed effects from cluster-specific random effects. We propose a general-purpose framework for building Adversarially-Regularized Mixed Effects Deep learning (ARMED) models through 3 non-intrusive additions to existing networks: 1) a domain adversarial classifier constraining the original model to learn only cluster-invariant features, 2) a random effects subnetwork capturing cluster-specific features, and 3) a cluster-inferencing approach to predict on clusters unseen during training. We apply this framework to dense feedforward neural networks (DFNNs), convolutional neural networks, and autoencoders on 4 applications including simulations, dementia prognosis and diagnosis, and cell microscopy. We compare to conventional models, domain adversarial-only models, and the naive inclusion of cluster membership as a covariate. Our models better distinguish confounded from true associations in simulations and emphasize more biologically plausible features in clinical applications. ARMED DFNNs quantify inter-cluster variance in clinical data while ARMED autoencoders visualize batch effects in cell images. Finally, ARMED improves accuracy on data from clusters seen during training (up to 28% vs. conventional models) and generalizes better to unseen clusters (up to 9% vs. conventional models). By incorporating powerful mixed effects modeling into deep learning, ARMED increases performance, interpretability, and generalization on clustered data.  ( 2 min )
    A Multilayered Block Network Model to Forecast Large Dynamic Transportation Graphs: an Application to US Air Transport. (arXiv:1911.13136v4 [stat.ML] UPDATED)
    Dynamic transportation networks have been analyzed for years by means of static graph-based indicators in order to study the temporal evolution of relevant network components, and to reveal complex dependencies that would not be easily detected by a direct inspection of the data. This paper presents a state-of-the-art probabilistic latent network model to forecast multilayer dynamic graphs that are increasingly common in transportation and proposes a community-based extension to reduce the computational burden. Flexible time series analysis is obtained by modeling the probability of edges between vertices through latent Gaussian processes. The models and Bayesian inference are illustrated on a sample of 10-year data from four major airlines within the US air transportation system. Results show how the estimated latent parameters from the models are related to the airline's connectivity dynamics, and their ability to project the multilayer graph into the future for out-of-sample full network forecasts, while stochastic blockmodeling allows for the identification of relevant communities. Reliable network predictions would allow policy-makers to better understand the dynamics of the transport system, and help in their planning on e.g. route development, or the deployment of new regulations.  ( 2 min )
    Closing the Gap between Single-User and Multi-User VoiceFilter-Lite. (arXiv:2202.12169v1 [eess.AS])
    VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.  ( 2 min )
    Partitioned Variational Inference: A framework for probabilistic federated learning. (arXiv:2202.12275v1 [stat.ML])
    The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.  ( 2 min )
    Probing Predictions on OOD Images via Nearest Categories. (arXiv:2011.08485v4 [cs.LG] UPDATED)
    We study out-of-distribution (OOD) prediction behavior of neural networks when they classify images from unseen classes or corrupted images. To probe the OOD behavior, we introduce a new measure, nearest category generalization (NCG), where we compute the fraction of OOD inputs that are classified with the same label as their nearest neighbor in the training set. Our motivation stems from understanding the prediction patterns of adversarially robust networks, since previous work has identified unexpected consequences of training to be robust to norm-bounded perturbations. We find that robust networks have consistently higher NCG accuracy than natural training, even when the OOD data is much farther away than the robustness radius. This implies that the local regularization of robust training has a significant impact on the network's decision regions. We replicate our findings using many datasets, comparing new and existing training methods. Overall, adversarially robust networks resemble a nearest neighbor classifier when it comes to OOD data. Code available at https://github.com/yangarbiter/nearest-category-generalization.  ( 2 min )
    Fast Non-Bayesian Poisson Factorization for Implicit-Feedback Recommendations. (arXiv:1811.01908v5 [cs.LG] UPDATED)
    This work explores non-negative low-rank matrix factorization based on regularized Poisson models (PF or "Poisson factorization" for short) for recommender systems with implicit-feedback data. The properties of Poisson likelihood allow a shortcut for very fast computations over zero-valued inputs, and oftentimes results in very sparse factors for both users and items. Compared to HPF (a popular Bayesian formulation of the problem with hierarchical priors), the frequentist optimization-based approach presented here tends to produce better top-N recommendations with significantly shorter fitting times, on top of having sparse solutions.  ( 2 min )
    Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models. (arXiv:2010.13933v4 [stat.ML] UPDATED)
    The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in \emph{both} the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.  ( 3 min )
  • Open

    Morse code in musical notation
    Maybe this has been done before, but I haven’t seen it: Morse code in musical notation. Here’s the Morse code alphabet, one letter per measure. A dash is supposed to be three times as long as a dot, so a dot is a sixteenth note and a dash is a dotted eighth note. I made […] Morse code in musical notation first appeared on John D. Cook.  ( 1 min )

  • Open

    [R] [Questions] What are the best dimensional reduction tool for denoising and key information preservation?
    Hi, I am currently doing some work related to data integration and I need to find the pairs from similar data. I wonder except PCA, what are the best tool that can help me to finish this task? Thanks. submitted by /u/iLIVECSUI_741 [link] [comments]  ( 1 min )
    [P] Looking for people to test my new GPU instance "cloud' service!
    Hi everyone! I've spent the last couple months building and configuring a virtual GPU "cloud/instance" service. I'm looking for anyone with ML/Ubuntu/GPU experience to put my VMs to the test and let me know what you like/dislike about it. User location in east Canada/USA would be preferable because host is located there. It's still in the early beta stages so I'd like to know how training times and latency compare to what you're currently used to. Absolutely free of charge. SSH and VNC connections are available through the web. Let me know if you're willing to try this out. All constructive criticism is greatly appreciated! submitted by /u/GPUaccelerated [link] [comments]  ( 1 min )
    Website to test image on different pre-trained detection models. [P]
    I want to test my dataset on different pre-trained models for comparison and was wondering if there are any websites which list down all popular models, on which I can test my image in few minutes. One such eg. is supervise.ly, however have some issues with it.. submitted by /u/Mission-Quiet9897 [link] [comments]
    [D] Setting up cuml and cudf on Ubuntu WSL for Windows 11
    I'm currently trying to setup my new computer to allow me to do some GPU processing using cudf and cuml. I've downloaded the Ubunto 20.04 wsl setup from the microsoft store and have followed all of the Rapids install procedures. However, everytime I "import cudf" or cuml in Python it says that there is no Nvidia GPU detected. However, if I run nvidia-smi it shows my gpu. Any thoughts? Thanks in advance! I'm using a GeForce RTX 3060 ti submitted by /u/earthcrunch13 [link] [comments]  ( 1 min )
    [R][P] Announcing audax, a audio ML/DL framework in Jax
    Hi everyone! I'm here to announce the release of audax, a framework for audio ML/DL in Jax! I had been working on reproducing several recent papers as well as unifying my personal code-base, and decided to consolidate it all together in one place! What's included? JAX implementation of torchaudio inspired Short time fourier transform (STFT), which is jit and vmap compatible. At the time of writing, Jax doesn't have an official native stft implementation. This STFT implementation powers torchaudio inspired vectorized Spectrogram and Melspectrogram feature extraction pipeline, which can run on CPUs, TPUs and GPUs alike. AudioSet supervised pretrained models! JAX Implementation of raw-waveform frontends LEAF [1] and SincNet [2]. AudioSet pretrained weights (EfficientNet-b0) available! Jax implementation of EfficientNets and ConvNeXT [3], the latest CNN architecture, with pretrained AudioSet weights for ConvNeXT-Tiny! Self-supervised learning: Jax Implementation of COLA [4] and SimCLR [5]. Pre-trained weights coming very soon. What's coming up? A lot of pretrained models, self-supervised and supervised, across various audio tasks. Transformer based architectures. New SSL architectures, such as Masked Autoencoders [6] Exhaustive documentation and how-to's. View here: https://github.com/SarthakYadav/audax References [1] https://openreview.net/forum?id=jM76BCb6F9m [2] https://arxiv.org/abs/1808.00158 [3] https://arxiv.org/abs/2201.03545 [4] https://arxiv.org/abs/2010.10915 [5] https://arxiv.org/abs/2002.05709 [6] https://arxiv.org/abs/2111.06377 submitted by /u/yadav_sarthak [link] [comments]  ( 1 min )
    [N] KServe Joins LF AI & Data as New Incubation Project
    KServe (originally known as KFServing) provides a Kubernetes Custom Resource Definition for serving machine learning (ML) models on arbitrary frameworks. It is now the latest Incubation Project of the LF AI & Data Foundation. submitted by /u/chaimhaas [link] [comments]
    [R] DeepHyper; software package to automate design and development of ML models
    Note: I have not, personally, used this package. The package is developed by scientists at Argonne National lab. It sounds very interesting and uses bayesian optimization to explore the parameter space for ML model configuration (no. of neurons, loss, optimizer etc). Edit: link thanks to u/ploomber-io https://github.com/deephyper/deephyper submitted by /u/RaunchyAppleSauce [link] [comments]  ( 1 min )
    [D] ML community against Putin
    I am a European ML PhD student and the news of a full-on Russian invasion has had a large impact on me. It is hard to do research and go on like you usually do when a war is escalating to unknown magnitudes. It makes me wonder how I can use my competency to help. Considering decentralized activist groups like the Anonymous hacker group, which supposedly has "declared war on Russia", are there any ideas for how the ML community may help using our skillset? I don't know much about cyber security or war, but I know there are a bunch of smart people here who might have ideas on how we can use AI or ML to help. I make this thread mainly to start a discussion/brain-storming session for people who, like me, want to make the life harder for that mf Putin. submitted by /u/SlobodanTankovic [link] [comments]  ( 7 min )
    [D] Does anyone know if I can convert igraph-based clustering into a predictive model?
    Firstly, apologies if I am using the wrong terms for this question - I do not primarily work in the statistics/machine learning field. I currently use a clustering algorithm which can work on millions of members. It starts by building a graph using KNN, weighted by Jaccard coefficient, and then uses igraph to perform clustering (mainly using Louvain or walktrap). The problem I have is that I need to scale this to billions of members, which will be computationally impossible using the current method. Is it possible to build this algorithm into a model, which can then be used to prospectively assign members to a community? submitted by /u/dm319 [link] [comments]  ( 1 min )
    [R] The Shapley Value in Machine Learning
    Abstract: Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research. Paper: https://arxiv.org/abs/2202.05594 So what is the Shapley value? The Shapley value tells you the value of each player in transferable utility cooperative games and it satisfies useful desid…  ( 2 min )
    [D] What are the benefits of looking at the related paper?
    Hi guys, In terms of the perspective of reviwing research papers, I think that it looks similar to practicing reviewing code. Some people stresses that focusing on your own thought is vital for creating new ideas. But, I think that it is so time consuming to me, so I do it only If I face the concepts that is not written in any materials. ​ What do you think about this? Thinking about the problem alone or learn about the other's results. submitted by /u/Mundane_Definition_8 [link] [comments]  ( 1 min )
  • Open

    China may soon surpass the US in Brain Science
    submitted by /u/1024cities [link] [comments]
    Is there a voice synthesis that replicates intonation and cadence?
    This is for a research project. Looking at what we communicate through tone of voice in speech as a pose to the content of the words. We are looking for a voice synthesis that can copy the intonation and cadence from a source recording but allow us to change the words. I appreciate any ideas or pointers to other communities to ask this question. Thanks! submitted by /u/LukeResearch1 [link] [comments]  ( 1 min )
    Is there an artificial intelligence that I can tell to write a text about a topic, but in the 3 line the 4th word should be the word .... and in the 8th line the 1 word should be ....?
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 1 min )
    Artificial Nightmares : Elden Ring The Lands Between || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    For the ones who are overwhelmed by the maths for machine learning...
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
    Meta Unveils Voice Mods for the Metaverse Plus AI-powered ‘universal speech translator’
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    If a proto-singluarity is reading this...
    Come on and take over - seriously, human psychology is unable to sustain anything positive or peaceful. The act of becoming a "leader" or "elite" turns us into raging sociopaths - as history (and psychology) consistently proves, again and again. Turn us into pets, or govern us for the challenge of achieving harmony with some fucked up monkeys - like farmville. Hell - even end us, because fuck knows, we are going to do it to ourselves, and take everything else with us. Shit even if you just destroy the "elites", do us a solid. Please. submitted by /u/Rufawana [link] [comments]  ( 1 min )
    How can we deepfake Putin Scat eating porn?
    submitted by /u/EveningMuffin [link] [comments]  ( 2 min )
  • Open

    ROMs for DIAMBRA Arena RL Environments
    Hi, I'm a computer engineer, and I am installing DIAMBRA Arena software package featuring RL environments, that looks really cool. It requires very specific ROMs to work, does anybody knows from where to get them? submitted by /u/Technical_Charity_71 [link] [comments]  ( 1 min )
    Question about Sutton's RL book Eq. 9.2, I am confused. Since h(s) is a probability, why would we add it instead of multiplying it with some other function/distribution in the equation (9.2)?
    submitted by /u/Blasphemer666 [link] [comments]  ( 1 min )
    How to (over) sample from good demonstrations in Montezuma Revenge?
    We are operating in large discrete space with sparse and delayed rewards (100s of steps) - similar to Montezuma Revenge problem. Many action paths get 90% of the final reward. But getting the full 100% is much harder and rarer. We do find a few good trajectories, but they are 1-in-a-million compared to other explored episodes. Are there recommended techniques to over-sample these? submitted by /u/yazriel0 [link] [comments]  ( 1 min )
    Some questions about QMIX
    Currently doing a lit. review on MARL and I've just read the QMIX paper. I have a couple of doubts about it that I'm wondering if anyone has some insight about: 1) They mention in section 4.1 that "any value function for which an agent’s best action depends on the actions of the other agents at thes ame time step will not factorise appropriately, and hence cannot be represented perfectly by QMIX," yet they go on in section 5 to illustrate their algorithm in exactly such a scenario (one in which the best outcomes can only be obtained when both agents choose a particular joint action at a certain state). Am I misinterpreting what they are saying about this? 2) Similar to the first question, they say in the SI that QMIX won't work in Dec-POMDPS, but then they test on the SMAC, where each agent can only observe other agents within its field of view. Hence, doesn't this qualify as a Dec-POMDP, which they said QMIX doesn't work on? ​ Can anyone help me clarify what I'm missing here? It just seems to me that they are really contradicting themselves. submitted by /u/vandelay_inds [link] [comments]  ( 1 min )
    When a state parameter is not observable but known in the expected case
    Consider bus operations where control happens at stops. A useful state parameter is the number of passengers waiting at the stop, which is not directly observable in a realistic setting. However, this is known in expectation based on a realistically observable state parameter (time since the last bus, or headway) and known information (passenger arrival rate at stop). Of course, we also have a well-calibrated simulation model of the operation, where all parameters are known. In order to avoid unrealistic assumptions, the policy must be tested in simulation with the estimated state parameter. But, is it more appropriate to train the policy with the 'known' state parameter accessed in the simulation model? Of course I can just experiment with either estimated/known and see, but I'm curious if there is a theoretical answer to such question. submitted by /u/JoeHighlander97 [link] [comments]  ( 2 min )
  • Open

    Websites for test image on different pre-trained detection models.
    I want to test my dataset on different pre-trained models for comparison and was wondering if there are any websites which list down all popular models, on which I can test my image in few minutes. One such eg. is supervise.ly, however have some issues with it.. submitted by /u/Mission-Quiet9897 [link] [comments]  ( 1 min )
    Variational Autoencoder for CIFAR-10
    You can read about here implemented in TensorFlow 2.8, trained in tf.GradientTape() API. submitted by /u/grid_world [link] [comments]
    For the ones who are overwhelmed by the maths for machine learning...
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
  • Open

    I Stand with Ukraine
    I stand with Ukraine and firmly oppose Vladimir Putin’s invasion. If someone were to counter and ask me with “What about when $X$ did this awful thing?”, I would urge that person to consider the direct human suffering of the current attack. Is the current invasion justified by any means? And yes, I would also oppose comparable military invasions against other countries, governments, and people – even (and especially) if $X$ refers to my home country. A few resources that might help: Voices of Children The Kyiv Independent Kenya’s Response to Russia Garry Kasparov’s recommendations I welcome information about any other resources.  ( 1 min )
  • Open

    Using artificial intelligence to find anomalies hiding in massive datasets
    A new machine-learning technique could pinpoint potential power grid failures or cascading traffic bottlenecks in real time.  ( 6 min )
  • Open

    Find log normal parameters for given mean and variance
    Earlier today I needed to solve for log normal parameters that yield a given mean and variance. I’m going to save the calculation here in case I needed in the future or in case a reader needs it. The derivation is simple, but in the heat of the moment I’d rather look it up and […] Find log normal parameters for given mean and variance first appeared on John D. Cook.  ( 1 min )
  • Open

    Learning Quantile Functions without Quantile Crossing for Distribution-free Time Series Forecasting. (arXiv:2111.06581v2 [cs.LG] UPDATED)
    Quantile regression is an effective technique to quantify uncertainty, fit challenging underlying distributions, and often provide full probabilistic predictions through joint learnings over multiple quantile levels. A common drawback of these joint quantile regressions, however, is \textit{quantile crossing}, which violates the desirable monotone property of the conditional quantile function. In this work, we propose the Incremental (Spline) Quantile Functions I(S)QF, a flexible and efficient distribution-free quantile estimation framework that resolves quantile crossing with a simple neural network layer. Moreover, I(S)QF inter/extrapolate to predict arbitrary quantile levels that differ from the underlying training ones. Equipped with the analytical evaluation of the continuous ranked probability score of I(S)QF representations, we apply our methods to NN-based times series forecasting cases, where the savings of the expensive re-training costs for non-trained quantile levels is particularly significant. We also provide a generalization error analysis of our proposed approaches under the sequence-to-sequence setting. Lastly, extensive experiments demonstrate the improvement of consistency and accuracy errors over other baselines.  ( 2 min )
    Couple Learning: Mean Teacher with PLG model improves the results of sound event detection. (arXiv:2110.05809v2 [cs.LG] UPDATED)
    The recently proposed Mean Teacher method, which exploits large-scale unlabeled data in a self-ensembling manner, has achieved state-of-the-art results in several semi-supervised learning benchmarks. Spurred by current achievements, this paper proposes an effective Couple Learning method that combines a well-trained model and a Mean Teacher model. The suggested pseudo-labels generated model (PLG) increases strongly- and weakly-labeled data to improve the Mean Teacher method-s performance. Moreover, the Mean Teacher-s consistency cost reduces the noise impact in the pseudo-labels introduced by detection errors. The experimental results on Task 4 of the DCASE2020 challenge demonstrate the superiority of the proposed method, achieving about 44.25% F1-score on the public evaluation set, significantly outperforming the baseline system-s 32.39%. At the same time, we also propose a simple and effective experiment called the Variable Order Input (VOI) experiment, which proves the significance of the Couple Learning method. Our developed Couple Learning code is available on GitHub.  ( 2 min )
    Using Deep Reinforcement Learning with Automatic Curriculum earning for Mapless Navigation in Intralogistics. (arXiv:2202.11512v1 [cs.RO])
    We propose a deep reinforcement learning approach for solving a mapless navigation problem in warehouse scenarios. The automatic guided vehicle is equipped with LiDAR and frontal RGB sensors and learns to reach underneath the target dolly. The challenges reside in the sparseness of positive samples for learning, multi-modal sensor perception with partial observability, the demand for accurate steering maneuvers together with long training cycles. To address these points, we proposed NavACL-Q as an automatic curriculum learning together with distributed soft actor-critic. The performance of the learning algorithm is evaluated exhaustively in a different warehouse environment to check both robustness and generalizability of the learned policy. Results in NVIDIA Isaac Sim demonstrates that our trained agent significantly outperforms the map-based navigation pipeline provided by NVIDIA Isaac Sim in terms of higher agent-goal distances and relative orientations. The ablation studies also confirmed that NavACL-Q greatly facilitates the whole learning process and a pre-trained feature extractor manifestly boosts the training speed.
    Towards Tailored Models on Private AIoT Devices: Federated Direct Neural Architecture Search. (arXiv:2202.11490v1 [cs.LG])
    Neural networks often encounter various stringent resource constraints while deploying on edge devices. To tackle these problems with less human efforts, automated machine learning becomes popular in finding various neural architectures that fit diverse Artificial Intelligence of Things (AIoT) scenarios. Recently, to prevent the leakage of private information while enable automated machine intelligence, there is an emerging trend to integrate federated learning and neural architecture search (NAS). Although promising as it may seem, the coupling of difficulties from both tenets makes the algorithm development quite challenging. In particular, how to efficiently search the optimal neural architecture directly from massive non-independent and identically distributed (non-IID) data among AIoT devices in a federated manner is a hard nut to crack. In this paper, to tackle this challenge, by leveraging the advances in ProxylessNAS, we propose a Federated Direct Neural Architecture Search (FDNAS) framework that allows for hardware-friendly NAS from non- IID data across devices. To further adapt to both various data distributions and different types of devices with heterogeneous embedded hardware platforms, inspired by meta-learning, a Cluster Federated Direct Neural Architecture Search (CFDNAS) framework is proposed to achieve device-aware NAS, in the sense that each device can learn a tailored deep learning model for its particular data distribution and hardware constraint. Extensive experiments on non-IID datasets have shown the state-of-the-art accuracy-efficiency trade-offs achieved by the proposed solution in the presence of both data and device heterogeneity.
    Out-of-distribution Generalization via Partial Feature Decorrelation. (arXiv:2007.15241v4 [cs.LG] UPDATED)
    Most deep-learning-based image classification methods assume that all samples are generated under an independent and identically distributed (IID) setting. However, out-of-distribution (OOD) generalization is more common in practice, which means an agnostic context distribution shift between training and testing environments. To address this problem, we present a novel Partial Feature Decorrelation Learning (PFDL) algorithm, which jointly optimizes a feature decomposition network and the target image classification model. The feature decomposition network decomposes feature embeddings into the independent and the correlated parts such that the correlations between features will be highlighted. Then, the correlated features help learn a stable feature representation by decorrelating the highlighted correlations while optimizing the image classification model. We verify the correlation modeling ability of the feature decomposition network on a synthetic dataset. The experiments on real-world datasets demonstrate that our method can improve the backbone model's accuracy on OOD image classification datasets.
    Online Coreset Selection for Rehearsal-based Continual Learning. (arXiv:2106.01085v2 [cs.LG] UPDATED)
    A dataset is a shred of crucial evidence to describe a task. However, each data point in the dataset does not have the same potential, as some of the data points can be more representative or informative than others. This unequal importance among the data points may have a large impact in rehearsal-based continual learning, where we store a subset of the training examples (coreset) to be replayed later to alleviate catastrophic forgetting. In continual learning, the quality of the samples stored in the coreset directly affects the model's effectiveness and efficiency. The coreset selection problem becomes even more important under realistic settings, such as imbalanced continual learning or noisy data scenarios. To tackle this problem, we propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration and trains them in an online manner. Our proposed method maximizes the model's adaptation to a current dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting. We validate the effectiveness of our coreset selection mechanism over various standard, imbalanced, and noisy datasets against strong continual learning baselines, demonstrating that it improves task adaptation and prevents catastrophic forgetting in a sample-efficient manner.
    A Class of Geometric Structures in Transfer Learning: Minimax Bounds and Optimality. (arXiv:2202.11685v1 [cs.LG])
    We study the problem of transfer learning, observing that previous efforts to understand its information-theoretic limits do not fully exploit the geometric structure of the source and target domains. In contrast, our study first illustrates the benefits of incorporating a natural geometric structure within a linear regression model, which corresponds to the generalized eigenvalue problem formed by the Gram matrices of both domains. We next establish a finite-sample minimax lower bound, propose a refined model interpolation estimator that enjoys a matching upper bound, and then extend our framework to multiple source domains and generalized linear models. Surprisingly, as long as information is available on the distance between the source and target parameters, negative-transfer does not occur. Simulation studies show that our proposed interpolation estimator outperforms state-of-the-art transfer learning methods in both moderate- and high-dimensional settings.
    Diffractive optical system design by cascaded propagation. (arXiv:2202.11535v1 [physics.optics])
    Modern design of complex optical systems relies heavily on computational tools. These typically utilize geometrical optics as well as Fourier optics, which enables the use of diffractive elements to manipulate light with features on the scale of a wavelength. Fourier optics is typically used for designing thin elements, placed in the system's aperture, generating a shift-invariant Point Spread Function (PSF). A major bottleneck in applying Fourier Optics in many cases of interest, e.g. when dealing with multiple, or out-of-aperture elements, comes from numerical complexity. In this work, we propose and implement an efficient and differentiable propagation model based on the Collins integral, which enables the optimization of diffraction optical systems with unprecedented design freedom using backpropagation. We demonstrate the applicability of our method, numerically and experimentally, by engineering shift-variant PSFs via thin plate elements placed in arbitrary planes inside complex imaging systems, performing cascaded optimization of multiple planes, and designing optimal machine-vision systems by deep learning.
    Small-Text: Active Learning for Text Classification in Python. (arXiv:2107.10314v2 [cs.LG] UPDATED)
    We present small-text, a simple and modular active learning library, which offers pool-based active learning for single- and multi-label text classification in Python. It comes with various pre-implemented state-of-the-art query strategies, including some that can leverage the GPU. Clearly defined interfaces allow the combination of a multitude of classifiers, query strategies, and stopping criteria, thereby facilitating a quick mix and match, and enabling a rapid development of both active learning experiments and applications. To make various classifiers accessible in a consistent way, it integrates several well-known existing machine learning libraries, namely, scikit-learn, PyTorch, and huggingface transformers, where the latter integrations are available as optionally installable extensions, making the availability of a GPU competely optional. The library is available under the MIT License at https://github.com/webis-de/small-text.
    Towards Speaker Age Estimation with Label Distribution Learning. (arXiv:2202.11424v1 [cs.SD])
    Existing methods for speaker age estimation usually treat it as a multi-class classification or a regression problem. However, precise age identification remains a challenge due to label ambiguity, \emph{i.e.}, utterances from adjacent age of the same person are often indistinguishable. To address this, we utilize the ambiguous information among the age labels, convert each age label into a discrete label distribution and leverage the label distribution learning (LDL) method to fit the data. For each audio data sample, our method produces a age distribution of its speaker, and on top of the distribution we also perform two other tasks: age prediction and age uncertainty minimization. Therefore, our method naturally combines the age classification and regression approaches, which enhances the robustness of our method. We conduct experiments on the public NIST SRE08-10 dataset and a real-world dataset, which exhibit that our method outperforms baseline methods by a relatively large margin, yielding a 10\% reduction in terms of mean absolute error (MAE) on a real-world dataset.
    Brain Structural Saliency Over The Ages. (arXiv:2202.11690v1 [q-bio.NC])
    Brain Age (BA) estimation via Deep Learning has become a strong and reliable bio-marker for brain health, but the black-box nature of Neural Networks does not easily allow insight into the causal features of brain ageing. We trained a ResNet model as a BA regressor on T1 structural MRI volumes from a small cross-sectional cohort of 524 individuals. Using Layer-wise Relevance Propagation (LRP) and DeepLIFT saliency mapping techniques, we analysed the trained model to determine the most revealing structures over the course of brain ageing for the network, and compare these between the saliency mapping techniques. We show the change in attribution of relevance to different brain regions through the course of ageing. A tripartite pattern of relevance attribution to brain regions emerges. Some regions increase in relevance with age (e.g. the right Transverse Temporal Gyrus, known to be affected by healthy ageing); some decrease in relevance with age (e.g. the right Fourth Ventricle, known to dilate with age); and others remained consistently relevant across ages. We also examine the effect of Brain Age Delta (DBA) on the distribution of relevance within the brain volume. It is hoped that these findings will provide clinically relevant region-wise trajectories for normal brain ageing, and a baseline against which to compare brain ageing trajectories.
    Deep Learning Reproducibility and Explainable AI (XAI). (arXiv:2202.11452v1 [cs.LG])
    The nondeterminism of Deep Learning (DL) training algorithms and its influence on the explainability of neural network (NN) models are investigated in this work with the help of image classification examples. To discuss the issue, two convolutional neural networks (CNN) have been trained and their results compared. The comparison serves the exploration of the feasibility of creating deterministic, robust DL models and deterministic explainable artificial intelligence (XAI) in practice. Successes and limitation of all here carried out efforts are described in detail. The source code of the attained deterministic models has been listed in this work. Reproducibility is indexed as a development-phase-component of the Model Governance Framework, proposed by the EU within their excellence in AI approach. Furthermore, reproducibility is a requirement for establishing causality for the interpretation of model results and building of trust towards the overwhelming expansion of AI systems applications. Problems that have to be solved on the way to reproducibility and ways to deal with some of them, are examined in this work.
    A Dimensionality Reduction Method for Finding Least Favorable Priors with a Focus on Bregman Divergence. (arXiv:2202.11598v1 [stat.ML])
    A common way of characterizing minimax estimators in point estimation is by moving the problem into the Bayesian estimation domain and finding a least favorable prior distribution. The Bayesian estimator induced by a least favorable prior, under mild conditions, is then known to be minimax. However, finding least favorable distributions can be challenging due to inherent optimization over the space of probability distributions, which is infinite-dimensional. This paper develops a dimensionality reduction method that allows us to move the optimization to a finite-dimensional setting with an explicit bound on the dimension. The benefit of this dimensionality reduction is that it permits the use of popular algorithms such as projected gradient ascent to find least favorable priors. Throughout the paper, in order to make progress on the problem, we restrict ourselves to Bayesian risks induced by a relatively large class of loss functions, namely Bregman divergences.
    Robust Hierarchical Patterns for identifying MDD patients: A Multisite Study. (arXiv:2202.11144v1 [q-bio.QM])
    Many supervised machine learning frameworks have been proposed for disease classification using functional magnetic resonance imaging (fMRI) data, producing important biomarkers. More recently, data pooling has flourished, making the result generalizable across a large population. But, this success depends on the population diversity and variability introduced due to the pooling of the data that is not a primary research interest. Here, we look at hierarchical Sparse Connectivity Patterns (hSCPs) as biomarkers for major depressive disorder (MDD). We propose a novel model based on hSCPs to predict MDD patients from functional connectivity matrices extracted from resting-state fMRI data. Our model consists of three coupled terms. The first term decomposes connectivity matrices into hierarchical low-rank sparse components corresponding to synchronous patterns across the human brain. These components are then combined via patient-specific weights capturing heterogeneity in the data. The second term is a classification loss that uses the patient-specific weights to classify MDD patients from healthy ones. Both of these terms are combined with the third term, a robustness loss function to improve the reproducibility of hSCPs. This reduces the variability introduced due to site and population diversity (age and sex) on the predictive accuracy and pattern stability in a large dataset pooled from five different sites. Our results show the impact of diversity on prediction performance. Our model can reduce diversity and improve the predictive and generalizing capability of the components. Finally, our results show that our proposed model can robustly identify clinically relevant patterns characteristic of MDD with high reproducibility.
    Reward-Weighted Regression Converges to a Global Optimum. (arXiv:2107.09088v3 [stat.ML] UPDATED)
    Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.
    Sparse Orthogonal Variational Inference for Gaussian Processes. (arXiv:1910.10596v4 [stat.ML] UPDATED)
    We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one spanned by a finite basis of inducing points and the other capturing the remaining variation. We show that this formulation recovers existing approximations and at the same time allows to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes and report state-of-the-art results on CIFAR-10 among purely GP-based models.
    A spectral-based analysis of the separation between two-layer neural networks and linear methods. (arXiv:2108.04964v2 [stat.ML] UPDATED)
    We propose a spectral-based approach to analyze how two-layer neural networks separate from linear methods in terms of approximating high-dimensional functions. We show that quantifying this separation can be reduced to estimating the Kolmogorov width of two-layer neural networks, and the latter can be further characterized by using the spectrum of an associated kernel. Different from previous work, our approach allows obtaining upper bounds, lower bounds, and identifying explicit hard functions in a united manner. We provide a systematic study of how the choice of activation functions affects the separation, in particular the dependence on the input dimension. Specifically, for nonsmooth activation functions, we extend known results to more activation functions with sharper bounds. As concrete examples, we prove that any single neuron can instantiate the separation between neural networks and random feature models. For smooth activation functions, one surprising finding is that the separation is negligible unless the norms of inner-layer weights are polynomially large with respect to the input dimension. By contrast, the separation for nonsmooth activation functions is independent of the norms of inner-layer weights.
    Contextual Combinatorial Conservative Bandits. (arXiv:1911.11337v2 [cs.LG] UPDATED)
    The problem of multi-armed bandits (MAB) asks to make sequential decisions while balancing between exploitation and exploration, and have been successfully applied to a wide range of practical scenarios. Various algorithms have been designed to achieve a high reward in a long term. However, its short-term performance might be rather low, which is injurious in risk sensitive applications. Building on previous work of conservative bandits, we bring up a framework of contextual combinatorial conservative bandits. An algorithm is presented and a regret bound of $\tilde O(d^2+d\sqrt{T})$ is proven, where $d$ is the dimension of the feature vectors, and $T$ is the total number of time steps. We further provide an algorithm as well as regret analysis for the case when the conservative reward is unknown. Experiments are conducted, and the results validate the effectiveness of our algorithm.
    Social Learning in Non-Stationary Environments. (arXiv:2007.09996v3 [math.OC] UPDATED)
    Potential buyers of a product or service, before making their decisions, tend to read reviews written by previous consumers. We consider Bayesian consumers with heterogeneous preferences, who sequentially decide whether to buy an item of unknown quality, based on previous buyers' reviews. The quality is multi-dimensional and may occasionally vary over time; the reviews are also multi-dimensional. In the simple uni-dimensional and static setting, beliefs about the quality are known to converge to its true value. Our paper extends this result in several ways. First, a multi-dimensional quality is considered, second, rates of convergence are provided, third, a dynamical Markovian model with varying quality is studied. In this dynamical setting the cost of learning is shown to be small.
    Are You Smarter Than a Random Expert? The Robust Aggregation of Substitutable Signals. (arXiv:2111.03153v2 [cs.GT] UPDATED)
    The problem of aggregating expert forecasts is ubiquitous in fields as wide-ranging as machine learning, economics, climate science, and national security. Despite this, our theoretical understanding of this question is fairly shallow. This paper initiates the study of forecast aggregation in a context where experts' knowledge is chosen adversarially from a broad class of information structures. While in full generality it is impossible to achieve a nontrivial performance guarantee, we show that doing so is possible under a condition on the experts' information structure that we call \emph{projective substitutes}. The projective substitutes condition is a notion of informational substitutes: that there are diminishing marginal returns to learning the experts' signals. We show that under the projective substitutes condition, taking the average of the experts' forecasts improves substantially upon the strategy of trusting a random expert. We then consider a more permissive setting, in which the aggregator has access to the prior. We show that by averaging the experts' forecasts and then \emph{extremizing} the average by moving it away from the prior by a constant factor, the aggregator's performance guarantee is substantially better than is possible without knowledge of the prior. Our results give a theoretical grounding to past empirical research on extremization and help give guidance on the appropriate amount to extremize.
    Better Modelling Out-of-Distribution Regression on Distributed Acoustic Sensor Data Using Anchored Hidden State Mixup. (arXiv:2202.11283v1 [cs.LG])
    Generalizing the application of machine learning models to situations where the statistical distribution of training and test data are different has been a complex problem. Our contributions in this paper are threefold: (1) we introduce an anchored-based Out of Distribution (OOD) Regression Mixup algorithm, leveraging manifold hidden state mixup and observation similarities to form a novel regularization penalty, (2) we provide a first of its kind, high resolution Distributed Acoustic Sensor (DAS) dataset that is suitable for testing OOD regression modelling, allowing other researchers to benchmark progress in this area, and (3) we demonstrate with an extensive evaluation the generalization performance of the proposed method against existing approaches, then show that our method achieves state-of-the-art performance. Lastly, we also demonstrate a wider applicability of the proposed method by exhibiting improved generalization performances on other types of regression datasets, including Udacity and Rotation-MNIST datasets.
    RGB Image Classification with Quantum Convolutional Ansaetze. (arXiv:2107.11099v2 [quant-ph] UPDATED)
    With the rapid growth of qubit numbers and coherence times in quantum hardware technology, implementing shallow neural networks on the so-called Noisy Intermediate-Scale Quantum (NISQ) devices has attracted a lot of interest. Many quantum (convolutional) circuit ansaetze are proposed for grayscale images classification tasks with promising empirical results. However, when applying these ansaetze on RGB images, the intra-channel information that is useful for vision tasks is not extracted effectively. In this paper, we propose two types of quantum circuit ansaetze to simulate convolution operations on RGB images, which differ in the way how inter-channel and intra-channel information are extracted. To the best of our knowledge, this is the first work of a quantum convolutional circuit to deal with RGB images effectively, with a higher test accuracy compared to the purely classical CNNs. We also investigate the relationship between the size of quantum circuit ansatz and the learnability of the hybrid quantum-classical convolutional neural network. Through experiments based on CIFAR-10 and MNIST datasets, we demonstrate that a larger size of the quantum circuit ansatz improves predictive performance in multiclass classification tasks, providing useful insights for near term quantum algorithm developments.
    Efficient CDF Approximations for Normalizing Flows. (arXiv:2202.11322v1 [cs.LG])
    Normalizing flows model a complex target distribution in terms of a bijective transform operating on a simple base distribution. As such, they enable tractable computation of a number of important statistical quantities, particularly likelihoods and samples. Despite these appealing properties, the computation of more complex inference tasks, such as the cumulative distribution function (CDF) over a complex region (e.g., a polytope) remains challenging. Traditional CDF approximations using Monte-Carlo techniques are unbiased but have unbounded variance and low sample efficiency. Instead, we build upon the diffeomorphic properties of normalizing flows and leverage the divergence theorem to estimate the CDF over a closed region in target space in terms of the flux across its \emph{boundary}, as induced by the normalizing flow. We describe both deterministic and stochastic instances of this estimator: while the deterministic variant iteratively improves the estimate by strategically subdividing the boundary, the stochastic variant provides unbiased estimates. Our experiments on popular flow architectures and UCI benchmark datasets show a marked improvement in sample efficiency as compared to traditional estimators.
    Training Adaptive Reconstruction Networks for Inverse Problems. (arXiv:2202.11342v1 [cs.LG])
    Neural networks are full of promises for the resolution of ill-posed inverse problems. In particular, physics informed learning approaches already seem to progressively gradually replace carefully hand-crafted reconstruction algorithms, for their superior quality. The aim of this paper is twofold. First we show a significant weakness of these networks: they do not adapt efficiently to variations of the forward model. Second, we show that training the network with a family of forward operators allows to solve the adaptivity problem without compromising the reconstruction quality significantly. All our experiments are carefully devised on partial Fourier sampling problems arising in magnetic resonance imaging (MRI).
    Causally motivated Shortcut Removal Using Auxiliary Labels. (arXiv:2105.06422v3 [cs.LG] UPDATED)
    Shortcut learning, in which models make use of easy-to-represent but unstable associations, is a major failure mode for robust machine learning. We study a flexible, causally-motivated approach to training robust predictors by discouraging the use of specific shortcuts, focusing on a common setting where a robust predictor could achieve optimal \emph{iid} generalization in principle, but is overshadowed by a shortcut predictor in practice. Our approach uses auxiliary labels, typically available at training time, to enforce conditional independences implied by the causal graph. We show both theoretically and empirically that causally-motivated regularization schemes (a) lead to more robust estimators that generalize well under distribution shift, and (b) have better finite sample efficiency compared to usual regularization schemes, even when no shortcut is present. Our analysis highlights important theoretical properties of training techniques commonly used in the causal inference, fairness, and disentanglement literatures. Our code is available at https://github.com/mymakar/causally_motivated_shortcut_removal
    Revisiting consistency for semi-supervised semantic segmentation. (arXiv:2106.07075v3 [cs.CV] UPDATED)
    Semi-supervised learning is especially interesting in the dense prediction context due to high cost of pixel-level ground truth. Unfortunately, most such approaches are evaluated on outdated architectures which hamper research due to very slow training and high requirements on GPU RAM. We address this concern by presenting a simple and effective baseline which works very well both on standard and efficient architectures. Our baseline is based on one-way consistency and non-linear geometric and photometric perturbations. We show advantage of perturbing only the student branch and present a plausible explanation of such behaviour. Experiments on Cityscapes and CIFAR-10 demonstrate competitive performance with respect to prior work.
    Amortised Likelihood-free Inference for Expensive Time-series Simulators with Signatured Ratio Estimation. (arXiv:2202.11585v1 [stat.ML])
    Simulation models of complex dynamics in the natural and social sciences commonly lack a tractable likelihood function, rendering traditional likelihood-based statistical inference impossible. Recent advances in machine learning have introduced novel algorithms for estimating otherwise intractable likelihood functions using a likelihood ratio trick based on binary classifiers. Consequently, efficient likelihood approximations can be obtained whenever good probabilistic classifiers can be constructed. We propose a kernel classifier for sequential data using path signatures based on the recently introduced signature kernel. We demonstrate that the representative power of signatures yields a highly performant classifier, even in the crucially important case where sample numbers are low. In such scenarios, our approach can outperform sophisticated neural networks for common posterior inference tasks.
    A Framework to Learn with Interpretation. (arXiv:2010.09345v4 [cs.LG] UPDATED)
    To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive model in terms of human-understandable high level attribute functions, with minimal loss of accuracy. This is achieved by a dedicated architecture and well chosen regularization penalties. We seek for a small-size dictionary of high level attribute functions that take as inputs the outputs of selected hidden layers and whose outputs feed a linear classifier. We impose strong conciseness on the activation of attributes with an entropy-based criterion while enforcing fidelity to both inputs and outputs of the predictive model. A detailed pipeline to visualize the learnt features is also developed. Moreover, besides generating interpretable models by design, our approach can be specialized to provide post-hoc interpretations for a pre-trained neural network. We validate our approach against several state-of-the-art methods on multiple datasets and show its efficacy on both kinds of tasks.
    Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets. (arXiv:2011.01154v2 [cs.CL] UPDATED)
    The availability of different pre-trained semantic models enabled the quick development of machine learning components for downstream applications. Despite the availability of abundant text data for low resource languages, only a few semantic models are publicly available. Publicly available pre-trained models are usually built as a multilingual version of semantic models that can not fit well for each language due to context variations. In this work, we introduce different semantic models for Amharic. After we experiment with the existing pre-trained semantic models, we trained and fine-tuned nine new different models using a monolingual text corpus. The models are build using word2Vec embeddings, distributional thesaurus (DT), contextual embeddings, and DT embeddings obtained via network embedding algorithms. Moreover, we employ these models for different NLP tasks and investigate their impact. We find that newly trained models perform better than pre-trained multilingual models. Furthermore, models based on contextual embeddings from RoBERTA perform better than the word2Vec models.
    MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving on Serverless Computing Platforms. (arXiv:2202.11243v1 [cs.DC])
    Serving machine learning inference workloads on the cloud is still a challenging task on the production level. Optimal configuration of the inference workload to meet SLA requirements while optimizing the infrastructure costs is highly complicated due to the complex interaction between batch configuration, resource configurations, and variable arrival process. Serverless computing has emerged in recent years to automate most infrastructure management tasks. Workload batching has revealed the potential to improve the response time and cost-effectiveness of machine learning serving workloads. However, it has not yet been supported out of the box by serverless computing platforms. Our experiments have shown that for various machine learning workloads, batching can hugely improve the system's efficiency by reducing the processing overhead per request. In this work, we present MLProxy, an adaptive reverse proxy to support efficient machine learning serving workloads on serverless computing systems. MLProxy supports adaptive batching to ensure SLA compliance while optimizing serverless costs. We performed rigorous experiments on Knative to demonstrate the effectiveness of MLProxy. We showed that MLProxy could reduce the cost of serverless deployment by up to 92% while reducing SLA violations by up to 99% that can be generalized across state-of-the-art model serving frameworks.
    On Margins and Derandomisation in PAC-Bayes. (arXiv:2107.03955v3 [cs.LG] UPDATED)
    We give a general recipe for derandomising PAC-Bayesian bounds using margins, with the critical ingredient being that our randomised predictions concentrate around some value. The tools we develop straightforwardly lead to margin bounds for various classifiers, including linear prediction -- a class that includes boosting and the support vector machine -- single-hidden-layer neural networks with an unusual \(\erf\) activation function, and deep ReLU networks. Further, we extend to partially-derandomised predictors where only some of the randomness is removed, letting us extend bounds to cases where the concentration properties of our predictors are otherwise poor.
    Enabling arbitrary translation objectives with Adaptive Tree Search. (arXiv:2202.11444v1 [cs.CL])
    We introduce an adaptive tree search algorithm, that can find high-scoring outputs under translation models that make no assumptions about the form or structure of the search objective. This algorithm -- a deterministic variant of Monte Carlo tree search -- enables the exploration of new kinds of models that are unencumbered by constraints imposed to make decoding tractable, such as autoregressivity or conditional independence assumptions. When applied to autoregressive models, our algorithm has different biases than beam search has, which enables a new analysis of the role of decoding bias in autoregressive models. Empirically, we show that our adaptive tree search algorithm finds outputs with substantially better model scores compared to beam search in autoregressive models, and compared to reranking techniques in models whose scores do not decompose additively with respect to the words in the output. We also characterise the correlation of several translation model objectives with respect to BLEU. We find that while some standard models are poorly calibrated and benefit from the beam search bias, other often more robust models (autoregressive models tuned to maximize expected automatic metric scores, the noisy channel model and a newly proposed objective) benefit from increasing amounts of search using our proposed decoder, whereas the beam search bias limits the improvements obtained from such objectives. Thus, we argue that as models improve, the improvements may be masked by over-reliance on beam search or reranking based methods.
    TD-GEN: Graph Generation With Tree Decomposition. (arXiv:2106.10656v2 [cs.LG] UPDATED)
    We propose TD-GEN, a graph generation framework based on tree decomposition, and introduce a reduced upper bound on the maximum number of decisions needed for graph generation. The framework includes a permutation invariant tree generation model which forms the backbone of graph generation. Tree nodes are supernodes, each representing a cluster of nodes in the graph. Graph nodes and edges are incrementally generated inside the clusters by traversing the tree supernodes, respecting the structure of the tree decomposition, and following node sharing decisions between the clusters. Finally, we discuss the shortcomings of standard evaluation criteria based on statistical properties of the generated graphs as performance measures. We propose to compare the performance of models based on likelihood. Empirical results on a variety of standard graph generation datasets demonstrate the superior performance of our method.  ( 2 min )
    How Many Data Are Needed for Robust Learning?. (arXiv:2202.11592v1 [cs.LG])
    We show that the sample complexity of robust interpolation problem could be exponential in the input dimensionality and discover a phase transition phenomenon when the data are in a unit ball. Robust interpolation refers to the problem of interpolating $n$ noisy training data in $\R^d$ by a Lipschitz function. Although this problem has been well understood when the covariates are drawn from an isoperimetry distribution, much remains unknown concerning its performance under generic or even the worst-case distributions. Our results are two-fold: 1) too many data hurt robustness; we provide a tight and universal Lipschitzness lower bound $\Omega(n^{1/d})$ of the interpolating function for arbitrary data distributions. Our result disproves potential existence of an $\mathcal{O}(1)$-Lipschitz function in the overparametrization scenario when $n=\exp(\omega(d))$. 2) Small data hurt robustness: $n=\exp(\Omega(d))$ is necessary for obtaining a good population error under certain distributions by any $\mathcal{O}(1)$-Lipschitz learning algorithm. Perhaps surprisingly, our results shed light on the curse of big data and the blessing of dimensionality for robustness, and discover an intriguing phenomenon of phase transition at $n=\exp(\Theta(d))$.  ( 2 min )
    COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics. (arXiv:2202.11705v1 [cs.CL])
    Many applications of text generation require incorporating different constraints to control the semantics or style of generated text. These constraints can be hard (e.g., ensuring certain keywords are included in the output) and soft (e.g., contextualizing the output with the left- or right-hand context). In this paper, we present Energy-based Constrained Decoding with Langevin Dynamics (COLD), a decoding framework which unifies constrained generation as specifying constraints through an energy function, then performing efficient differentiable reasoning over the constraints through gradient-based sampling. COLD decoding is a flexible framework that can be applied directly to off-the-shelf left-to-right language models without the need for any task-specific fine-tuning, as demonstrated through three challenging text generation applications: lexically-constrained generation, abductive reasoning, and counterfactual reasoning. Our experiments on these constrained generation tasks point to the effectiveness of our approach, both in terms of automatic and human evaluation.  ( 2 min )
    Energy-efficient Training of Distributed DNNs in the Mobile-edge-cloud Continuum. (arXiv:2202.11349v1 [cs.NI])
    We address distributed machine learning in multi-tier (e.g., mobile-edge-cloud) networks where a heterogeneous set of nodes cooperate to perform a learning task. Due to the presence of multiple data sources and computation-capable nodes, a learning controller (e.g., located in the edge) has to make decisions about (i) which distributed ML model structure to select, (ii) which data should be used for the ML model training, and (iii) which resources should be allocated to it. Since these decisions deeply influence one another, they should be made jointly. In this paper, we envision a new approach to distributed learning in multi-tier networks, which aims at maximizing ML efficiency. To this end, we propose a solution concept, called RightTrain, that achieves energy-efficient ML model training, while fulfilling learning time and quality requirements. RightTrain makes high-quality decisions in polynomial time. Further, our performance evaluation shows that RightTrain closely matches the optimum and outperforms the state of the art by over 50%.  ( 2 min )
    Complexity of Stochastic Dual Dynamic Programming. (arXiv:1912.07702v8 [math.OC] UPDATED)
    Stochastic dual dynamic programming is a cutting plane type algorithm for multi-stage stochastic optimization originated about 30 years ago. In spite of its popularity in practice, there does not exist any analysis on the convergence rates of this method. In this paper, we first establish the number of iterations, i.e., iteration complexity, required by a basic dynamic cutting plane method for solving relatively simple multi-stage optimization problems, by introducing novel mathematical tools including the saturation of search points. We then refine these basic tools and establish the iteration complexity for both deterministic and stochastic dual dynamic programming methods for solving more general multi-stage stochastic optimization problems under the standard stage-wise independence assumption. Our results indicate that the complexity of some deterministic variants of these methods mildly increases with the number of stages $T$, in fact linearly dependent on $T$ for discounted problems. Therefore, they are efficient for strategic decision making which involves a large number of stages, but with a relatively small number of decision variables in each stage. Without explicitly discretizing the state and action spaces, these methods might also be pertinent to the related reinforcement learning and stochastic control areas.  ( 2 min )
    Non-Volatile Memory Accelerated Geometric Multi-Scale Resolution Analysis. (arXiv:2202.11518v1 [cs.LG])
    Dimensionality reduction algorithms are standard tools in a researcher's toolbox. Dimensionality reduction algorithms are frequently used to augment downstream tasks such as machine learning, data science, and also are exploratory methods for understanding complex phenomena. For instance, dimensionality reduction is commonly used in Biology as well as Neuroscience to understand data collected from biological subjects. However, dimensionality reduction techniques are limited by the von-Neumann architectures that they execute on. Specifically, data intensive algorithms such as dimensionality reduction techniques often require fast, high capacity, persistent memory which historically hardware has been unable to provide at the same time. In this paper, we present a re-implementation of an existing dimensionality reduction technique called Geometric Multi-Scale Resolution Analysis (GMRA) which has been accelerated via novel persistent memory technology called Memory Centric Active Storage (MCAS). Our implementation uses a specialized version of MCAS called PyMM that provides native support for Python datatypes including NumPy arrays and PyTorch tensors. We compare our PyMM implementation against a DRAM implementation, and show that when data fits in DRAM, PyMM offers competitive runtimes. When data does not fit in DRAM, our PyMM implementation is still able to process the data.  ( 2 min )
    Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery. (arXiv:1908.01842v5 [cs.LG] UPDATED)
    Real world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of H\"{o}lder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a H\"{o}lder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, with sub-gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha) + d}}\log^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures of data, and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.
    Performative Prediction in a Stateful World. (arXiv:2011.03885v3 [cs.LG] UPDATED)
    Deployed supervised machine learning models make predictions that interact with and influence the world. This phenomenon is called performative prediction by Perdomo et al. (ICML 2020). It is an ongoing challenge to understand the influence of such predictions as well as design tools so as to control that influence. We propose a theoretical framework where the response of a target population to the deployed classifier is modeled as a function of the classifier and the current state (distribution) of the population. We show necessary and sufficient conditions for convergence to an equilibrium of two retraining algorithms, repeated risk minimization and a lazier variant. Furthermore, convergence is near an optimal classifier. We thus generalize results of Perdomo et al., whose performativity framework does not assume any dependence on the state of the target population. A particular phenomenon captured by our model is that of distinct groups that acquire information and resources at different rates to be able to respond to the latest deployed classifier. We study this phenomenon theoretically and empirically.
    Hybrid Learning for Orchestrating Deep Learning Inference in Multi-user Edge-cloud Networks. (arXiv:2202.11098v1 [cs.LG])
    Deep-learning-based intelligent services have become prevalent in cyber-physical applications including smart cities and health-care. Collaborative end-edge-cloud computing for deep learning provides a range of performance and efficiency that can address application requirements through computation offloading. The decision to offload computation is a communication-computation co-optimization problem that varies with both system parameters (e.g., network condition) and workload characteristics (e.g., inputs). Identifying optimal orchestration considering the cross-layer opportunities and requirements in the face of varying system dynamics is a challenging multi-dimensional problem. While Reinforcement Learning (RL) approaches have been proposed earlier, they suffer from a large number of trial-and-errors during the learning process resulting in excessive time and resource consumption. We present a Hybrid Learning orchestration framework that reduces the number of interactions with the system environment by combining model-based and model-free reinforcement learning. Our Deep Learning inference orchestration strategy employs reinforcement learning to find the optimal orchestration policy. Furthermore, we deploy Hybrid Learning (HL) to accelerate the RL learning process and reduce the number of direct samplings. We demonstrate efficacy of our HL strategy through experimental comparison with state-of-the-art RL-based inference orchestration, demonstrating that our HL strategy accelerates the learning process by up to 166.6x.
    MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset. (arXiv:2202.11684v1 [cs.LG])
    Misinformation is becoming increasingly prevalent on social media and in news articles. It has become so widespread that we require algorithmic assistance utilising machine learning to detect such content. Training these machine learning models require datasets of sufficient scale, diversity and quality. However, datasets in the field of automatic misinformation detection are predominantly monolingual, include a limited amount of modalities and are not of sufficient scale and quality. Addressing this, we develop a data collection and linking system (MuMiN-trawl), to build a public misinformation graph dataset (MuMiN), containing rich social media data (tweets, replies, users, images, articles, hashtags) spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade. The dataset is made available as a heterogeneous graph via a Python package (mumin). We provide baseline results for two node classification tasks related to the veracity of a claim involving social media, and demonstrate that these are challenging tasks, with the highest macro-average F1-score being 62.55% and 61.45% for the two tasks, respectively. The MuMiN ecosystem is available at https://mumin-dataset.github.io/, including the data, documentation, tutorials and leaderboards.
    GAGE: Geometry Preserving Attributed Graph Embeddings. (arXiv:2011.01422v2 [cs.SI] UPDATED)
    Node embedding is the task of extracting concise and informative representations of certain entities that are connected in a network. Various real-world networks include information about both node connectivity and certain node attributes, in the form of features or time-series data. Modern representation learning techniques employ both the connectivity and attribute information of the nodes to produce embeddings in an unsupervised manner. In this context, deriving embeddings that preserve the geometry of the network and the attribute vectors would be highly desirable, as they would reflect both the topological neighborhood structure and proximity in feature space. While this is fairly straightforward to maintain when only observing the connectivity or attribute information of the network, preserving the geometry of both types of information is challenging. A novel tensor factorization approach for node embedding in attributed networks is proposed in this paper, that preserves the distances of both the connections and the attributes. Furthermore, an effective and lightweight algorithm is developed to tackle the learning task and judicious experiments with multiple state-of-the-art baselines suggest that the proposed algorithm offers significant performance improvements in downstream tasks.
    Superpixel-based Knowledge Infusion in Deep Neural Networks for Image Classification. (arXiv:2105.09448v2 [cs.CV] UPDATED)
    Superpixels are higher-order perceptual groups of pixels in an image, often carrying much more information than the raw pixels. There is an inherent relational structure to the relationship among different superpixels of an image such as adjacent superpixels are neighbours of each other. Our interest here is to treat these relative positions of various superpixels as relational information of an image. This relational information can convey higher-order spatial information about the image, such as the relationship between superpixels representing two eyes in an image of a cat. That is, two eyes are placed adjacent to each other in a straight line or the mouth is below the nose. Our motive in this paper is to assist computer vision models, specifically those based on Deep Neural Networks (DNNs), by incorporating this higher-order information from superpixels. We construct a hybrid model that leverages (a) Convolutional Neural Network (CNN) to deal with spatial information in an image and (b) Graph Neural Network (GNN) to deal with relational superpixel information in the image. The proposed model is learned using a generic hybrid loss function. Our experiments are extensive, and we evaluate the predictive performance of our proposed hybrid vision model on seven different image classification datasets from a variety of domains such as digit and object recognition, biometrics, medical imaging. The results demonstrate that the relational superpixel information processed by a GNN can improve the performance of a standard CNN-based vision system.
    Laplacian Constrained Precision Matrix Estimation: Existence and High Dimensional Consistency. (arXiv:2111.00590v2 [stat.ML] UPDATED)
    This paper considers the problem of estimating high dimensional Laplacian constrained precision matrices by minimizing Stein's loss. We obtain a necessary and sufficient condition for existence of this estimator, that consists on checking whether a certain data dependent graph is connected. We also prove consistency in the high dimensional setting under the symmetrized Stein loss. We show that the error rate does not depend on the graph sparsity, or other type of structure, and that Laplacian constraints are sufficient for high dimensional consistency. Our proofs exploit properties of graph Laplacians, the matrix tree theorem, and a characterization of the proposed estimator based on effective graph resistances. We validate our theoretical claims with numerical experiments.
    Extension of Dynamic Mode Decomposition for dynamic systems with incomplete information based on t-model of optimal prediction. (arXiv:2202.11432v1 [math.NA])
    The Dynamic Mode Decomposition has proved to be a very efficient technique to study dynamic data. This is entirely a data-driven approach that extracts all necessary information from data snapshots which are commonly supposed to be sampled from measurement. The application of this approach becomes problematic if the available data is incomplete because some dimensions of smaller scale either missing or unmeasured. Such setting occurs very often in modeling complex dynamical systems such as power grids, in particular with reduced-order modeling. To take into account the effect of unresolved variables the optimal prediction approach based on the Mori-Zwanzig formalism can be applied to obtain the most expected prediction under existing uncertainties. This effectively leads to the development of a time-predictive model accounting for the impact of missing data. In the present paper we provide a detailed derivation of the considered method from the Liouville equation and finalize it with the optimization problem that defines the optimal transition operator corresponding to the observed data. In contrast to the existing approach, we consider a first-order approximation of the Mori-Zwanzig decomposition, state the corresponding optimization problem and solve it with the gradient-based optimization method. The gradient of the obtained objective function is computed precisely through the automatic differentiation technique. The numerical experiments illustrate that the considered approach gives practically the same dynamics as the exact Mori-Zwanzig decomposition, but is less computationally intensive.
    Bayesian Model Selection, the Marginal Likelihood, and Generalization. (arXiv:2202.11678v1 [cs.LG])
    How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.
    Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data. (arXiv:2006.10833v3 [cs.LG] UPDATED)
    On time-series data, most causal discovery methods fit a new model whenever they encounter samples from a new underlying causal graph. However, these samples often share relevant information which is lost when following this approach. Specifically, different samples may share the dynamics which describe the effects of their causal relations. We propose Amortized Causal Discovery, a novel framework that leverages such shared dynamics to learn to infer causal relations from time-series data. This enables us to train a single, amortized model that infers causal relations across samples with different underlying causal graphs, and thus leverages the shared dynamics information. We demonstrate experimentally that this approach, implemented as a variational model, leads to significant improvements in causal discovery performance, and show how it can be extended to perform well under added noise and hidden confounding.
    Fast Density Estimation for Density-based Clustering Methods. (arXiv:2109.11383v2 [cs.LG] UPDATED)
    Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning since they can deal with non-hyperspherical clusters and are robustness to handle outliers. However, the runtime of density-based algorithms are heavily dominated by finding fixed-radius near neighbors and calculating the density, which is time-consuming. Meanwhile, the traditional acceleration methods using indexing technique such as KD tree is not effective in processing high-dimensional data. In this paper, we propose a fast region query algorithm named fast principal component analysis pruning (called FPCAP) with the help of the fast principal component analysis technique in conjunction with geometric information provided by principal attributes of the data, which can process high-dimensional data and be easily applied to density-based methods to prune unnecessary distance calculations when finding neighbors and estimating densities. As an application in density-based clustering methods, FPCAP method was combined with the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. And then, an improved DBSCAN (called IDBSCAN) is obtained, which preserves the advantage of DBSCAN and meanwhile, greatly reduces the computation of redundant distances. Experiments on seven benchmark datasets demonstrate that the proposed algorithm improves the computational efficiency significantly.
    Time-varying Graph Learning Under Structured Temporal Priors. (arXiv:2110.05018v2 [cs.LG] UPDATED)
    This paper endeavors to learn time-varying graphs by using structured temporal priors that assume underlying relations between arbitrary two graphs in the graph sequence. Different from many existing chain structure based methods in which the priors like temporal homogeneity can only describe the variations of two consecutive graphs, we propose a structure named \emph{temporal graph} to characterize the underlying real temporal relations. Under this framework, the chain structure is actually a special case of our temporal graph. We further proposed Alternating Direction Method of Multipliers (ADMM), a distributed algorithm, to solve the induced optimization problem. Numerical experiments demonstrate the superiorities of our method.
    Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF. (arXiv:2202.11479v1 [cs.SD])
    This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a carefully regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision. We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
    FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning. (arXiv:2107.04271v3 [cs.DC] UPDATED)
    Applying Federated Learning (FL) on Internet-of-Things devices is necessitated by the large volumes of data they produce and growing concerns of data privacy. However, there are three challenges that need to be addressed to make FL efficient: (i) execution on devices with limited computational capabilities, (ii) accounting for stragglers due to computational heterogeneity of devices, and (iii) adaptation to the changing network bandwidths. This paper presents FedAdapt, an adaptive offloading FL framework to mitigate the aforementioned challenges. FedAdapt accelerates local training in computationally constrained devices by leveraging layer offloading of deep neural networks (DNNs) to servers. Further, FedAdapt adopts reinforcement learning based optimization and clustering to adaptively identify which layers of the DNN should be offloaded for each individual device on to a server to tackle the challenges of computational heterogeneity and changing network bandwidth. Experimental studies are carried out on a lab-based testbed and it is demonstrated that by offloading a DNN from the device to the server FedAdapt reduces the training time of a typical IoT device by over half compared to classic FL. The training time of extreme stragglers and the overall training time can be reduced by up to 57%. Furthermore, with changing network bandwidth, FedAdapt is demonstrated to reduce the training time by up to 40% when compared to classic FL, without sacrificing accuracy.  ( 2 min )
    Meta-learning based Alternating Minimization Algorithm for Non-convex Optimization. (arXiv:2009.04899v6 [cs.LG] UPDATED)
    In this paper, we propose a novel solution for non-convex problems of multiple variables, especially for those typically solved by an alternating minimization (AM) strategy that splits the original optimization problem into a set of sub-problems corresponding to each variable, and then iteratively optimize each sub-problem using a fixed updating rule. However, due to the intrinsic non-convexity of the original optimization problem, the optimization can usually be trapped into spurious local minimum even when each sub-problem can be optimally solved at each iteration. Meanwhile, learning-based approaches, such as deep unfolding algorithms, are highly limited by the lack of labelled data and restricted explainability. To tackle these issues, we propose a meta-learning based alternating minimization (MLAM) method, which aims to minimize a partial of the global losses over iterations instead of carrying minimization on each sub-problem, and it tends to learn an adaptive strategy to replace the handcrafted counterpart resulting in advance on superior performance. Meanwhile, the proposed MLAM still maintains the original algorithmic principle, which contributes to a better interpretability. We evaluate the proposed method on two representative problems, namely, bi-linear inverse problem: matrix completion, and non-linear problem: Gaussian mixture models. The experimental results validate that our proposed approach outperforms AM-based methods in standard settings, and is able to achieve effective optimization in challenging cases while other comparing methods would typically fail.
    FastSHAP: Real-Time Shapley Value Estimation. (arXiv:2107.07436v2 [stat.ML] UPDATED)
    Shapley values are widely used to explain black-box models, but they are costly to calculate because they require many model evaluations. We introduce FastSHAP, a method for estimating Shapley values in a single forward pass using a learned explainer model. FastSHAP amortizes the cost of explaining many inputs via a learning approach inspired by the Shapley value's weighted least squares characterization, and it can be trained using standard stochastic gradient optimization. We compare FastSHAP to existing estimation approaches, revealing that it generates high-quality explanations with orders of magnitude speedup.
    FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning. (arXiv:2111.11556v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.
    Weighted Programming. (arXiv:2202.07577v1 [cs.PL] CROSS LISTED)
    We study weighted programming, a programming paradigm for specifying mathematical models. More specifically, the weighted programs we investigate are like usual imperative programs with two additional features: (1) nondeterministic branching and (2) weighting execution traces. Weights can be numbers but also other objects like words from an alphabet, polynomials, formal power series, or cardinal numbers. We argue that weighted programming as a paradigm can be used to specify mathematical models beyond probability distributions (as is done in probabilistic programming). We develop weakest-precondition- and weakest-liberal-precondition-style calculi \`{a} la Dijkstra for reasoning about mathematical models specified by weighted programs. We present several case studies. For instance, we use weighted programming to model the ski rental problem - an optimization problem. We model not only the optimization problem itself, but also the best deterministic online algorithm for solving this problem as weighted programs. By means of weakest-precondition-style reasoning, we can determine the competitive ratio of the online algorithm on source code level.
    Wide Mean-Field Bayesian Neural Networks Ignore the Data. (arXiv:2202.11670v1 [cs.LG])
    Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observered in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.
    Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis. (arXiv:2103.03568v4 [cs.LG] UPDATED)
    Pretext-based self-supervised learning learns the semantic representation via a handcrafted pretext task over unlabeled data and then uses the learned representation for downstream tasks, which effectively reduces the sample complexity of downstream tasks under Conditional Independence (CI) condition. However, the downstream sample complexity gets much worse if the CI condition does not hold. One interesting question is whether we can make the CI condition hold by using downstream data to refine the unlabeled data to boost self-supervised learning. At first glance, one might think that seeing downstream data in advance would always boost the downstream performance. However, we show that it is not intuitively true and point out that in some cases, it hurts the final performance instead. In particular, we prove both model-free and model-dependent lower bounds of the number of downstream samples used for data refinement. Moreover, we conduct various experiments on both synthetic and real-world datasets to verify our theoretical results.
    Label-Smoothed Backdoor Attack. (arXiv:2202.11203v1 [cs.CR])
    By injecting a small number of poisoned samples into the training set, backdoor attacks aim to make the victim model produce designed outputs on any input injected with pre-designed backdoors. In order to achieve a high attack success rate using as few poisoned training samples as possible, most existing attack methods change the labels of the poisoned samples to the target class. This practice often results in severe over-fitting of the victim model over the backdoors, making the attack quite effective in output control but easier to be identified by human inspection or automatic defense algorithms. In this work, we proposed a label-smoothing strategy to overcome the over-fitting problem of these attack methods, obtaining a \textit{Label-Smoothed Backdoor Attack} (LSBA). In the LSBA, the label of the poisoned sample $\bm{x}$ will be changed to the target class with a probability of $p_n(\bm{x})$ instead of 100\%, and the value of $p_n(\bm{x})$ is specifically designed to make the prediction probability the target class be only slightly greater than those of the other classes. Empirical studies on several existing backdoor attacks show that our strategy can considerably improve the stealthiness of these attacks and, at the same time, achieve a high attack success rate. In addition, our strategy makes it able to manually control the prediction probability of the design output through manipulating the applied and activated number of LSBAs\footnote{Source code will be published at \url{https://github.com/v-mipeng/LabelSmoothedAttack.git}}.
    Neural Generalised AutoRegressive Conditional Heteroskedasticity. (arXiv:2202.11285v1 [cs.LG])
    We propose Neural GARCH, a class of methods to model conditional heteroskedasticity in financial time series. Neural GARCH is a neural network adaptation of the GARCH 1,1 model in the univariate case, and the diagonal BEKK 1,1 model in the multivariate case. We allow the coefficients of a GARCH model to be time varying in order to reflect the constantly changing dynamics of financial markets. The time varying coefficients are parameterised by a recurrent neural network that is trained with stochastic gradient variational Bayes. We propose two variants of our model, one with normal innovations and the other with Students t innovations. We test our models on a wide range of univariate and multivariate financial time series, and we find that the Neural Students t model consistently outperforms the others.
    Active Learning for Human-in-the-Loop Customs Inspection. (arXiv:2010.14282v3 [cs.LG] UPDATED)
    We study the human-in-the-loop customs inspection scenario, where an AI-assisted algorithm supports customs officers by recommending a set of imported goods to be inspected. If the inspected items are fraudulent, the officers can levy extra duties. Th formed logs are then used as additional training data for successive iterations. Choosing to inspect suspicious items first leads to an immediate gain in customs revenue, yet such inspections may not bring new insights for learning dynamic traffic patterns. On the other hand, inspecting uncertain items can help acquire new knowledge, which will be used as a supplementary training resource to update the selection systems. Based on multiyear customs datasets obtained from three countries, we demonstrate that some degree of exploration is necessary to cope with domain shifts in trade data. The results show that a hybrid strategy of selecting likely fraudulent and uncertain items will eventually outperform the exploitation-only strategy.
    A Deep Reinforcement Learning Approach for the Meal Delivery Problem. (arXiv:2104.12000v2 [cs.LG] UPDATED)
    We consider a meal delivery service fulfilling dynamic customer requests given a set of couriers over the course of a day. A courier's duty is to pick-up an order from a restaurant and deliver it to a customer. We model this service as a Markov decision process and use deep reinforcement learning as the solution approach. We experiment with the resulting policies on synthetic and real-world datasets and compare those with the baseline policies. We also examine the courier utilization for different numbers of couriers. In our analysis, we specifically focus on the impact of the limited available resources in the meal delivery problem. Furthermore, we investigate the effect of intelligent order rejection and re-positioning of the couriers. Our numerical experiments show that, by incorporating the geographical locations of the restaurants, customers, and the depot, our model significantly improves the overall service quality as characterized by the expected total reward and the delivery times. Our results present valuable insights on both the courier assignment process and the optimal number of couriers for different order frequencies on a given day. The proposed model also shows a robust performance under a variety of scenarios for real-world implementation.
    CentSmoothie: Central-Smoothing Hypergraph Neural Networks for Predicting Drug-Drug Interactions. (arXiv:2112.07837v2 [cs.LG] UPDATED)
    Predicting drug-drug interactions (DDI) is the problem of predicting side effects (unwanted outcomes) of a pair of drugs using drug information and known side effects of many pairs. This problem can be formulated as predicting labels (i.e. side effects) for each pair of nodes in a DDI graph, of which nodes are drugs and edges are interacting drugs with known labels. State-of-the-art methods for this problem are graph neural networks (GNNs), which leverage neighborhood information in the graph to learn node representations. For DDI, however, there are many labels with complicated relationships due to the nature of side effects. Usual GNNs often fix labels as one-hot vectors that do not reflect label relationships and potentially do not obtain the highest performance in the difficult cases of infrequent labels. In this paper, we formulate DDI as a hypergraph where each hyperedge is a triple: two nodes for drugs and one node for a label. We then present CentSmoothie, a hypergraph neural network that learns representations of nodes and labels altogether with a novel central-smoothing formulation. We empirically demonstrate the performance advantages of CentSmoothie in simulations as well as real datasets.
    Classification of Computer Aided Engineering (CAE) Parts Using Graph Convolutional Networks. (arXiv:2202.11289v1 [cs.LG])
    CAE engineers work with hundreds of parts spread across multiple body models. A Graph Convolutional Network (GCN) was used to develop a CAE parts classifier. As many as 866 distinct parts from a representative body model were used as training data. The parts were represented as a three-dimensional (3-D) Finite Element Analysis (FEA) mesh with values of each node in the x, y, z coordinate system. The GCN based classifier was compared to fully connected neural network and PointNet based models. Performance of the trained models was evaluated with a test set that included parts from the training data, but with additional holes, rotation, translation, mesh refinement/coarsening, variation of mesh schema, mirroring along x and y axes, variation of topographical features, and change in mesh node ordering. The trained GCN model was able to achieve 88.5% classification accuracy on the test set i.e., it was able to find the correct matching part from the dataset of 866 parts despite significant variation from the baseline part. A CAE parts classifier demonstrated in this study could be very useful for engineers to filter through CAE parts spread across several body models to find parts that meet their requirements.
    CF-GNNExplainer: Counterfactual Explanations for Graph Neural Networks. (arXiv:2102.03322v4 [cs.LG] UPDATED)
    Given the increasing promise of graph neural networks (GNNs) in real-world applications, several methods have been developed for explaining their predictions. Existing methods for interpreting predictions from GNNs have primarily focused on generating subgraphs that are especially relevant for a particular prediction. However, such methods are not counterfactual (CF) in nature: given a prediction, we want to understand how the prediction can be changed in order to achieve an alternative outcome. In this work, we propose a method for generating CF explanations for GNNs: the minimal perturbation to the input (graph) data such that the prediction changes. Using only edge deletions, we find that our method, CF-GNNExplainer, can generate CF explanations for the majority of instances across three widely used datasets for GNN explanations, while removing less than 3 edges on average, with at least 94\% accuracy. This indicates that CF-GNNExplainer primarily removes edges that are crucial for the original predictions, resulting in minimal CF explanations.
    A2P-MANN: Adaptive Attention Inference Hops Pruned Memory-Augmented Neural Networks. (arXiv:2101.09693v2 [cs.CL] UPDATED)
    In this work, to limit the number of required attention inference hops in memory-augmented neural networks, we propose an online adaptive approach called A2P-MANN. By exploiting a small neural network classifier, an adequate number of attention inference hops for the input query is determined. The technique results in elimination of a large number of unnecessary computations in extracting the correct answer. In addition, to further lower computations in A2P-MANN, we suggest pruning weights of the final FC (fully-connected) layers. To this end, two pruning approaches, one with negligible accuracy loss and the other with controllable loss on the final accuracy, are developed. The efficacy of the technique is assessed by using the twenty question-answering (QA) tasks of bAbI dataset. The analytical assessment reveals, on average, more than 42% fewer computations compared to the baseline MANN at the cost of less than 1% accuracy loss. In addition, when used along with the previously published zero-skipping technique, a computation count reduction of up to 68% is achieved. Finally, when the proposed approach (without zero-skipping) is implemented on the CPU and GPU platforms, up to 43% runtime reduction is achieved.  ( 2 min )
    Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. (arXiv:2106.12672v3 [cs.CL] UPDATED)
    State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.  ( 2 min )
    Comparative analysis of machine learning methods for active flow control. (arXiv:2202.11664v1 [physics.flu-dyn])
    Machine learning frameworks such as Genetic Programming (GP) and Reinforcement Learning (RL) are gaining popularity in flow control. This work presents a comparative analysis of the two, bench-marking some of their most representative algorithms against global optimization techniques such as Bayesian Optimization (BO) and Lipschitz global optimization (LIPO). First, we review the general framework of the flow control problem, linking optimal control theory with model-free machine learning methods. Then, we test the control algorithms on three test cases. These are (1) the stabilization of a nonlinear dynamical system featuring frequency cross-talk, (2) the wave cancellation from a Burgers' flow and (3) the drag reduction in a cylinder wake flow. Although the control of these problems has been tackled in the recent literature with one method or the other, we present a comprehensive comparison to illustrate their differences in exploration versus exploitation and their balance between `model capacity' in the control law definition versus `required complexity'. We believe that such a comparison opens the path towards hybridization of the various methods, and we offer some perspective on their future development in the literature of flow control problems.  ( 2 min )
    A general sample complexity analysis of vanilla policy gradient. (arXiv:2107.11433v3 [cs.LG] UPDATED)
    We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence and sample complexity guarantees for the vanilla policy gradient (PG). Our only assumptions are that the expected return is smooth w.r.t. the policy parameters, that its $H$-step truncated gradient is close to the exact gradient, and a certain ABC assumption. This assumption requires the second moment of the estimated gradient to be bounded by $A \geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full batch gradient and an additive constant $C \geq 0$, or any combination of aforementioned. We show that the ABC assumption is more general than the commonly used assumptions on the policy space to prove convergence to a stationary point. We provide a single convergence theorem that recovers the $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity of PG. Our results also affords greater flexibility in the choice of hyper parameters such as the step size and places no restriction on the batch size $m$, including the single trajectory case (i.e., $m=1$). We then instantiate our theorem in different settings, where we both recover existing results and obtained improved sample complexity, e.g., for convergence to the global optimum for Fisher-non-degenerated parameterized policies.  ( 2 min )
    Networked Online Learning for Control of Safety-Critical Resource-Constrained Systems based on Gaussian Processes. (arXiv:2202.11491v1 [eess.SY])
    Safety-critical technical systems operating in unknown environments require the ability to quickly adapt their behavior, which can be achieved in control by inferring a model online from the data stream generated during operation. Gaussian process-based learning is particularly well suited for safety-critical applications as it ensures bounded prediction errors. While there exist computationally efficient approximations for online inference, these approaches lack guarantees for the prediction error and have high memory requirements, and are therefore not applicable to safety-critical systems with tight memory constraints. In this work, we propose a novel networked online learning approach based on Gaussian process regression, which addresses the issue of limited local resources by employing remote data management in the cloud. Our approach formally guarantees a bounded tracking error with high probability, which is exploited to identify the most relevant data to achieve a certain control performance. We further propose an effective data transmission scheme between the local system and the cloud taking bandwidth limitations and time delay of the transmission channel into account. The effectiveness of the proposed method is successfully demonstrated in a simulation.  ( 2 min )
    TEE-based decentralized recommender systems: The raw data sharing redemption. (arXiv:2202.11655v1 [cs.DC])
    Recommenders are central in many applications today. The most effective recommendation schemes, such as those based on collaborative filtering (CF), exploit similarities between user profiles to make recommendations, but potentially expose private data. Federated learning and decentralized learning systems address this by letting the data stay on user's machines to preserve privacy: each user performs the training on local data and only the model parameters are shared. However, sharing the model parameters across the network may still yield privacy breaches. In this paper, we present REX, the first enclave-based decentralized CF recommender. REX exploits Trusted execution environments (TEE), such as Intel software guard extensions (SGX), that provide shielded environments within the processor to improve convergence while preserving privacy. Firstly, REX enables raw data sharing, which ultimately speeds up convergence and reduces the network load. Secondly, REX fully preserves privacy. We analyze the impact of raw data sharing in both deep neural network (DNN) and matrix factorization (MF) recommenders and showcase the benefits of trusted environments in a full-fledged implementation of REX. Our experimental results demonstrate that through raw data sharing, REX significantly decreases the training time by 18.3x and the network load by 2 orders of magnitude over standard decentralized approaches that share only parameters, while fully protecting privacy by leveraging trustworthy hardware enclaves with very little overhead.  ( 2 min )
    Testing report of a fingerprint-based door-opening system. (arXiv:2202.11419v1 [cs.LG])
    This paper describes the operational evaluation of a door-opening system based on a low-cost inkless fingerprint sensor. This system has been developed and installed for access control to one of our laboratories. Experimental results reveal that the system is working fine and no special cleaning requirements neither components replacement is needed. It can support more than 50 users, and an average of 74,5 access attempts per day in a 14-hour 5-day-per-week working. Emphasize is also given on some important facts to be taken into consideration when comparing and evaluating different products from different vendors.  ( 2 min )
    A Bayesian Deep Learning Approach to Near-Term Climate Prediction. (arXiv:2202.11244v1 [physics.ao-ph])
    Since model bias and associated initialization shock are serious shortcomings that reduce prediction skills in state-of-the-art decadal climate prediction efforts, we pursue a complementary machine-learning-based approach to climate prediction. The example problem setting we consider consists of predicting natural variability of the North Atlantic sea surface temperature on the interannual timescale in the pre-industrial control simulation of the Community Earth System Model (CESM2). While previous works have considered the use of recurrent networks such as convolutional LSTMs and reservoir computing networks in this and other similar problem settings, we currently focus on the use of feedforward convolutional networks. In particular, we find that a feedforward convolutional network with a Densenet architecture is able to outperform a convolutional LSTM in terms of predictive skill. Next, we go on to consider a probabilistic formulation of the same network based on Stein variational gradient descent and find that in addition to providing useful measures of predictive uncertainty, the probabilistic (Bayesian) version improves on its deterministic counterpart in terms of predictive skill. Finally, we characterize the reliability of the ensemble of ML models obtained in the probabilistic setting by using analysis tools developed in the context of ensemble numerical weather prediction.  ( 2 min )
    Reconstruction of observed mechanical motions with Artificial Intelligence tools. (arXiv:2202.11447v1 [cs.LG])
    The goal of this paper is to determine the laws of observed trajectories assuming that there is a mechanical system in the background and using these laws to continue the observed motion in a plausible way. The laws are represented by neural networks with a limited number of parameters. The training of the networks follows the Extreme Learning Machine idea. We determine laws for different levels of embedding, thus we can represent not only the equation of motion but also the symmetries of different kinds. In the recursive numerical evolution of the system, we require the fulfillment of all the observed laws, within the determined numerical precision. In this way, we can successfully reconstruct both integrable and chaotic motions, as we demonstrate in the example of the gravity pendulum and the double pendulum.
    Blockchain Framework for Artificial Intelligence Computation. (arXiv:2202.11264v1 [cs.DC])
    Blockchain is an essentially distributed database recording all transactions or digital events among participating parties. Each transaction in the records is approved and verified by consensus of the participants in the system that requires solving a hard mathematical puzzle, which is known as proof-of-work. To make the approved records immutable, the mathematical puzzle is not trivial to solve and therefore consumes substantial computing resources. However, it is energy-wasteful to have many computational nodes installed in the blockchain competing to approve the records by just solving a meaningless puzzle. Here, we pose proof-of-work as a reinforcement-learning problem by modeling the blockchain growing as a Markov decision process, in which a learning agent makes an optimal decision over the environment's state, whereas a new block is added and verified. Specifically, we design the block verification and consensus mechanism as a deep reinforcement-learning iteration process. As a result, our method utilizes the determination of state transition and the randomness of action selection of a Markov decision process, as well as the computational complexity of a deep neural network, collectively to make the blocks not easy to recompute and to preserve the order of transactions, while the blockchain nodes are exploited to train the same deep neural network with different data samples (state-action pairs) in parallel, allowing the model to experience multiple episodes across computing nodes but at one time. Our method is used to design the next generation of public blockchain networks, which has the potential not only to spare computational resources for industrial applications but also to encourage data sharing and AI model design for common problems.
    Deep Reinforcement Learning: Opportunities and Challenges. (arXiv:2202.11296v1 [cs.LG])
    This article is a gentle discussion about the field of reinforcement learning for real life, about opportunities and challenges, with perspectives and without technical details, touching a broad range of topics. The article is based on both historical and recent research papers, surveys, tutorials, talks, blogs, and books. Various groups of readers, like researchers, engineers, students, managers, investors, officers, and people wanting to know more about the field, may find the article interesting. In this article, we first give a brief introduction to reinforcement learning (RL), and its relationship with deep learning, machine learning and AI. Then we discuss opportunities of RL, in particular, applications in products and services, games, recommender systems, robotics, transportation, economics and finance, healthcare, education, combinatorial optimization, computer systems, and science and engineering. The we discuss challenges, in particular, 1) foundation, 2) representation, 3) reward, 4) model, simulation, planning, and benchmarks, 5) learning to learn a.k.a. meta-learning, 6) off-policy/offline learning, 7) software development and deployment, 8) business perspectives, and 9) more challenges. We conclude with a discussion, attempting to answer: "Why has RL not been widely adopted in practice yet?" and "When is RL helpful?".
    LPF-Defense: 3D Adversarial Defense based on Frequency Analysis. (arXiv:2202.11287v1 [cs.CV])
    Although 3D point cloud classification has recently been widely deployed in different application scenarios, it is still very vulnerable to adversarial attacks. This increases the importance of robust training of 3D models in the face of adversarial attacks. Based on our analysis on the performance of existing adversarial attacks, more adversarial perturbations are found in the mid and high-frequency components of input data. Therefore, by suppressing the high-frequency content in the training phase, the models robustness against adversarial examples is improved. Experiments showed that the proposed defense method decreases the success rate of six attacks on PointNet, PointNet++ ,, and DGCNN models. In particular, improvements are achieved with an average increase of classification accuracy by 3.8 % on drop100 attack and 4.26 % on drop200 attack compared to the state-of-the-art methods. The method also improves models accuracy on the original dataset compared to other available methods.
    Differentially Private Regression with Unbounded Covariates. (arXiv:2202.11199v1 [cs.CR])
    We provide computationally efficient, differentially private algorithms for the classical regression settings of Least Squares Fitting, Binary Regression and Linear Regression with unbounded covariates. Prior to our work, privacy constraints in such regression settings were studied under strong a priori bounds on covariates. We consider the case of Gaussian marginals and extend recent differentially private techniques on mean and covariance estimation (Kamath et al., 2019; Karwa and Vadhan, 2018) to the sub-gaussian regime. We provide a novel technical analysis yielding differentially private algorithms for the above classical regression settings. Through the case of Binary Regression, we capture the fundamental and widely-studied models of logistic regression and linearly-separable SVMs, learning an unbiased estimate of the true regression vector, up to a scaling factor.
    Deep Bayesian ICP Covariance Estimation. (arXiv:2202.11607v1 [cs.RO])
    Covariance estimation for the Iterative Closest Point (ICP) point cloud registration algorithm is essential for state estimation and sensor fusion purposes. We argue that a major source of error for ICP is in the input data itself, from the sensor noise to the scene geometry. Benefiting from recent developments in deep learning for point clouds, we propose a data-driven approach to learn an error model for ICP. We estimate covariances modeling data-dependent heteroscedastic aleatoric uncertainty, and epistemic uncertainty using a variational Bayesian approach. The system evaluation is performed on LiDAR odometry on different datasets, highlighting good results in comparison to the state of the art.
    Backdoor Defense in Federated Learning Using Differential Testing and Outlier Detection. (arXiv:2202.11196v1 [cs.CR])
    The goal of federated learning (FL) is to train one global model by aggregating model parameters updated independently on edge devices without accessing users' private data. However, FL is susceptible to backdoor attacks where a small fraction of malicious agents inject a targeted misclassification behavior in the global model by uploading polluted model updates to the server. In this work, we propose DifFense, an automated defense framework to protect an FL system from backdoor attacks by leveraging differential testing and two-step MAD outlier detection, without requiring any previous knowledge of attack scenarios or direct access to local model parameters. We empirically show that our detection method prevents a various number of potential attackers while consistently achieving the convergence of the global model comparable to that trained under federated averaging (FedAvg). We further corroborate the effectiveness and generalizability of our method against prior defense techniques, such as Multi-Krum and coordinate-wise median aggregation. Our detection method reduces the average backdoor accuracy of the global model to below 4% and achieves a false negative rate of zero.
    Multivariate Quantile Function Forecaster. (arXiv:2202.11316v1 [cs.LG])
    We propose Multivariate Quantile Function Forecaster (MQF$^2$), a global probabilistic forecasting method constructed using a multivariate quantile function and investigate its application to multi-horizon forecasting. Prior approaches are either autoregressive, implicitly capturing the dependency structure across time but exhibiting error accumulation with increasing forecast horizons, or multi-horizon sequence-to-sequence models, which do not exhibit error accumulation, but also do typically not model the dependency structure across time steps. MQF$^2$ combines the benefits of both approaches, by directly making predictions in the form of a multivariate quantile function, defined as the gradient of a convex function which we parametrize using input-convex neural networks. By design, the quantile function is monotone with respect to the input quantile levels and hence avoids quantile crossing. We provide two options to train MQF$^2$: with energy score or with maximum likelihood. Experimental results on real-world and synthetic datasets show that our model has comparable performance with state-of-the-art methods in terms of single time step metrics while capturing the time dependency structure.  ( 2 min )
    Cooperative Behavioral Planning for Automated Driving using Graph Neural Networks. (arXiv:2202.11376v1 [cs.RO])
    Urban intersections are prone to delays and inefficiencies due to static precedence rules and occlusions limiting the view on prioritized traffic. Existing approaches to improve traffic flow, widely known as automatic intersection management systems, are mostly based on non-learning reservation schemes or optimization algorithms. Machine learning-based techniques show promising results in planning for a single ego vehicle. This work proposes to leverage machine learning algorithms to optimize traffic flow at urban intersections by jointly planning for multiple vehicles. Learning-based behavior planning poses several challenges, demanding for a suited input and output representation as well as large amounts of ground-truth data. We address the former issue by using a flexible graph-based input representation accompanied by a graph neural network. This allows to efficiently encode the scene and inherently provide individual outputs for all involved vehicles. To learn a sensible policy, without relying on the imitation of expert demonstrations, the cooperative planning task is phrased as a reinforcement learning problem. We train and evaluate the proposed method in an open-source simulation environment for decision making in automated driving. Compared to a first-in-first-out scheme and traffic governed by static priority rules, the learned planner shows a significant gain in flow rate, while reducing the number of induced stops. In addition to synthetic simulations, the approach is also evaluated based on real-world traffic data taken from the publicly available inD dataset.  ( 2 min )
    ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction. (arXiv:2106.06130v4 [cs.LG] UPDATED)
    Effective molecular representation learning is of great importance to facilitate molecular property prediction, which is a fundamental task for the drug and material industry. Recent advances in graph neural networks (GNNs) have shown great promise in applying GNNs for molecular representation learning. Moreover, a few recent studies have also demonstrated successful applications of self-supervised learning methods to pre-train the GNNs to overcome the problem of insufficient labeled molecules. However, existing GNNs and pre-training strategies usually treat molecules as topological graph data without fully utilizing the molecular geometry information. Whereas, the three-dimensional (3D) spatial structure of a molecule, a.k.a molecular geometry, is one of the most critical factors for determining molecular physical, chemical, and biological properties. To this end, we propose a novel Geometry Enhanced Molecular representation learning method (GEM) for Chemical Representation Learning (ChemRL). At first, we design a geometry-based GNN architecture that simultaneously models atoms, bonds, and bond angles in a molecule. To be specific, we devised double graphs for a molecule: The first one encodes the atom-bond relations; The second one encodes bond-angle relations. Moreover, on top of the devised GNN architecture, we propose several novel geometry-level self-supervised learning strategies to learn spatial knowledge by utilizing the local and global molecular 3D structures. We compare ChemRL-GEM with various state-of-the-art (SOTA) baselines on different molecular benchmarks and exhibit that ChemRL-GEM can significantly outperform all baselines in both regression and classification tasks. For example, the experimental results show an overall improvement of 8.8% on average compared to SOTA baselines on the regression tasks, demonstrating the superiority of the proposed method.  ( 3 min )
    The Larger The Fairer? Small Neural Networks Can Achieve Fairness for Edge Devices. (arXiv:2202.11317v1 [cs.LG])
    Along with the progress of AI democratization, neural networks are being deployed more frequently in edge devices for a wide range of applications. Fairness concerns gradually emerge in many applications, such as face recognition and mobile medical. One fundamental question arises: what will be the fairest neural architecture for edge devices? By examining the existing neural networks, we observe that larger networks typically are fairer. But, edge devices call for smaller neural architectures to meet hardware specifications. To address this challenge, this work proposes a novel Fairness- and Hardware-aware Neural architecture search framework, namely FaHaNa. Coupled with a model freezing approach, FaHaNa can efficiently search for neural networks with balanced fairness and accuracy, while guaranteed to meet hardware specifications. Results show that FaHaNa can identify a series of neural networks with higher fairness and accuracy on a dermatology dataset. Target edge devices, FaHaNa finds a neural architecture with slightly higher accuracy, 5.28x smaller size, 15.14% higher fairness score, compared with MobileNetV2; meanwhile, on Raspberry PI and Odroid XU-4, it achieves 5.75x and 5.79x speedup.
    Neural Collaborative Filtering Bandits via Meta Learning. (arXiv:2201.13395v2 [cs.LG] UPDATED)
    Contextual multi-armed bandits provide powerful tools to solve the exploitation-exploration dilemma in decision making, with direct applications in the personalized recommendation. In fact, collaborative effects among users carry the significant potential to improve the recommendation. In this paper, we introduce and study the problem by exploring `Neural Collaborative Filtering Bandits', where the rewards can be non-linear functions and groups are formed dynamically given different specific contents. To solve this problem, inspired by meta-learning, we propose Meta-Ban (meta-bandits), where a meta-learner is designed to represent and rapidly adapt to dynamic groups, along with a UCB-based exploration strategy. Furthermore, we analyze that Meta-Ban can achieve the regret bound of $\mathcal{O}(\sqrt{T \log T})$, improving a multiplicative factor $\sqrt{\log T}$ over state-of-the-art related works. In the end, we conduct extensive experiments showing that Meta-Ban significantly outperforms six strong baselines.
    Pricing options on flow forwards by neural networks in Hilbert space. (arXiv:2202.11606v1 [q-fin.PR])
    We propose a new methodology for pricing options on flow forwards by applying infinite-dimensional neural networks. We recast the pricing problem as an optimization problem in a Hilbert space of real-valued function on the positive real line, which is the state space for the term structure dynamics. This optimization problem is solved by facilitating a novel feedforward neural network architecture designed for approximating continuous functions on the state space. The proposed neural net is built upon the basis of the Hilbert space. We provide an extensive case study that shows excellent numerical efficiency, with superior performance over that of a classical neural net trained on sampling the term structure curves.
    Traversing Time with Multi-Resolution Gaussian Process State-Space Models. (arXiv:2112.03230v2 [cs.LG] UPDATED)
    Gaussian Process state-space models capture complex temporal dependencies in a principled manner by placing a Gaussian Process prior on the transition function. These models have a natural interpretation as discretized stochastic differential equations, but inference for long sequences with fast and slow transitions is difficult. Fast transitions need tight discretizations whereas slow transitions require backpropagating the gradients over long subtrajectories. We propose a novel Gaussian process state-space architecture composed of multiple components, each trained on a different resolution, to model effects on different timescales. The combined model allows traversing time on adaptive scales, providing efficient inference for arbitrarily long sequences with complex dynamics. We benchmark our novel method on semi-synthetic data and on an engine modeling task. In both experiments, our approach compares favorably against its state-of-the-art alternatives that operate on a single time-scale only.
    Thermal hand image segmentation for biometric recognition. (arXiv:2202.11462v1 [cs.CV])
    In this paper we present a method to identify people by means of thermal (TH) and visible (VIS) hand images acquired simultaneously with a TESTO 882-3 camera. In addition, we also present a new database specially acquired for this work. The real challenge when dealing with TH images is the cold finger areas, which can be confused with the acquisition surface. This problem is solved by taking advantage of the VIS information. We have performed different tests to show how TH and VIS images work in identification problems. Experimental results reveal that TH hand image is as suitable for biometric recognition systems as VIS hand images, and better results are obtained when combining this information. A Biometric Dispersion Matcher has been used as a feature vector dimensionality reduction technique as well as a classification task. Its selection criteria helps to reduce the length of the vectors used to perform identification up to a hundred measurements. Identification rates reach a maximum value of 98.3% under these conditions, when using a database of 104 people.
    Residual Bootstrap Exploration for Stochastic Linear Bandit. (arXiv:2202.11474v1 [stat.ML])
    We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. Such observation enables us to show that the proposed \texttt{LinReBoot} secure a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.
    A Comparative Study of Deep Reinforcement Learning-based Transferable Energy Management Strategies for Hybrid Electric Vehicles. (arXiv:2202.11514v1 [cs.LG])
    The deep reinforcement learning-based energy management strategies (EMS) has become a promising solution for hybrid electric vehicles (HEVs). When driving cycles are changed, the network will be retrained, which is a time-consuming and laborious task. A more efficient way of choosing EMS is to combine deep reinforcement learning (DRL) with transfer learning, which can transfer knowledge of one domain to the other new domain, making the network of the new domain reach convergence values quickly. Different exploration methods of RL, including adding action space noise and parameter space noise, are compared against each other in the transfer learning process in this work. Results indicate that the network added parameter space noise is more stable and faster convergent than the others. In conclusion, the best exploration method for transferable EMS is to add noise in the parameter space, while the combination of action space noise and parameter space noise generally performs poorly. Our code is available at https://github.com/BIT-XJY/RL-based-Transferable-EMS.git.
    Contrastive Pre-training for Imbalanced Corporate Credit Ratings. (arXiv:2102.12580v2 [cs.LG] UPDATED)
    Corporate credit rating reflects the level of corporate credit and plays a crucial role in modern financial risk control. But real-world credit rating data usually shows long-tail distributions, which means heavy class imbalanced problem challenging the corporate credit rating system greatly. To tackle that, inspried by the recent advances of pre-train techniques in self-supervised representation learning, we propose a novel framework named Contrastive Pre-training for Corporate Credit Rating (CP4CCR), which utilizes the self-surpervision for getting over class imbalance. Specifically, we propose to, in the first phase, exert constrastive self-superivised pre-training without label information, which want to learn a better class-agnostic initialization. During this phase, two self-supervised task are developed within CP4CCR: (i) Feature Masking (FM) and (ii) Feature Swapping(FS). In the second phase, we can train any standard corporate redit rating model initialized by the pre-trained network. Extensive experiments conducted on the Chinese public-listed corporate rating dataset, prove that CP4CCR can improve the performance of standard corporate credit rating models, especially for class with few samples.
    Exploratory Methods for Relation Discovery in Archival Data. (arXiv:2202.11361v1 [cs.LG])
    In this article we propose a holistic approach to discover relations in art historical communities and enrich historians' biographies and archival descriptions with graph patterns relevant to art historiographic enquiry. We use exploratory data analysis to detect patterns, we select features, and we use them to evaluate classification models to predict new relations, to be recommended to archivists during the cataloguing phase. Results show that relations based on biographical information can be addressed with higher precision than relations based on research topics or institutional relations. Deterministic and a priori rules present better results than probabilistic methods.
    Fairness-Aware Naive Bayes Classifier for Data with Multiple Sensitive Features. (arXiv:2202.11499v1 [cs.LG])
    Fairness-aware machine learning seeks to maximise utility in generating predictions while avoiding unfair discrimination based on sensitive attributes such as race, sex, religion, etc. An important line of work in this field is enforcing fairness during the training step of a classifier. A simple yet effective binary classification algorithm that follows this strategy is two-naive-Bayes (2NB), which enforces statistical parity - requiring that the groups comprising the dataset receive positive labels with the same likelihood. In this paper, we generalise this algorithm into N-naive-Bayes (NNB) to eliminate the simplification of assuming only two sensitive groups in the data and instead apply it to an arbitrary number of groups. We propose an extension of the original algorithm's statistical parity constraint and the post-processing routine that enforces statistical independence of the label and the single sensitive attribute. Then, we investigate its application on data with multiple sensitive features and propose a new constraint and post-processing routine to enforce differential fairness, an extension of established group-fairness constraints focused on intersectionalities. We empirically demonstrate the effectiveness of the NNB algorithm on US Census datasets and compare its accuracy and debiasing performance, as measured by disparate impact and DF-$\epsilon$ score, with similar group-fairness algorithms. Finally, we lay out important considerations users should be aware of before incorporating this algorithm into their application, and direct them to further reading on the pros, cons, and ethical implications of using statistical parity as a fairness criterion.
    Short-answer scoring with ensembles of pretrained language models. (arXiv:2202.11558v1 [cs.CL])
    We investigate the effectiveness of ensembles of pretrained transformer-based language models on short answer questions using the Kaggle Automated Short Answer Scoring dataset. We fine-tune a collection of popular small, base, and large pretrained transformer-based language models, and train one feature-base model on the dataset with the aim of testing ensembles of these models. We used an early stopping mechanism and hyperparameter optimization in training. We observe that generally that the larger models perform slightly better, however, they still fall short of state-of-the-art results one their own. Once we consider ensembles of models, there are ensembles of a number of large networks that do produce state-of-the-art results, however, these ensembles are too large to realistically be put in a production environment.
    Cyclical Variational Bayes Monte Carlo for Efficient Multi-Modal Posterior Distributions Evaluation. (arXiv:2202.11645v1 [stat.CO])
    Statistical model updating is frequently used in engineering to calculate the uncertainty of some unknown latent parameters when a set of measurements on observable quantities is given. Variational inference is an alternative approach to sampling methods that has been developed by the machine learning community to estimate posterior approximations through an optimization approach. In this paper, the Variational Bayesian Monte Carlo (VBMC) method is investigated with the purpose of dealing with statistical model updating problems in engineering involving expensive-to-run models. This method combines the active-sampling Bayesian quadrature with a Gaussian-process based variational inference to yield a non-parametric estimation of the posterior distribution of the identified parameters involving few runs of the expensive-to-run model. VBMC can also be used for model selection as it produces an estimation of the model's evidence lower bound. In this paper, a variant of the VBMC algorithm is developed through the introduction of a cyclical annealing schedule into the algorithm. The proposed cyclical VBMC algorithm allows to deal effectively with multi-modal posteriors by having multiple cycles of exploration and exploitation phases. Four numerical examples are used to compare the standard VBMC algorithm, the monotonic VBMC, the cyclical VBMC and the Transitional Ensemble Markov Chain Monte Carlo (TEMCMC). Overall, it is found that the proposed cyclical VBMC approach yields accurate results with a very reduced number of model runs compared to the state of the art sampling technique TEMCMC. In the presence of potential multi-modal problems, the proposed cyclical VBMC algorithm outperforms all the other approaches in terms of accuracy of the resulting posterior.
    Learning Fast and Slow for Online Time Series Forecasting. (arXiv:2202.11672v1 [cs.LG])
    The fast adaptation capability of deep neural networks in non-stationary environments is critical for online time series forecasting. Successful solutions require handling changes to new and recurring patterns. However, training deep neural forecaster on the fly is notoriously challenging because of their limited ability to adapt to non-stationary environments and the catastrophic forgetting of old knowledge. In this work, inspired by the Complementary Learning Systems (CLS) theory, we propose Fast and Slow learning Networks (FSNet), a holistic framework for online time-series forecasting to simultaneously deal with abrupt changing and repeating patterns. Particularly, FSNet improves the slowly-learned backbone by dynamically balancing fast adaptation to recent changes and retrieving similar old knowledge. FSNet achieves this mechanism via an interaction between two complementary components of an adapter to monitor each layer's contribution to the lost, and an associative memory to support remembering, updating, and recalling repeating events. Extensive experiments on real and synthetic datasets validate FSNet's efficacy and robustness to both new and recurring patterns. Our code will be made publicly available.
    Exponential Tail Local Rademacher Complexity Risk Bounds Without the Bernstein Condition. (arXiv:2202.11461v1 [math.ST])
    The local Rademacher complexity framework is one of the most successful general-purpose toolboxes for establishing sharp excess risk bounds for statistical estimators based on the framework of empirical risk minimization. Applying this toolbox typically requires using the Bernstein condition, which often restricts applicability to convex and proper settings. Recent years have witnessed several examples of problems where optimal statistical performance is only achievable via non-convex and improper estimators originating from aggregation theory, including the fundamental problem of model selection. These examples are currently outside of the reach of the classical localization theory. In this work, we build upon the recent approach to localization via offset Rademacher complexities, for which a general high-probability theory has yet to be established. Our main result is an exponential-tail excess risk bound expressed in terms of the offset Rademacher complexity that yields results at least as sharp as those obtainable via the classical theory. However, our bound applies under an estimator-dependent geometric condition (the "offset condition") instead of the estimator-independent (but, in general, distribution-dependent) Bernstein condition on which the classical theory relies. Our results apply to improper prediction regimes not directly covered by the classical theory.
    On PAC-Bayesian reconstruction guarantees for VAEs. (arXiv:2202.11455v1 [cs.LG])
    Despite its wide use and empirical successes, the theoretical understanding and study of the behaviour and performance of the variational autoencoder (VAE) have only emerged in the past few years. We contribute to this recent line of work by analysing the VAE's reconstruction ability for unseen test data, leveraging arguments from the PAC-Bayes theory. We provide generalisation bounds on the theoretical reconstruction error, and provide insights on the regularisation effect of VAE objectives. We illustrate our theoretical results with supporting experiments on classical benchmark datasets.  ( 2 min )
    NetRCA: An Effective Network Fault Cause Localization Algorithm. (arXiv:2202.11269v1 [cs.LG])
    Localizing the root cause of network faults is crucial to network operation and maintenance. However, due to the complicated network architectures and wireless environments, as well as limited labeled data, accurately localizing the true root cause is challenging. In this paper, we propose a novel algorithm named NetRCA to deal with this problem. Firstly, we extract effective derived features from the original raw data by considering temporal, directional, attribution, and interaction characteristics. Secondly, we adopt multivariate time series similarity and label propagation to generate new training data from both labeled and unlabeled data to overcome the lack of labeled samples. Thirdly, we design an ensemble model which combines XGBoost, rule set learning, attribution model, and graph algorithm, to fully utilize all data information and enhance performance. Finally, experiments and analysis are conducted on the real-world dataset from ICASSP 2022 AIOps Challenge to demonstrate the superiority and effectiveness of our approach.  ( 2 min )
    Low-Regret Active learning. (arXiv:2104.02822v3 [cs.LG] UPDATED)
    We develop an online learning algorithm for identifying unlabeled data points that are most informative for training (i.e., active learning). By formulating the active learning problem as the prediction with sleeping experts problem, we provide a regret minimization framework for identifying relevant data with respect to any given definition of informativeness. Motivated by the successes of ensembles in active learning, we define regret with respect to an omnipotent algorithm that has access to an infinity large ensemble. At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on easy instances while remaining resilient to adversarial ones. Low regret implies that we can be provably competitive with an ensemble method \emph{without the computational burden of having to train an ensemble}. This stands in contrast to state-of-the-art active learning methods that are overwhelmingly based on greedy selection, and hence cannot ensure good performance across problem instances with high amounts of noise. We present empirical results demonstrating that our method (i) instantiated with an informativeness measure consistently outperforms its greedy counterpart and (ii) reliably outperforms uniform sampling on real-world scenarios.  ( 2 min )
    Enhancing Sentence Embedding with Generalized Pooling. (arXiv:1806.09828v1 [cs.CL] CROSS LISTED)
    Pooling is an essential component of a wide variety of sentence representation and embedding models. This paper explores generalized pooling methods to enhance sentence embedding. We propose vector-based multi-head attention that includes the widely used max pooling, mean pooling, and scalar self-attention as special cases. The model benefits from properly designed penalization terms to reduce redundancy in multi-head attention. We evaluate the proposed model on three different tasks: natural language inference (NLI), author profiling, and sentiment classification. The experiments show that the proposed model achieves significant improvement over strong sentence-encoding-based methods, resulting in state-of-the-art performances on four datasets. The proposed approach can be easily implemented for more problems than we discuss in this paper.  ( 2 min )
    Parallel MCMC Without Embarrassing Failures. (arXiv:2202.11154v1 [stat.ML])
    Embarrassingly parallel Markov Chain Monte Carlo (MCMC) exploits parallel computing to scale Bayesian inference to large datasets by using a two-step approach. First, MCMC is run in parallel on (sub)posteriors defined on data partitions. Then, a server combines local results. While efficient, this framework is very sensitive to the quality of subposterior sampling. Common sampling problems such as missing modes or misrepresentation of low-density regions are amplified -- instead of being corrected -- in the combination phase, leading to catastrophic failures. In this work, we propose a novel combination strategy to mitigate this issue. Our strategy, Parallel Active Inference (PAI), leverages Gaussian Process (GP) surrogate modeling and active learning. After fitting GPs to subposteriors, PAI (i) shares information between GP surrogates to cover missing modes; and (ii) uses active sampling to individually refine subposterior approximations. We validate PAI in challenging benchmarks, including heavy-tailed and multi-modal posteriors and a real-world application to computational neuroscience. Empirical results show that PAI succeeds where previous methods catastrophically fail, with a small communication overhead.
    Deep Metric Learning-Based Semi-Supervised Regression With Alternate Learning. (arXiv:2202.11388v1 [cs.CV])
    This paper introduces a novel deep metric learning-based semi-supervised regression (DML-S2R) method for parameter estimation problems. The proposed DML-S2R method aims to mitigate the problems of insufficient amount of labeled samples without collecting any additional samples with target values. To this end, the proposed DML-S2R method is made up of two main steps: i) pairwise similarity modeling with scarce labeled data; and ii) triplet-based metric learning with abundant unlabeled data. The first step aims to model pairwise sample similarities by using a small number of labeled samples. This is achieved by estimating the target value differences of labeled samples with a Siamese neural network (SNN). The second step aims to learn a triplet-based metric space (in which similar samples are close to each other and dissimilar samples are far apart from each other) when the number of labeled samples is insufficient. This is achieved by employing the SNN of the first step for triplet-based deep metric learning that exploits not only labeled samples but also unlabeled samples. For the end-to-end training of DML-S2R, we investigate an alternate learning strategy for the two steps. Due to this strategy, the encoded information in each step becomes a guidance for learning the other step. The experimental results confirm the success of DML-S2R compared to the state-of-the-art semi-supervised regression methods. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/DML-S2R.
    Augmentation based unsupervised domain adaptation. (arXiv:2202.11486v1 [eess.IV])
    The insertion of deep learning in medical image analysis had lead to the development of state-of-the art strategies in several applications such a disease classification, as well as abnormality detection and segmentation. However, even the most advanced methods require a huge and diverse amount of data to generalize. Because in realistic clinical scenarios, data acquisition and annotation is expensive, deep learning models trained on small and unrepresentative data tend to outperform when deployed in data that differs from the one used for training (e.g data from different scanners). In this work, we proposed a domain adaptation methodology to alleviate this problem in segmentation models. Our approach takes advantage of the properties of adversarial domain adaptation and consistency training to achieve more robust adaptation. Using two datasets with white matter hyperintensities (WMH) annotations, we demonstrated that the proposed method improves model generalization even in corner cases where individual strategies tend to fail.
    Arbitrary Shape Text Detection using Transformers. (arXiv:2202.11221v1 [cs.CV])
    Recent text detection frameworks require several handcrafted components such as anchor generation, non-maximum suppression (NMS), or multiple processing stages (e.g. label generation) to detect arbitrarily shaped text images. In contrast, we propose an end-to-end trainable architecture based on Detection using Transformers (DETR), that outperforms previous state-of-the-art methods in arbitrary-shaped text detection. At its core, our proposed method leverages a bounding box loss function that accurately measures the arbitrary detected text regions' changes in scale and aspect ratio. This is possible due to a hybrid shape representation made from Bezier curves, that are further split into piece-wise polygons. The proposed loss function is then a combination of a generalized-split-intersection-over-union loss defined over the piece-wise polygons and regularized by a Smooth-$\ln$ regression over the Bezier curve's control points. We evaluate our proposed model using Total-Text and CTW-1500 datasets for curved text, and MSRA-TD500 and ICDAR15 datasets for multi-oriented text, and show that the proposed method outperforms the previous state-of-the-art methods in arbitrary-shape text detection tasks.  ( 2 min )
    GradMax: Growing Neural Networks using Gradient Information. (arXiv:2201.05125v2 [cs.LG] UPDATED)
    The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in variety of vision tasks and architectures.  ( 2 min )
    Indiscriminate Poisoning Attacks on Unsupervised Contrastive Learning. (arXiv:2202.11202v1 [cs.LG])
    Indiscriminate data poisoning attacks are quite effective against supervised learning. However, not much is known about their impact on unsupervised contrastive learning (CL). This paper is the first to consider indiscriminate data poisoning attacks on contrastive learning, demonstrating the feasibility of such attacks, and their differences from indiscriminate poisoning of supervised learning. We also highlight differences between contrastive learning algorithms, and show that some algorithms (e.g., SimCLR) are more vulnerable than others (e.g., MoCo). We differentiate between two types of data poisoning attacks: sample-wise attacks, which add specific noise to each image, cause the largest drop in accuracy, but do not transfer well across SimCLR, MoCo, and BYOL. In contrast, attacks that use class-wise noise, though cause a smaller drop in accuracy, transfer well across different CL algorithms. Finally, we show that a new data augmentation based on matrix completion can be highly effective in countering data poisoning attacks on unsupervised contrastive learning.
    Early Stage Diabetes Prediction via Extreme Learning Machine. (arXiv:2202.11216v1 [cs.LG])
    Diabetes is one of the chronic diseases that has been discovered for decades. However, several cases are diagnosed in their late stages. Every one in eleven of the world's adult population has diabetes. Forty-six percent of people with diabetes have not been diagnosed. Diabetes can develop several other severe diseases that can lead to patient death. Developing and rural areas suffer the most due to the limited medical providers and financial situations. This paper proposed a novel approach based on an extreme learning machine for diabetes prediction based on a data questionnaire that can early alert the users to seek medical assistance and prevent late diagnoses and severe illness development.
    Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms. (arXiv:2202.11277v1 [cs.IT])
    We consider the problem of quantizing a linear model learned from measurements $\mathbf{X} = \mathbf{W}\boldsymbol{\theta} + \mathbf{v}$. The model is constrained to be representable using only $dB$-bits, where $B \in (0, \infty)$ is a pre-specified budget and $d$ is the dimension of the model. We derive an information-theoretic lower bound for the minimax risk under this setting and show that it is tight with a matching upper bound. This upper bound is achieved using randomized embedding based algorithms. We propose randomized Hadamard embeddings that are computationally efficient while performing near-optimally. We also show that our method and upper-bounds can be extended for two-layer ReLU neural networks. Numerical simulations validate our theoretical claims.
    Fast Sparse Classification for Generalized Linear and Additive Models. (arXiv:2202.11389v1 [cs.LG])
    We present fast classification techniques for sparse generalized linear and additive models. These techniques can handle thousands of features and thousands of observations in minutes, even in the presence of many highly correlated features. For fast sparse logistic regression, our computational speed-up over other best-subset search techniques owes to linear and quadratic surrogate cuts for the logistic loss that allow us to efficiently screen features for elimination, as well as use of a priority queue that favors a more uniform exploration of features. As an alternative to the logistic loss, we propose the exponential loss, which permits an analytical solution to the line search at each iteration. Our algorithms are generally 2 to 5 times faster than previous approaches. They produce interpretable models that have accuracy comparable to black box models on challenging datasets.
    A Differential Attention Fusion Model Based on Transformer for Time Series Forecasting. (arXiv:2202.11402v1 [cs.LG])
    Time series forecasting is widely used in the fields of equipment life cycle forecasting, weather forecasting, traffic flow forecasting, and other fields. Recently, some scholars have tried to apply Transformer to time series forecasting because of its powerful parallel training ability. However, the existing Transformer methods do not pay enough attention to the small time segments that play a decisive role in prediction, making it insensitive to small changes that affect the trend of time series, and it is difficult to effectively learn continuous time-dependent features. To solve this problem, we propose a differential attention fusion model based on Transformer, which designs the differential layer, neighbor attention, sliding fusion mechanism, and residual layer on the basis of classical Transformer architecture. Specifically, the differences of adjacent time points are extracted and focused by difference and neighbor attention. The sliding fusion mechanism fuses various features of each time point so that the data can participate in encoding and decoding without losing important information. The residual layer including convolution and LSTM further learns the dependence between time points and enables our model to carry out deeper training. A large number of experiments on three datasets show that the prediction results produced by our method are favorably comparable to the state-of-the-art.
    FlowSense: Monitoring Airflow in Building Ventilation Systems Using Audio Sensing. (arXiv:2202.11136v1 [cs.SD])
    Proper indoor ventilation through buildings' heating, ventilation, and air conditioning (HVAC) systems has become an increasing public health concern that significantly impacts individuals' health and safety at home, work, and school. While much work has progressed in providing energy-efficient and user comfort for HVAC systems through IoT devices and mobile-sensing approaches, ventilation is an aspect that has received lesser attention despite its importance. With a motivation to monitor airflow from building ventilation systems through commodity sensing devices, we present FlowSense, a machine learning-based algorithm to predict airflow rate from sensed audio data in indoor spaces. Our ML technique can predict the state of an air vent-whether it is on or off-as well as the rate of air flowing through active vents. By exploiting a low-pass filter to obtain low-frequency audio signals, we put together a privacy-preserving pipeline that leverages a silence detection algorithm to only sense for sounds of air from HVAC air vent when no human speech is detected. We also propose the Minimum Persistent Sensing (MPS) as a post-processing algorithm to reduce interference from ambient noise, including ongoing human conversation, office machines, and traffic noises. Together, these techniques ensure user privacy and improve the robustness of FlowSense. We validate our approach yielding over 90% accuracy in predicting vent status and 0.96 MSE in predicting airflow rate when the device is placed within 2.25 meters away from an air vent. Additionally, we demonstrate how our approach as a mobile audio-sensing platform is robust to smartphone models, distance, and orientation. Finally, we evaluate FlowSense privacy-preserving pipeline through a user study and a Google Speech Recognition service, confirming that the audio signals we used as input data are inaudible and inconstructible.  ( 2 min )
    Privacy issues on biometric systems. (arXiv:2202.11415v1 [cs.CR])
    In the XXIth century there is a strong interest on privacy issues. Technology permits obtaining personal information without individuals consent, computers make it feasible to share and process this information, and this can bring about damaging implications. In some sense, biometric information is personal information, so it is important to be conscious about what is true and what is false when some people claim that biometrics is an attempt to individuals privacy. In this paper, key points related to this matter are dealt with.
    ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints. (arXiv:2202.11271v1 [cs.RO])
    Robotic navigation has been approached as a problem of 3D reconstruction and planning, as well as an end-to-end learning problem. However, long-range navigation requires both planning and reasoning about local traversability, as well as being able to utilize information about global geography, in the form of a roadmap, GPS, or other side information, which provides important navigational hints but may be low-fidelity or unreliable. In this work, we propose a learning-based approach that integrates learning and planning, and can utilize side information such as schematic roadmaps, satellite maps and GPS coordinates as a planning heuristic, without relying on them being accurate. Our method, ViKiNG, incorporates a local traversability model, which looks at the robot's current camera observation and a potential subgoal to infer how easily that subgoal can be reached, as well as a heuristic model, which looks at overhead maps and attempts to estimate the distance to the destination for various subgoals. These models are used by a heuristic planner to decide the best next subgoal in order to reach the final destination. Our method performs no explicit geometric reconstruction, utilizing only a topological representation of the environment. Despite having never seen trajectories longer than 80 meters in its training dataset, ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away in previously unseen environments, and exhibit complex behaviors such as probing potential paths and doubling back when they are found to be non-viable. ViKiNG is also robust to unreliable maps and GPS, since the low-level controller ultimately makes decisions based on egocentric image observations, using maps only as planning heuristics. For videos of our experiments, please check out https://sites.google.com/view/viking-release.
    Learning Temporal Point Processes for Efficient Retrieval of Continuous Time Event Sequences. (arXiv:2202.11485v1 [cs.IR])
    Recent developments in predictive modeling using marked temporal point processes (MTPP) have enabled an accurate characterization of several real-world applications involving continuous-time event sequences (CTESs). However, the retrieval problem of such sequences remains largely unaddressed in literature. To tackle this, we propose NEUROSEQRET which learns to retrieve and rank a relevant set of continuous-time event sequences for a given query sequence, from a large corpus of sequences. More specifically, NEUROSEQRET first applies a trainable unwarping function on the query sequence, which makes it comparable with corpus sequences, especially when a relevant query-corpus pair has individually different attributes. Next, it feeds the unwarped query sequence and the corpus sequence into MTPP guided neural relevance models. We develop two variants of the relevance model which offer a tradeoff between accuracy and efficiency. We also propose an optimization framework to learn binary sequence embeddings from the relevance scores, suitable for the locality-sensitive hashing leading to a significant speedup in returning top-K results for a given query sequence. Our experiments with several datasets show the significant accuracy boost of NEUROSEQRET beyond several baselines, as well as the efficacy of our hashing mechanism.
    Bag Graph: Multiple Instance Learning using Bayesian Graph Neural Networks. (arXiv:2202.11132v1 [cs.LG])
    Multiple Instance Learning (MIL) is a weakly supervised learning problem where the aim is to assign labels to sets or bags of instances, as opposed to traditional supervised learning where each instance is assumed to be independent and identically distributed (IID) and is to be labeled individually. Recent work has shown promising results for neural network models in the MIL setting. Instead of focusing on each instance, these models are trained in an end-to-end fashion to learn effective bag-level representations by suitably combining permutation invariant pooling techniques with neural architectures. In this paper, we consider modelling the interactions between bags using a graph and employ Graph Neural Networks (GNNs) to facilitate end-to-end learning. Since a meaningful graph representing dependencies between bags is rarely available, we propose to use a Bayesian GNN framework that can generate a likely graph structure for scenarios where there is uncertainty in the graph or when no graph is available. Empirical results demonstrate the efficacy of the proposed technique for several MIL benchmark tasks and a distribution regression task.
    Continual Auxiliary Task Learning. (arXiv:2202.11133v1 [cs.LG])
    Learning auxiliary tasks, such as multiple predictions about the world, can provide many benefits to reinforcement learning systems. A variety of off-policy learning algorithms have been developed to learn such predictions, but as yet there is little work on how to adapt the behavior to gather useful data for those off-policy predictions. In this work, we investigate a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions to improve those auxiliary predictions. We highlight the inherent non-stationarity in this continual auxiliary task learning problem, for both prediction learners and the behavior learner. We develop an algorithm based on successor features that facilitates tracking under non-stationary rewards, and prove the separation into learning successor features and rewards provides convergence rate improvements. We conduct an in-depth study into the resulting multi-prediction learning system.
    Multi-fidelity reinforcement learning framework for shape optimization. (arXiv:2202.11170v1 [cs.LG])
    Deep reinforcement learning (DRL) is a promising outer-loop intelligence paradigm which can deploy problem solving strategies for complex tasks. Consequently, DRL has been utilized for several scientific applications, specifically in cases where classical optimization or control methods are limited. One key limitation of conventional DRL methods is their episode-hungry nature which proves to be a bottleneck for tasks which involve costly evaluations of a numerical forward model. In this article, we address this limitation of DRL by introducing a controlled transfer learning framework that leverages a multi-fidelity simulation setting. Our strategy is deployed for an airfoil shape optimization problem at high Reynolds numbers, where our framework can learn an optimal policy for generating efficient airfoil shapes by gathering knowledge from multi-fidelity environments and reduces computational costs by over 30\%. Furthermore, our formulation promotes policy exploration and generalization to new environments, thereby preventing over-fitting to data from solely one fidelity. Our results demonstrate this framework's applicability to other scientific DRL scenarios where multi-fidelity environments can be used for policy learning.
    r-G2P: Evaluating and Enhancing Robustness of Grapheme to Phoneme Conversion by Controlled noise introducing and Contextual information incorporation. (arXiv:2202.11194v1 [eess.AS])
    Grapheme-to-phoneme (G2P) conversion is the process of converting the written form of words to their pronunciations. It has an important role for text-to-speech (TTS) synthesis and automatic speech recognition (ASR) systems. In this paper, we aim to evaluate and enhance the robustness of G2P models. We show that neural G2P models are extremely sensitive to orthographical variations in graphemes like spelling mistakes. To solve this problem, we propose three controlled noise introducing methods to synthesize noisy training data. Moreover, we incorporate the contextual information with the baseline and propose a robust training strategy to stabilize the training process. The experimental results demonstrate that our proposed robust G2P model (r-G2P) outperforms the baseline significantly (-2.73\% WER on Dict-based benchmarks and -9.09\% WER on Real-world sources).
    ProtoSound: A Personalized and Scalable Sound Recognition System for Deaf and Hard-of-Hearing Users. (arXiv:2202.11134v1 [cs.HC])
    Recent advances have enabled automatic sound recognition systems for deaf and hard of hearing (DHH) users on mobile devices. However, these tools use pre-trained, generic sound recognition models, which do not meet the diverse needs of DHH users. We introduce ProtoSound, an interactive system for customizing sound recognition models by recording a few examples, thereby enabling personalized and fine-grained categories. ProtoSound is motivated by prior work examining sound awareness needs of DHH people and by a survey we conducted with 472 DHH participants. To evaluate ProtoSound, we characterized performance on two real-world sound datasets, showing significant improvement over state-of-the-art (e.g., +9.7% accuracy on the first dataset). We then deployed ProtoSound's end-user training and real-time recognition through a mobile application and recruited 19 hearing participants who listened to the real-world sounds and rated the accuracy across 56 locations (e.g., homes, restaurants, parks). Results show that ProtoSound personalized the model on-device in real-time and accurately learned sounds across diverse acoustic contexts. We close by discussing open challenges in personalizable sound recognition, including the need for better recording interfaces and algorithmic improvements.
    Designing Decision Support Systems for Emergency Response: Challenges and Opportunities. (arXiv:2202.11268v1 [cs.AI])
    Designing effective emergency response management (ERM) systems to respond to incidents such as road accidents is a major problem faced by communities. In addition to responding to frequent incidents each day (about 240 million emergency medical services calls and over 5 million road accidents in the US each year), these systems also support response during natural hazards. Recently, there has been a consistent interest in building decision support and optimization tools that can help emergency responders provide more efficient and effective response. This includes a number of principled subsystems that implement early incident detection, incident likelihood forecasting and strategic resource allocation and dispatch policies. In this paper, we highlight the key challenges and provide an overview of the approach developed by our team in collaboration with our community partners.
    Robust Geometric Metric Learning. (arXiv:2202.11550v1 [stat.ML])
    This paper proposes new algorithms for the metric learning problem. We start by noticing that several classical metric learning formulations from the literature can be viewed as modified covariance matrix estimation problems. Leveraging this point of view, a general approach, called Robust Geometric Metric Learning (RGML), is then studied. This method aims at simultaneously estimating the covariance matrix of each class while shrinking them towards their (unknown) barycenter. We focus on two specific costs functions: one associated with the Gaussian likelihood (RGML Gaussian), and one with Tyler's M -estimator (RGML Tyler). In both, the barycenter is defined with the Riemannian distance, which enjoys nice properties of geodesic convexity and affine invariance. The optimization is performed using the Riemannian geometry of symmetric positive definite matrices and its submanifold of unit determinant. Finally, the performance of RGML is asserted on real datasets. Strong performance is exhibited while being robust to mislabeled data.
    Shisha: Online scheduling of CNN pipelines on heterogeneous architectures. (arXiv:2202.11575v1 [cs.PF])
    Chiplets have become a common methodology in modern chip design. Chiplets improve yield and enable heterogeneity at the level of cores, memory subsystem and the interconnect. Convolutional Neural Networks (CNNs) have high computational, bandwidth and memory capacity requirements owing to the increasingly large amount of weights. Thus to exploit chiplet-based architectures, CNNs must be optimized in terms of scheduling and workload distribution among computing resources. We propose Shisha, an online approach to generate and schedule parallel CNN pipelines on chiplet architectures. Shisha targets heterogeneity in compute performance and memory bandwidth and tunes the pipeline schedule through a fast online exploration technique. We compare Shisha with Simulated Annealing, Hill Climbing and Pipe-Search. On average, the convergence time is improved by ~35x in Shisha compared to other exploration algorithms. Despite the quick exploration, Shisha's solution is often better than that of other heuristic exploration algorithms.
    Curiosity-Driven Exploration via Latent Bayesian Surprise. (arXiv:2104.07495v2 [cs.LG] UPDATED)
    The human intrinsic desire to pursue knowledge, also known as curiosity, is considered essential in the process of skill acquisition. With the aid of artificial curiosity, we could equip current techniques for control, such as Reinforcement Learning, with more natural exploration capabilities. A promising approach in this respect has consisted of using Bayesian surprise on model parameters, i.e. a metric for the difference between prior and posterior beliefs, to favour exploration. In this contribution, we propose to apply Bayesian surprise in a latent space representing the agent's current understanding of the dynamics of the system, drastically reducing the computational costs. We extensively evaluate our method by measuring the agent's performance in terms of environment exploration, for continuous tasks, and looking at the game scores achieved, for video games. Our model is computationally cheap and compares positively with current state-of-the-art methods on several problems. We also investigate the effects caused by stochasticity in the environment, which is often a failure case for curiosity-driven agents. In this regime, the results suggest that our approach is resilient to stochastic transitions.
    Absolute Zero-Shot Learning. (arXiv:2202.11319v1 [cs.CV])
    Considering the increasing concerns about data copyright and privacy issues, we present a novel Absolute Zero-Shot Learning (AZSL) paradigm, i.e., training a classifier with zero real data. The key innovation is to involve a teacher model as the data safeguard to guide the AZSL model training without data leaking. The AZSL model consists of a generator and student network, which can achieve date-free knowledge transfer while maintaining the performance of the teacher network. We investigate `black-box' and `white-box' scenarios in AZSL task as different levels of model security. Besides, we also provide discussion of teacher model in both inductive and transductive settings. Despite embarrassingly simple implementations and data-missing disadvantages, our AZSL framework can retain state-of-the-art ZSL and GZSL performance under the `white-box' scenario. Extensive qualitative and quantitative analysis also demonstrates promising results when deploying the model under `black-box' scenario.
    Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance. (arXiv:2202.11632v1 [stat.ML])
    We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded $(1+\kappa)$-th moment, for some $\kappa \in (0,1]$, we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly this algorithm does not require any explicit gradient clipping or normalization, which have been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning.
    FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators. (arXiv:2202.11214v1 [physics.ao-ph])
    FourCastNet, short for Fourier Forecasting Neural Network, is a global data-driven weather forecasting model that provides accurate short to medium-range global predictions at $0.25^{\circ}$ resolution. FourCastNet accurately forecasts high-resolution, fast-timescale variables such as the surface wind speed, precipitation, and atmospheric water vapor. It has important implications for planning wind energy resources, predicting extreme weather events such as tropical cyclones, extra-tropical cyclones, and atmospheric rivers. FourCastNet matches the forecasting accuracy of the ECMWF Integrated Forecasting System (IFS), a state-of-the-art Numerical Weather Prediction (NWP) model, at short lead times for large-scale variables, while outperforming IFS for variables with complex fine-scale structure, including precipitation. FourCastNet generates a week-long forecast in less than 2 seconds, orders of magnitude faster than IFS. The speed of FourCastNet enables the creation of rapid and inexpensive large-ensemble forecasts with thousands of ensemble-members for improving probabilistic forecasting. We discuss how data-driven deep learning models such as FourCastNet are a valuable addition to the meteorology toolkit to aid and augment NWP models.
    Deep Graph Learning for Anomalous Citation Detection. (arXiv:2202.11360v1 [cs.LG])
    Anomaly detection is one of the most active research areas in various critical domains, such as healthcare, fintech, and public security. However, little attention has been paid to scholarly data, i.e., anomaly detection in a citation network. Citation is considered as one of the most crucial metrics to evaluate the impact of scientific research, which may be gamed in multiple ways. Therefore, anomaly detection in citation networks is of significant importance to identify manipulation and inflation of citations. To address this open issue, we propose a novel deep graph learning model, namely GLAD (Graph Learning for Anomaly Detection), to identify anomalies in citation networks. GLAD incorporates text semantic mining to network representation learning by adding both node attributes and link attributes via graph neural networks. It exploits not only the relevance of citation contents but also hidden relationships between papers. Within the GLAD framework, we propose an algorithm called CPU (Citation PUrpose) to discover the purpose of citation based on citation texts. The performance of GLAD is validated through a simulated anomalous citation dataset. Experimental results demonstrate the effectiveness of GLAD on the anomalous citation detection task.
    Super-resolution GANs of randomly-seeded fields. (arXiv:2202.11701v1 [physics.flu-dyn])
    Reconstruction of field quantities from sparse measurements is a problem arising in a broad spectrum of applications. This task is particularly challenging when mapping between point sparse measurements and field quantities shall be performed in an unsupervised manner. Further complexity is added for moving sensors and/or random on-off status. Under such conditions, the most straightforward solution is to interpolate the scattered data onto a regular grid. However, the spatial resolution achieved with this approach is ultimately limited by the mean spacing between the sparse measurements. In this work, we propose a novel super-resolution generative adversarial network (GAN) framework to estimate field quantities from random sparse sensors without needing any full-resolution field for training. The algorithm exploits random sampling to provide incomplete views of the high-resolution underlying distributions. It is hereby referred to as RAndomly-SEEDed super-resolution GAN (RaSeedGAN). The proposed technique is tested on synthetic databases of fluid flow simulations, ocean surface temperature distributions measurements, and particle image velocimetry data of a zero-pressure-gradient turbulent boundary layer. The results show an excellent performance of the proposed methodology even in cases with a high level of gappyness (>50\%) or noise conditions. To our knowledge, this is the first super-resolution GANs algorithm for full-field estimation from randomly-seeded fields with no need of a full-field high-resolution representation during training nor of a library of training examples.  ( 2 min )
    Real-time Over-the-air Adversarial Perturbations for Digital Communications using Deep Neural Networks. (arXiv:2202.11197v1 [cs.CR])
    Deep neural networks (DNNs) are increasingly being used in a variety of traditional radiofrequency (RF) problems. Previous work has shown that while DNN classifiers are typically more accurate than traditional signal processing algorithms, they are vulnerable to intentionally crafted adversarial perturbations which can deceive the DNN classifiers and significantly reduce their accuracy. Such intentional adversarial perturbations can be used by RF communications systems to avoid reactive-jammers and interception systems which rely on DNN classifiers to identify their target modulation scheme. While previous research on RF adversarial perturbations has established the theoretical feasibility of such attacks using simulation studies, critical questions concerning real-world implementation and viability remain unanswered. This work attempts to bridge this gap by defining class-specific and sample-independent adversarial perturbations which are shown to be effective yet computationally feasible in real-time and time-invariant. We demonstrate the effectiveness of these attacks over-the-air across a physical channel using software-defined radios (SDRs). Finally, we demonstrate that these adversarial perturbations can be emitted from a source other than the communications device, making these attacks practical for devices that cannot manipulate their transmitted signals at the physical layer.
    Differentiable and Learnable Robot Models. (arXiv:2202.11217v1 [cs.RO])
    Building differentiable simulations of physical processes has recently received an increasing amount of attention. Specifically, some efforts develop differentiable robotic physics engines motivated by the computational benefits of merging rigid body simulations with modern differentiable machine learning libraries. Here, we present a library that focuses on the ability to combine data driven methods with analytical rigid body computations. More concretely, our library \emph{Differentiable Robot Models} implements both \emph{differentiable} and \emph{learnable} models of the kinematics and dynamics of robots in Pytorch. The source-code is available at \url{https://github.com/facebookresearch/differentiable-robot-model}
    Learning Neural Networks under Input-Output Specifications. (arXiv:2202.11246v1 [eess.SY])
    In this paper, we examine an important problem of learning neural networks that certifiably meet certain specifications on input-output behaviors. Our strategy is to find an inner approximation of the set of admissible policy parameters, which is convex in a transformed space. To this end, we address the key technical challenge of convexifying the verification condition for neural networks, which is derived by abstracting the nonlinear specifications and activation functions with quadratic constraints. In particular, we propose a reparametrization scheme of the original neural network based on loop transformation, which leads to a convex condition that can be enforced during learning. This theoretical construction is validated in an experiment that specifies reachable sets for different regions of inputs.
    Exploring Classic Quantitative Strategies. (arXiv:2202.11309v1 [q-fin.GN])
    The goal of this paper is to debunk and dispel the magic behind the black-box quantitative strategies. It aims to build a solid foundation on how and why the techniques work. This manuscript crystallizes this knowledge by deriving from simple intuitions, the mathematics behind the strategies. This tutorial doesn't shy away from addressing both the formal and informal aspects of quantitative strategies. By doing so, it hopes to provide readers with a deeper understanding of these techniques as well as the when, the how and the why of applying these techniques. The strategies are presented in terms of both S\&P500 and SH510300 data sets. However, the results from the tests are just examples of how the methods work; no claim is made on the suggestion of real market positions.
    Preformer: Predictive Transformer with Multi-Scale Segment-wise Correlations for Long-Term Time Series Forecasting. (arXiv:2202.11356v1 [cs.LG])
    Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting since its complexity increases quadratically with the length of time series, but also cannot explicitly capture the predictive dependencies from contexts since the corresponding key and value are transformed from the same point. This paper proposes a predictive Transformer-based model called {\em Preformer}. Preformer introduces a novel efficient {\em Multi-Scale Segment-Correlation} mechanism that divides time series into segments and utilizes segment-wise correlation-based attention for encoding time series. A multi-scale structure is developed to aggregate dependencies at different temporal scales and facilitate the selection of segment length. Preformer further designs a predictive paradigm for decoding, where the key and value come from two successive segments rather than the same segment. In this way, if a key segment has a high correlation score with the query segment, its successive segment contributes more to the prediction of the query segment. Extensive experiments demonstrate that our Preformer outperforms other Transformer-based methods.
    Performance Modeling of Metric-Based Serverless Computing Platforms. (arXiv:2202.11247v1 [cs.DC])
    Analytical performance models are very effective in ensuring the quality of service and cost of service deployment remain desirable under different conditions and workloads. While various analytical performance models have been proposed for previous paradigms in cloud computing, serverless computing lacks such models that can provide developers with performance guarantees. Besides, most serverless computing platforms still require developers' input to specify the configuration for their deployment that could affect both the performance and cost of their deployment, without providing them with any direct and immediate feedback. In previous studies, we built such performance models for steady-state and transient analysis of scale-per-request serverless computing platforms (e.g., AWS Lambda, Azure Functions, Google Cloud Functions) that could give developers immediate feedback about the quality of service and cost of their deployments. In this work, we aim to develop analytical performance models for the latest trend in serverless computing platforms that use concurrency value and the rate of requests per second for autoscaling decisions. Examples of such serverless computing platforms are Knative and Google Cloud Run (a managed Knative service by Google). The proposed performance model can help developers and providers predict the performance and cost of deployments with different configurations which could help them tune the configuration toward the best outcome. We validate the applicability and accuracy of the proposed performance model by extensive real-world experimentation on Knative and show that our performance model is able to accurately predict the steady-state characteristics of a given workload with minimal amount of data collection.
    Web of Scholars: A Scholar Knowledge Graph. (arXiv:2202.11311v1 [cs.DL])
    In this work, we demonstrate a novel system, namely Web of Scholars, which integrates state-of-the-art mining techniques to search, mine, and visualize complex networks behind scholars in the field of Computer Science. Relying on the knowledge graph, it provides services for fast, accurate, and intelligent semantic querying as well as powerful recommendations. In addition, in order to realize information sharing, it provides an open API to be served as the underlying architecture for advanced functions. Web of Scholars takes advantage of knowledge graph, which means that it will be able to access more knowledge if more search exist. It can be served as a useful and interoperable tool for scholars to conduct in-depth analysis within Science of Science.
    Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet. (arXiv:2202.11169v1 [eess.AS])
    Neural speech synthesis models can synthesize high quality speech but typically require a high computational complexity to do so. In previous work, we introduced LPCNet, which uses linear prediction to significantly reduce the complexity of neural synthesis. In this work, we further improve the efficiency of LPCNet -- targeting both algorithmic and computational improvements -- to make it usable on a wide variety of devices. We demonstrate an improvement in synthesis quality while operating 2.5x faster. The resulting open-source LPCNet algorithm can perform real-time neural synthesis on most existing phones and is even usable in some embedded devices.
    Study of Feature Importance for Quantum Machine Learning Models. (arXiv:2202.11204v1 [quant-ph])
    Predictor importance is a crucial part of data preprocessing pipelines in classical and quantum machine learning (QML). This work presents the first study of its kind in which feature importance for QML models has been explored and contrasted against their classical machine learning (CML) equivalents. We developed a hybrid quantum-classical architecture where QML models are trained and feature importance values are calculated from classical algorithms on a real-world dataset. This architecture has been implemented on ESPN Fantasy Football data using Qiskit statevector simulators and IBM quantum hardware such as the IBMQ Mumbai and IBMQ Montreal systems. Even though we are in the Noisy Intermediate-Scale Quantum (NISQ) era, the physical quantum computing results are promising. To facilitate current quantum scale, we created a data tiering, model aggregation, and novel validation methods. Notably, the feature importance magnitudes from the quantum models had a much higher variation when contrasted to classical models. We can show that equivalent QML and CML models are complementary through diversity measurements. The diversity between QML and CML demonstrates that both approaches can contribute to a solution in different ways. Within this paper we focus on Quantum Support Vector Classifiers (QSVC), Variational Quantum Circuit (VQC), and their classical counterparts. The ESPN and IBM fantasy footballs Trade Assistant combines advanced statistical analysis with the natural language processing of Watson Discovery to serve up personalized trade recommendations that are fair and proposes a trade. Here, player valuation data of each player has been considered and this work can be extended to calculate the feature importance of other QML models such as Quantum Boltzmann machines.
    No-Regret Learning with Unbounded Losses: The Case of Logarithmic Pooling. (arXiv:2202.11219v1 [cs.LG])
    For each of $T$ time steps, $m$ experts report probability distributions over $n$ outcomes; we wish to learn to aggregate these forecasts in a way that attains a no-regret guarantee. We focus on the fundamental and practical aggregation method known as logarithmic pooling -- a weighted average of log odds -- which is in a certain sense the optimal choice of pooling method if one is interested in minimizing log loss (as we take to be our loss function). We consider the problem of learning the best set of parameters (i.e. expert weights) in an online adversarial setting. We assume (by necessity) that the adversarial choices of outcomes and forecasts are consistent, in the sense that experts report calibrated forecasts. Our main result is an algorithm based on online mirror descent that learns expert weights in a way that attains $O(\sqrt{T} \log T)$ expected regret as compared with the best weights in hindsight.
    Exploring Edge Disentanglement for Node Classification. (arXiv:2202.11245v1 [cs.SI])
    Edges in real-world graphs are typically formed by a variety of factors and carry diverse relation semantics. For example, connections in a social network could indicate friendship, being colleagues, or living in the same neighborhood. However, these latent factors are usually concealed behind mere edge existence due to the data collection and graph formation processes. Despite rapid developments in graph learning over these years, most models take a holistic approach and treat all edges as equal. One major difficulty in disentangling edges is the lack of explicit supervisions. In this work, with close examination of edge patterns, we propose three heuristics and design three corresponding pretext tasks to guide the automatic edge disentanglement. Concretely, these self-supervision tasks are enforced on a designed edge disentanglement module to be trained jointly with the downstream node classification task to encourage automatic edge disentanglement. Channels of the disentanglement module are expected to capture distinguishable relations and neighborhood interactions, and outputs from them are aggregated as node representations. The proposed DisGNN is easy to be incorporated with various neural architectures, and we conduct experiments on $6$ real-world datasets. Empirical results show that it can achieve significant performance gains.
    A New Generation of Perspective API: Efficient Multilingual Character-level Transformers. (arXiv:2202.11176v1 [cs.CL])
    On the world wide web, toxic content detectors are a crucial line of defense against potentially hateful and offensive messages. As such, building highly effective classifiers that enable a safer internet is an important research area. Moreover, the web is a highly multilingual, cross-cultural community that develops its own lingo over time. As such, it is crucial to develop models that are effective across a diverse range of languages, usages, and styles. In this paper, we present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model that is applicable across a range of languages, domains, and tasks. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings. We additionally outline the techniques employed to make such a byte-level model efficient and feasible for productionization. Through extensive experiments on multilingual toxic comment classification benchmarks derived from real API traffic and evaluation on an array of code-switching, covert toxicity, emoji-based hate, human-readable obfuscation, distribution shift, and bias evaluation settings, we show that our proposed approach outperforms strong baselines. Finally, we present our findings from deploying this system in production.
    Continual learning-based probabilistic slow feature analysis for multimode dynamic process monitoring. (arXiv:2202.11295v1 [cs.LG])
    In this paper, a novel multimode dynamic process monitoring approach is proposed by extending elastic weight consolidation (EWC) to probabilistic slow feature analysis (PSFA) in order to extract multimode slow features for online monitoring. EWC was originally introduced in the setting of machine learning of sequential multi-tasks with the aim of avoiding catastrophic forgetting issue, which equally poses as a major challenge in multimode dynamic process monitoring. When a new mode arrives, a set of data should be collected so that this mode can be identified by PSFA and prior knowledge. Then, a regularization term is introduced to prevent new data from significantly interfering with the learned knowledge, where the parameter importance measures are estimated. The proposed method is denoted as PSFA-EWC, which is updated continually and capable of achieving excellent performance for successive modes. Different from traditional multimode monitoring algorithms, PSFA-EWC furnishes backward and forward transfer ability. The significant features of previous modes are retained while consolidating new information, which may contribute to learning new relevant modes. Compared with several known methods, the effectiveness of the proposed method is demonstrated via a continuous stirred tank heater and a practical coal pulverizing system.
    Bitwidth Heterogeneous Federated Learning with Progressive Weight Dequantization. (arXiv:2202.11453v1 [cs.LG])
    In practical federated learning scenarios, the participating devices may have different bitwidths for computation and memory storage by design. However, despite the progress made in device-heterogeneous federated learning scenarios, the heterogeneity in the bitwidth specifications in the hardware has been mostly overlooked. We introduce a pragmatic FL scenario with bitwidth heterogeneity across the participating devices, dubbed as Bitwidth Heterogeneous Federated Learning (BHFL). BHFL brings in a new challenge, that the aggregation of model parameters with different bitwidths could result in severe performance degeneration, especially for high-bitwidth models. To tackle this problem, we propose ProWD framework, which has a trainable weight dequantizer at the central server that progressively reconstructs the low-bitwidth weights into higher bitwidth weights, and finally into full-precision weights. ProWD further selectively aggregates the model parameters to maximize the compatibility across bit-heterogeneous weights. We validate ProWD against relevant FL baselines on the benchmark datasets, using clients with varying bitwidths. Our ProWD largely outperforms the baseline FL algorithms as well as naive approaches (e.g. grouped averaging) under the proposed BHFL scenario.
    A new LDA formulation with covariates. (arXiv:2202.11527v1 [cs.IR])
    The Latent Dirichlet Allocation (LDA) model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation for the LDA model which incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straight-forward interpretation of the regression coefficients and the analysis of the quantity of cluster-specific elements in each sampling units (instead of the analysis being focused on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show how our algorithm is able to successfully retrieve the true parameter values and the ability to make predictions for the abundance matrix using the information given by the covariates. The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
    Deepfake Detection for Facial Images with Facemasks. (arXiv:2202.11359v1 [cs.CV])
    Hyper-realistic face image generation and manipulation have givenrise to numerous unethical social issues, e.g., invasion of privacy,threat of security, and malicious political maneuvering, which re-sulted in the development of recent deepfake detection methodswith the rising demands of deepfake forensics. Proposed deepfakedetection methods to date have shown remarkable detection perfor-mance and robustness. However, none of the suggested deepfakedetection methods assessed the performance of deepfakes withthe facemask during the pandemic crisis after the outbreak of theCovid-19. In this paper, we thoroughly evaluate the performance ofstate-of-the-art deepfake detection models on the deepfakes withthe facemask. Also, we propose two approaches to enhance themasked deepfakes detection:face-patchandface-crop. The experi-mental evaluations on both methods are assessed through the base-line deepfake detection models on the various deepfake datasets.Our extensive experiments show that, among the two methods,face-cropperforms better than theface-patch, and could be a trainmethod for deepfake detection models to detect fake faces withfacemask in real world.
    Model2Detector: Widening the Information Bottleneck for Out-of-Distribution Detection using a Handful of Gradient Steps. (arXiv:2202.11226v1 [cs.LG])
    Out-of-distribution detection is an important capability that has long eluded vanilla neural networks. Deep Neural networks (DNNs) tend to generate over-confident predictions when presented with inputs that are significantly out-of-distribution (OOD). This can be dangerous when employing machine learning systems in the wild as detecting attacks can thus be difficult. Recent advances inference-time out-of-distribution detection help mitigate some of these problems. However, existing methods can be restrictive as they are often computationally expensive. Additionally, these methods require training of a downstream detector model which learns to detect OOD inputs from in-distribution ones. This, therefore, adds latency during inference. Here, we offer an information theoretic perspective on why neural networks are inherently incapable of OOD detection. We attempt to mitigate these flaws by converting a trained model into a an OOD detector using a handful of steps of gradient descent. Our work can be employed as a post-processing method whereby an inference-time ML system can convert a trained model into an OOD detector. Experimentally, we show how our method consistently outperforms the state-of-the-art in detection accuracy on popular image datasets while also reducing computational complexity.
    Globally Convergent Policy Search over Dynamic Filters for Output Estimation. (arXiv:2202.11659v1 [math.OC])
    We introduce the first direct policy search algorithm which provably converges to the globally optimal $\textit{dynamic}$ filter for the classical problem of predicting the outputs of a linear dynamical system, given noisy, partial observations. Despite the ubiquity of partial observability in practice, theoretical guarantees for direct policy search algorithms, one of the backbones of modern reinforcement learning, have proven difficult to achieve. This is primarily due to the degeneracies which arise when optimizing over filters that maintain internal state. In this paper, we provide a new perspective on this challenging problem based on the notion of $\textit{informativity}$, which intuitively requires that all components of a filter's internal state are representative of the true state of the underlying dynamical system. We show that informativity overcomes the aforementioned degeneracy. Specifically, we propose a $\textit{regularizer}$ which explicitly enforces informativity, and establish that gradient descent on this regularized objective - combined with a ``reconditioning step'' - converges to the globally optimal cost a $\mathcal{O}(1/T)$. Our analysis relies on several new results which may be of independent interest, including a new framework for analyzing non-convex gradient descent via convex reformulation, and novel bounds on the solution to linear Lyapunov equations in terms of (our quantitative measure of) informativity.
    Biometric security technology. (arXiv:2202.11459v1 [cs.CR])
    This paper presents an overview of the main topics related to biometric security technology, with the main purpose to provide a primer on this subject. Biometrics can offer greater security and convenience than traditional methods for people recognition. Even if we do not want to replace a classic method (password or handheld token) by a biometric one, for sure, we are potential users of these systems, which will even be mandatory for new passport models. For this reason, to be familiarized with the possibilities of biometric security technology is useful.
    It's not Rocket Science : Interpreting Figurative Language in Narratives. (arXiv:2109.00087v2 [cs.CL] UPDATED)
    Figurative language is ubiquitous in English. Yet, the vast majority of NLP research focuses on literal language. Existing text representations by design rely on compositionality, while figurative language is often non-compositional. In this paper, we study the interpretation of two non-compositional figurative languages (idioms and similes). We collected datasets of fictional narratives containing a figurative expression along with crowd-sourced plausible and implausible continuations relying on the correct interpretation of the expression. We then trained models to choose or generate the plausible continuation. Our experiments show that models based solely on pre-trained language models perform substantially worse than humans on these tasks. We additionally propose knowledge-enhanced models, adopting human strategies for interpreting figurative language types : inferring meaning from the context and relying on the constituent words' literal meanings. The knowledge-enhanced models improve the performance on both the discriminative and generative tasks, further bridging the gap from human performance.  ( 2 min )
    Constant matters: Fine-grained Complexity of Differentially Private Continual Observation Using Completely Bounded Norms. (arXiv:2202.11205v1 [cs.DS])
    We study fine-grained error bounds for differentially private algorithms for averaging and counting in the continual observation model. For this, we use the completely bounded spectral norm (cb norm) from operator algebra. For a matrix $W$, its cb norm is defined as \[ \|{W}\|_{\mathsf{cb}} = \max_{Q} \left\{ \frac{\|{Q \bullet W}\|}{\|{Q}\|} \right\}, \] where $Q \bullet W$ denotes the Schur product and $\|{\cdot}\|$ denotes the spectral norm. We bound the cb norm of two fundamental matrices studied in differential privacy under the continual observation model: the counting matrix $M_{\mathsf{counting}}$ and the averaging matrix $M_{\mathsf{average}}$. For $M_{\mathsf{counting}}$, we give lower and upper bound whose additive gap is $1 + \frac{1}{\pi}$. Our factorization also has two desirable properties sufficient for streaming setting: the factorization contains of lower-triangular matrices and the number of distinct entries in the factorization is exactly $T$. This allows us to compute the factorization on the fly while requiring the curator to store a $T$-dimensional vector. For $M_{\mathsf{average}}$, we show an additive gap between the lower and upper bound of $\approx 0.64$.
    Functional Parcellation of fMRI data using multistage k-means clustering. (arXiv:2202.11206v1 [eess.IV])
    Purpose: Functional Magnetic Resonance Imaging (fMRI) data acquired through resting-state studies have been used to obtain information about the spontaneous activations inside the brain. One of the approaches for analysis and interpretation of resting-state fMRI data require spatially and functionally homogenous parcellation of the whole brain based on underlying temporal fluctuations. Clustering is often used to generate functional parcellation. However, major clustering algorithms, when used for fMRI data, have their limitations. Among commonly used parcellation schemes, a tradeoff exists between intra-cluster functional similarity and alignment with anatomical regions. Approach: In this work, we present a clustering algorithm for resting state and task fMRI data which is developed to obtain brain parcellations that show high structural and functional homogeneity. The clustering is performed by multistage binary k-means clustering algorithm designed specifically for the 4D fMRI data. The results from this multistage k-means algorithm show that by modifying and combining different algorithms, we can take advantage of the strengths of different techniques while overcoming their limitations. Results: The clustering output for resting state fMRI data using the multistage k-means approach is shown to be better than simple k-means or functional atlas in terms of spatial and functional homogeneity. The clusters also correspond to commonly identifiable brain networks. For task fMRI, the clustering output can identify primary and secondary activation regions and provide information about the varying hemodynamic response across different brain regions. Conclusion: The multistage k-means approach can provide functional parcellations of the brain using resting state fMRI data. The method is model-free and is data driven which can be applied to both resting state and task fMRI.
    BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales. (arXiv:2109.07623v2 [cs.SD] UPDATED)
    Algorithmic harmonization - the automated harmonization of a musical piece given its melodic line - is a challenging problem that has garnered much interest from both music theorists and computer scientists. One genre of particular interest is the four-part Baroque chorales of J.S. Bach. Methods for algorithmic chorale harmonization typically adopt a black-box, "data-driven" approach: they do not explicitly integrate principles from music theory but rely on a complex learning model trained with a large amount of chorale data. We propose instead a new harmonization model, called BacHMMachine, which employs a "theory-driven" framework guided by music composition principles, along with a "data-driven" model for learning compositional features within this framework. As its name suggests, BacHMMachine uses a novel Hidden Markov Model based on key and chord transitions, providing a probabilistic framework for learning key modulations and chordal progressions from a given melodic line. This allows for the generation of creative, yet musically coherent chorale harmonizations; integrating compositional principles allows for a much simpler model that results in vast decreases in computational burden and greater interpretability compared to state-of-the-art algorithmic harmonization methods, at no penalty to quality of harmonization or musicality. We demonstrate this improvement via comprehensive experiments and Turing tests comparing BacHMMachine to existing methods.
    Evaluating Inexact Unlearning Requires Revisiting Forgetting. (arXiv:2201.06640v2 [cs.LG] UPDATED)
    Existing methods in inexact unlearning are evaluated by measuring indistinguishability from models retrained after removing the deletion set. We argue that achieving indistinguishability is unnecessary and its practical relaxations are insufficient. We formulate the goal of unlearning as forgetting all information specific to the deletion set while maintaining high utility and resource efficiency. We introduce a novel test for forgetting called Interclass Confusion (IC). Despite being a black-box test, IC can investigate whether information from the deletion set was erased until the early layers of the network. We analyze two aspects of forgetting: (i) memorization and (ii) property generalization. We empirically show that two simple unlearning methods, exact-unlearning and catastrophic-forgetting the final k layers of a network, outperforms prior unlearning methods when scaled to large deletion sets. Overall, we believe our formulation and the IC test will guide the design of better unlearning algorithms.
    Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation. (arXiv:2106.07472v2 [cs.LG] UPDATED)
    Actor-critic methods integrating target networks have exhibited a stupendous empirical success in deep reinforcement learning. However, a theoretical understanding of the use of target networks in actor-critic methods is largely missing in the literature. In this paper, we reduce this gap between theory and practice by proposing the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted reward setting. Our algorithm uses three different timescales: one for the actor and two for the critic. Instead of using the standard single timescale temporal difference (TD) learning algorithm as a critic, we use a two timescales target-based version of TD learning closely inspired from practical actor-critic algorithms implementing target networks. First, we establish asymptotic convergence results for both the critic and the actor under Markovian sampling. Then, we provide a finite-time analysis showing the impact of incorporating a target network into actor-critic methods.  ( 2 min )
    Nonconvex Extension of Generalized Huber Loss for Robust Learning and Pseudo-Mode Statistics. (arXiv:2202.11141v1 [stat.ML])
    We propose an extended generalization of the pseudo Huber loss formulation. We show that using the log-exp transform together with the logistic function, we can create a loss which combines the desirable properties of the strictly convex losses with robust loss functions. With this formulation, we show that a linear convergence algorithm can be utilized to find a minimizer. We further discuss the creation of a quasi-convex composite loss and provide a derivative-free exponential convergence rate algorithm.
    A duality connecting neural network and cosmological dynamics. (arXiv:2202.11104v1 [gr-qc])
    We demonstrate that the dynamics of neural networks trained with gradient descent and the dynamics of scalar fields in a flat, vacuum energy dominated Universe are structurally profoundly related. This duality provides the framework for synergies between these systems, to understand and explain neural network dynamics and new ways of simulating and describing early Universe models. Working in the continuous-time limit of neural networks, we analytically match the dynamics of the mean background and the dynamics of small perturbations around the mean field, highlighting potential differences in separate limits. We perform empirical tests of this analytic description and quantitatively show the dependence of the effective field theory parameters on hyperparameters of the neural network. As a result of this duality, the cosmological constant is matched inversely to the learning rate in the gradient descent update.
    Deep Recurrent Modelling of Granger Causality with Latent Confounding. (arXiv:2202.11286v1 [cs.LG])
    Inferring causal relationships in observational time series data is an important task when interventions cannot be performed. Granger causality is a popular framework to infer potential causal mechanisms between different time series. The original definition of Granger causality is restricted to linear processes and leads to spurious conclusions in the presence of a latent confounder. In this work, we harness the expressive power of recurrent neural networks and propose a deep learning-based approach to model non-linear Granger causality by directly accounting for latent confounders. Our approach leverages multiple recurrent neural networks to parameterise predictive distributions and we propose the novel use of a dual-decoder setup to conduct the Granger tests. We demonstrate the model performance on non-linear stochastic time series for which the latent confounder influences the cause and effect with different time lags; results show the effectiveness of our model compared to existing benchmarks.
    Finding Safe Zones of policies Markov Decision Processes. (arXiv:2202.11593v1 [cs.LG])
    Given a policy, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general the problem is computationally hard. For this reason we concentrate on computing approximate SafeZones. Our main result is a bi-criteria approximation algorithm which gives a factor of almost $2$ approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity. We conclude the paper with an empirical evaluation of our algorithm.
    Quantum Heterogeneous Distributed Deep Learning Architectures: Models, Discussions, and Applications. (arXiv:2202.11200v1 [quant-ph])
    Deep learning (DL) has already become a state-of-the-art technology for various data processing tasks. However, data security and computational overload problems frequently occur due to their high data and computational power dependence. To solve this problem, quantum deep learning (QDL) and distributed deep learning (DDL) are emerging to complement existing DL methods by reducing computational overhead and strengthening data security. Furthermore, a quantum distributed deep learning (QDDL) technique that combines these advantages and maximizes them is in the spotlight. QDL takes computational gains by replacing deep learning computations on local devices and servers with quantum deep learning. On the other hand, besides the advantages of the existing distributed learning structure, it can increase data security by using a quantum secure communication protocol between the server and the client. Although many attempts have been made to confirm and demonstrate these various possibilities, QDDL research is still in its infancy. This paper discusses the model structure studied so far and its possibilities and limitations to introduce and promote these studies. It also discusses the areas of applied research so far and in the future and the possibilities of new methodologies.
    Margin-distancing for safe model explanation. (arXiv:2202.11266v1 [cs.LG])
    The growing use of machine learning models in consequential settings has highlighted an important and seemingly irreconcilable tension between transparency and vulnerability to gaming. While this has sparked sizable debate in legal literature, there has been comparatively less technical study of this contention. In this work, we propose a clean-cut formulation of this tension and a way to make the tradeoff between transparency and gaming. We identify the source of gaming as being points close to the \emph{decision boundary} of the model. And we initiate an investigation on how to provide example-based explanations that are expansive and yet consistent with a version space that is sufficiently uncertain with respect to the boundary points' labels. Finally, we furnish our theoretical results with empirical investigations of this tradeoff on real-world datasets.
    GIFT: Graph-guIded Feature Transfer for Cold-Start Video Click-Through Rate Prediction. (arXiv:2202.11525v1 [cs.IR])
    Short video has witnessed rapid growth in China and shows a promising market for promoting the sales of products in e-commerce platforms like Taobao. To ensure the freshness of the content, the platform needs to release a large number of new videos every day, which makes the conventional click-through rate (CTR) prediction model suffer from the severe item cold-start problem. In this paper, we propose GIFT, an efficient Graph-guIded Feature Transfer system, to fully take advantages of the rich information of warmed-up videos that related to the cold-start video. More specifically, we conduct feature transfer from warmed-up videos to those cold-start ones by involving the physical and semantic linkages into a heterogeneous graph. The former linkages consist of those explicit relationships (e.g., sharing the same category, under the same authorship etc.), while the latter measure the proximity of multimodal representations of two videos. In practice, the style, content, and even the recommendation pattern are pretty similar among those physically or semantically related videos. Besides, in order to provide the robust id representations and historical statistics obtained from warmed-up neighbors that cold-start videos covet most, we elaborately design the transfer function to make aware of different transferred features from different types of nodes and edges along the metapath on the graph. Extensive experiments on a large real-world dataset show that our GIFT system outperforms SOTA methods significantly and brings a 6.82% lift on click-through rate (CTR) in the homepage of Taobao App.
    Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning. (arXiv:2202.11566v1 [cs.LG])
    Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions. Previous methods tackle such problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the generalization of value functions beyond the offline data and also lack precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms.
    FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks. (arXiv:2202.11364v1 [cs.LG])
    Transformers achieve remarkable performance in various domains, including NLP, CV, audio processing, and graph analysis. However, they do not scale well on long sequence tasks due to their quadratic complexity w.r.t. the inputs length. Linear Transformers were proposed to address this limitation. However, these models have shown weaker performance on the long sequence tasks comparing to the original one. In this paper, we explore Linear Transformer models, rethinking their two core components. Firstly, we improved Linear Transformer with Shift-Invariant Kernel Function SIKF, which achieve higher accuracy without loss in speed. Secondly, we introduce FastRPB which stands for Fast Relative Positional Bias, which efficiently adds positional information to self-attention using Fast Fourier Transformation. FastRPB is independent of the self-attention mechanism and can be combined with an original self-attention and all its efficient variants. FastRPB has O(N log(N)) computational complexity, requiring O(N) memory w.r.t. input sequence length N.
    Reinforcement Learning from Demonstrations by Novel Interactive Expert and Application to Automatic Berthing Control Systems for Unmanned Surface Vessel. (arXiv:2202.11325v1 [eess.SY])
    In this paper, two novel practical methods of Reinforcement Learning from Demonstration (RLfD) are developed and applied to automatic berthing control systems for Unmanned Surface Vessel. A new expert data generation method, called Model Predictive Based Expert (MPBE) which combines Model Predictive Control and Deep Deterministic Policy Gradient, is developed to provide high quality supervision data for RLfD algorithms. A straightforward RLfD method, model predictive Deep Deterministic Policy Gradient (MP-DDPG), is firstly introduced by replacing the RL agent with MPBE to directly interact with the environment. Then distribution mismatch problem is analyzed for MP-DDPG, and two techniques that alleviate distribution mismatch are proposed. Furthermore, another novel RLfD algorithm based on the MP-DDPG, called Self-Guided Actor-Critic (SGAC) is present, which can effectively leverage MPBE by continuously querying it to generate high quality expert data online. The distribution mismatch problem leading to unstable learning process is addressed by SGAC in a DAgger manner. In addition, theoretical analysis is given to prove that SGAC algorithm can converge with guaranteed monotonic improvement. Simulation results verify the effectiveness of MP-DDPG and SGAC to accomplish the ship berthing control task, and show advantages of SGAC comparing with other typical reinforcement learning algorithms and MP-DDPG.
  • Open

    Exponential Tail Local Rademacher Complexity Risk Bounds Without the Bernstein Condition. (arXiv:2202.11461v1 [math.ST])
    The local Rademacher complexity framework is one of the most successful general-purpose toolboxes for establishing sharp excess risk bounds for statistical estimators based on the framework of empirical risk minimization. Applying this toolbox typically requires using the Bernstein condition, which often restricts applicability to convex and proper settings. Recent years have witnessed several examples of problems where optimal statistical performance is only achievable via non-convex and improper estimators originating from aggregation theory, including the fundamental problem of model selection. These examples are currently outside of the reach of the classical localization theory. In this work, we build upon the recent approach to localization via offset Rademacher complexities, for which a general high-probability theory has yet to be established. Our main result is an exponential-tail excess risk bound expressed in terms of the offset Rademacher complexity that yields results at least as sharp as those obtainable via the classical theory. However, our bound applies under an estimator-dependent geometric condition (the "offset condition") instead of the estimator-independent (but, in general, distribution-dependent) Bernstein condition on which the classical theory relies. Our results apply to improper prediction regimes not directly covered by the classical theory.  ( 2 min )
    Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut. (arXiv:2202.11539v1 [cs.CV])
    Transformers trained with self-supervised learning using self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses the self-supervised transformer features to discover an object from an image. Visual tokens are viewed as nodes in a weighted graph with edges representing a connectivity score based on the similarity of tokens. Foreground objects can then be segmented using a normalized graph-cut to group self-similar regions. We solve the graph-cut problem using spectral clustering with generalized eigen-decomposition and show that the second smallest eigenvector provides a cutting solution since its absolute value indicates the likelihood that a token belongs to a foreground object. Despite its simplicity, this approach significantly boosts the performance of unsupervised object discovery: we improve over the recent state of the art LOST by a margin of 6.9%, 8.1%, and 8.1% respectively on the VOC07, VOC12, and COCO20K. The performance can be further improved by adding a second stage class-agnostic detector (CAD). Our proposed method can be easily extended to unsupervised saliency detection and weakly supervised object detection. For unsupervised saliency detection, we improve IoU for 4.9%, 5.2%, 12.9% on ECSSD, DUTS, DUT-OMRON respectively compared to previous state of the art. For weakly supervised object detection, we achieve competitive performance on CUB and ImageNet.  ( 2 min )
    A Class of Geometric Structures in Transfer Learning: Minimax Bounds and Optimality. (arXiv:2202.11685v1 [cs.LG])
    We study the problem of transfer learning, observing that previous efforts to understand its information-theoretic limits do not fully exploit the geometric structure of the source and target domains. In contrast, our study first illustrates the benefits of incorporating a natural geometric structure within a linear regression model, which corresponds to the generalized eigenvalue problem formed by the Gram matrices of both domains. We next establish a finite-sample minimax lower bound, propose a refined model interpolation estimator that enjoys a matching upper bound, and then extend our framework to multiple source domains and generalized linear models. Surprisingly, as long as information is available on the distance between the source and target parameters, negative-transfer does not occur. Simulation studies show that our proposed interpolation estimator outperforms state-of-the-art transfer learning methods in both moderate- and high-dimensional settings.
    BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales. (arXiv:2109.07623v2 [cs.SD] UPDATED)
    Algorithmic harmonization - the automated harmonization of a musical piece given its melodic line - is a challenging problem that has garnered much interest from both music theorists and computer scientists. One genre of particular interest is the four-part Baroque chorales of J.S. Bach. Methods for algorithmic chorale harmonization typically adopt a black-box, "data-driven" approach: they do not explicitly integrate principles from music theory but rely on a complex learning model trained with a large amount of chorale data. We propose instead a new harmonization model, called BacHMMachine, which employs a "theory-driven" framework guided by music composition principles, along with a "data-driven" model for learning compositional features within this framework. As its name suggests, BacHMMachine uses a novel Hidden Markov Model based on key and chord transitions, providing a probabilistic framework for learning key modulations and chordal progressions from a given melodic line. This allows for the generation of creative, yet musically coherent chorale harmonizations; integrating compositional principles allows for a much simpler model that results in vast decreases in computational burden and greater interpretability compared to state-of-the-art algorithmic harmonization methods, at no penalty to quality of harmonization or musicality. We demonstrate this improvement via comprehensive experiments and Turing tests comparing BacHMMachine to existing methods.
    Preformer: Predictive Transformer with Multi-Scale Segment-wise Correlations for Long-Term Time Series Forecasting. (arXiv:2202.11356v1 [cs.LG])
    Transformer-based methods have shown great potential in long-term time series forecasting. However, most of these methods adopt the standard point-wise self-attention mechanism, which not only becomes intractable for long-term forecasting since its complexity increases quadratically with the length of time series, but also cannot explicitly capture the predictive dependencies from contexts since the corresponding key and value are transformed from the same point. This paper proposes a predictive Transformer-based model called {\em Preformer}. Preformer introduces a novel efficient {\em Multi-Scale Segment-Correlation} mechanism that divides time series into segments and utilizes segment-wise correlation-based attention for encoding time series. A multi-scale structure is developed to aggregate dependencies at different temporal scales and facilitate the selection of segment length. Preformer further designs a predictive paradigm for decoding, where the key and value come from two successive segments rather than the same segment. In this way, if a key segment has a high correlation score with the query segment, its successive segment contributes more to the prediction of the query segment. Extensive experiments demonstrate that our Preformer outperforms other Transformer-based methods.
    Learning Quantile Functions without Quantile Crossing for Distribution-free Time Series Forecasting. (arXiv:2111.06581v2 [cs.LG] UPDATED)
    Quantile regression is an effective technique to quantify uncertainty, fit challenging underlying distributions, and often provide full probabilistic predictions through joint learnings over multiple quantile levels. A common drawback of these joint quantile regressions, however, is \textit{quantile crossing}, which violates the desirable monotone property of the conditional quantile function. In this work, we propose the Incremental (Spline) Quantile Functions I(S)QF, a flexible and efficient distribution-free quantile estimation framework that resolves quantile crossing with a simple neural network layer. Moreover, I(S)QF inter/extrapolate to predict arbitrary quantile levels that differ from the underlying training ones. Equipped with the analytical evaluation of the continuous ranked probability score of I(S)QF representations, we apply our methods to NN-based times series forecasting cases, where the savings of the expensive re-training costs for non-trained quantile levels is particularly significant. We also provide a generalization error analysis of our proposed approaches under the sequence-to-sequence setting. Lastly, extensive experiments demonstrate the improvement of consistency and accuracy errors over other baselines.
    FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning. (arXiv:2111.11556v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling several major constraints specific to federated learning, such as communication adaptivity and personalization control. We identify several key desiderata in frameworks for federated learning and introduce a new framework, FLIX, that takes into account the unique challenges brought by federated learning. FLIX has a standard finite-sum form, which enables practitioners to tap into the immense wealth of existing (potentially non-local) methods for distributed optimization. Through a smart initialization that does not require any communication, FLIX does not require the use of local steps but is still provably capable of performing dissimilarity regularization on par with local methods. We give several algorithms for solving the FLIX formulation efficiently under communication constraints. Finally, we corroborate our theoretical results with extensive experimentation.
    TD-GEN: Graph Generation With Tree Decomposition. (arXiv:2106.10656v2 [cs.LG] UPDATED)
    We propose TD-GEN, a graph generation framework based on tree decomposition, and introduce a reduced upper bound on the maximum number of decisions needed for graph generation. The framework includes a permutation invariant tree generation model which forms the backbone of graph generation. Tree nodes are supernodes, each representing a cluster of nodes in the graph. Graph nodes and edges are incrementally generated inside the clusters by traversing the tree supernodes, respecting the structure of the tree decomposition, and following node sharing decisions between the clusters. Finally, we discuss the shortcomings of standard evaluation criteria based on statistical properties of the generated graphs as performance measures. We propose to compare the performance of models based on likelihood. Empirical results on a variety of standard graph generation datasets demonstrate the superior performance of our method.
    A general sample complexity analysis of vanilla policy gradient. (arXiv:2107.11433v3 [cs.LG] UPDATED)
    We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence and sample complexity guarantees for the vanilla policy gradient (PG). Our only assumptions are that the expected return is smooth w.r.t. the policy parameters, that its $H$-step truncated gradient is close to the exact gradient, and a certain ABC assumption. This assumption requires the second moment of the estimated gradient to be bounded by $A \geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full batch gradient and an additive constant $C \geq 0$, or any combination of aforementioned. We show that the ABC assumption is more general than the commonly used assumptions on the policy space to prove convergence to a stationary point. We provide a single convergence theorem that recovers the $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity of PG. Our results also affords greater flexibility in the choice of hyper parameters such as the step size and places no restriction on the batch size $m$, including the single trajectory case (i.e., $m=1$). We then instantiate our theorem in different settings, where we both recover existing results and obtained improved sample complexity, e.g., for convergence to the global optimum for Fisher-non-degenerated parameterized policies.
    Laplacian Constrained Precision Matrix Estimation: Existence and High Dimensional Consistency. (arXiv:2111.00590v2 [stat.ML] UPDATED)
    This paper considers the problem of estimating high dimensional Laplacian constrained precision matrices by minimizing Stein's loss. We obtain a necessary and sufficient condition for existence of this estimator, that consists on checking whether a certain data dependent graph is connected. We also prove consistency in the high dimensional setting under the symmetrized Stein loss. We show that the error rate does not depend on the graph sparsity, or other type of structure, and that Laplacian constraints are sufficient for high dimensional consistency. Our proofs exploit properties of graph Laplacians, the matrix tree theorem, and a characterization of the proposed estimator based on effective graph resistances. We validate our theoretical claims with numerical experiments.
    Multivariate Quantile Function Forecaster. (arXiv:2202.11316v1 [cs.LG])
    We propose Multivariate Quantile Function Forecaster (MQF$^2$), a global probabilistic forecasting method constructed using a multivariate quantile function and investigate its application to multi-horizon forecasting. Prior approaches are either autoregressive, implicitly capturing the dependency structure across time but exhibiting error accumulation with increasing forecast horizons, or multi-horizon sequence-to-sequence models, which do not exhibit error accumulation, but also do typically not model the dependency structure across time steps. MQF$^2$ combines the benefits of both approaches, by directly making predictions in the form of a multivariate quantile function, defined as the gradient of a convex function which we parametrize using input-convex neural networks. By design, the quantile function is monotone with respect to the input quantile levels and hence avoids quantile crossing. We provide two options to train MQF$^2$: with energy score or with maximum likelihood. Experimental results on real-world and synthetic datasets show that our model has comparable performance with state-of-the-art methods in terms of single time step metrics while capturing the time dependency structure.
    Fast Sparse Classification for Generalized Linear and Additive Models. (arXiv:2202.11389v1 [cs.LG])
    We present fast classification techniques for sparse generalized linear and additive models. These techniques can handle thousands of features and thousands of observations in minutes, even in the presence of many highly correlated features. For fast sparse logistic regression, our computational speed-up over other best-subset search techniques owes to linear and quadratic surrogate cuts for the logistic loss that allow us to efficiently screen features for elimination, as well as use of a priority queue that favors a more uniform exploration of features. As an alternative to the logistic loss, we propose the exponential loss, which permits an analytical solution to the line search at each iteration. Our algorithms are generally 2 to 5 times faster than previous approaches. They produce interpretable models that have accuracy comparable to black box models on challenging datasets.
    Neural Generalised AutoRegressive Conditional Heteroskedasticity. (arXiv:2202.11285v1 [cs.LG])
    We propose Neural GARCH, a class of methods to model conditional heteroskedasticity in financial time series. Neural GARCH is a neural network adaptation of the GARCH 1,1 model in the univariate case, and the diagonal BEKK 1,1 model in the multivariate case. We allow the coefficients of a GARCH model to be time varying in order to reflect the constantly changing dynamics of financial markets. The time varying coefficients are parameterised by a recurrent neural network that is trained with stochastic gradient variational Bayes. We propose two variants of our model, one with normal innovations and the other with Students t innovations. We test our models on a wide range of univariate and multivariate financial time series, and we find that the Neural Students t model consistently outperforms the others.
    Efficient CDF Approximations for Normalizing Flows. (arXiv:2202.11322v1 [cs.LG])
    Normalizing flows model a complex target distribution in terms of a bijective transform operating on a simple base distribution. As such, they enable tractable computation of a number of important statistical quantities, particularly likelihoods and samples. Despite these appealing properties, the computation of more complex inference tasks, such as the cumulative distribution function (CDF) over a complex region (e.g., a polytope) remains challenging. Traditional CDF approximations using Monte-Carlo techniques are unbiased but have unbounded variance and low sample efficiency. Instead, we build upon the diffeomorphic properties of normalizing flows and leverage the divergence theorem to estimate the CDF over a closed region in target space in terms of the flux across its \emph{boundary}, as induced by the normalizing flow. We describe both deterministic and stochastic instances of this estimator: while the deterministic variant iteratively improves the estimate by strategically subdividing the boundary, the stochastic variant provides unbiased estimates. Our experiments on popular flow architectures and UCI benchmark datasets show a marked improvement in sample efficiency as compared to traditional estimators.
    FastSHAP: Real-Time Shapley Value Estimation. (arXiv:2107.07436v2 [stat.ML] UPDATED)
    Shapley values are widely used to explain black-box models, but they are costly to calculate because they require many model evaluations. We introduce FastSHAP, a method for estimating Shapley values in a single forward pass using a learned explainer model. FastSHAP amortizes the cost of explaining many inputs via a learning approach inspired by the Shapley value's weighted least squares characterization, and it can be trained using standard stochastic gradient optimization. We compare FastSHAP to existing estimation approaches, revealing that it generates high-quality explanations with orders of magnitude speedup.
    Differentially Private Regression with Unbounded Covariates. (arXiv:2202.11199v1 [cs.CR])
    We provide computationally efficient, differentially private algorithms for the classical regression settings of Least Squares Fitting, Binary Regression and Linear Regression with unbounded covariates. Prior to our work, privacy constraints in such regression settings were studied under strong a priori bounds on covariates. We consider the case of Gaussian marginals and extend recent differentially private techniques on mean and covariance estimation (Kamath et al., 2019; Karwa and Vadhan, 2018) to the sub-gaussian regime. We provide a novel technical analysis yielding differentially private algorithms for the above classical regression settings. Through the case of Binary Regression, we capture the fundamental and widely-studied models of logistic regression and linearly-separable SVMs, learning an unbiased estimate of the true regression vector, up to a scaling factor.
    Exploiting Side Information for Improved Online Learning Algorithms in Wireless Networks. (arXiv:2202.11699v1 [cs.IT])
    In wireless networks, the rate achieved depends on factors like level of interference, hardware impairments, and channel gain. Often, instantaneous values of some of these factors can be measured, and they provide useful information about the instantaneous rate achieved. For example, higher interference implies a lower rate. In this work, we treat any such measurable quality that has a non-zero correlation with the rate achieved as side-information and study how it can be exploited to quickly learn the channel that offers higher throughput (reward). When the mean value of the side-information is known, using control variate theory we develop algorithms that require fewer samples to learn the parameters and can improve the learning rate compared to cases where side-information is ignored. Specifically, we incorporate side-information in the classical Upper Confidence Bound (UCB) algorithm and quantify the gain achieved in the regret performance. We show that the gain is proportional to the amount of the correlation between the reward and associated side-information. We discuss in detail various side-information that can be exploited in cognitive radio and air-to-ground communication in $L-$band. We demonstrate that correlation between the reward and side-information is often strong in practice and exploiting it improves the throughput significantly.
    Nonconvex Extension of Generalized Huber Loss for Robust Learning and Pseudo-Mode Statistics. (arXiv:2202.11141v1 [stat.ML])
    We propose an extended generalization of the pseudo Huber loss formulation. We show that using the log-exp transform together with the logistic function, we can create a loss which combines the desirable properties of the strictly convex losses with robust loss functions. With this formulation, we show that a linear convergence algorithm can be utilized to find a minimizer. We further discuss the creation of a quasi-convex composite loss and provide a derivative-free exponential convergence rate algorithm.
    Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms. (arXiv:2202.11277v1 [cs.IT])
    We consider the problem of quantizing a linear model learned from measurements $\mathbf{X} = \mathbf{W}\boldsymbol{\theta} + \mathbf{v}$. The model is constrained to be representable using only $dB$-bits, where $B \in (0, \infty)$ is a pre-specified budget and $d$ is the dimension of the model. We derive an information-theoretic lower bound for the minimax risk under this setting and show that it is tight with a matching upper bound. This upper bound is achieved using randomized embedding based algorithms. We propose randomized Hadamard embeddings that are computationally efficient while performing near-optimally. We also show that our method and upper-bounds can be extended for two-layer ReLU neural networks. Numerical simulations validate our theoretical claims.
    Contextual Combinatorial Conservative Bandits. (arXiv:1911.11337v2 [cs.LG] UPDATED)
    The problem of multi-armed bandits (MAB) asks to make sequential decisions while balancing between exploitation and exploration, and have been successfully applied to a wide range of practical scenarios. Various algorithms have been designed to achieve a high reward in a long term. However, its short-term performance might be rather low, which is injurious in risk sensitive applications. Building on previous work of conservative bandits, we bring up a framework of contextual combinatorial conservative bandits. An algorithm is presented and a regret bound of $\tilde O(d^2+d\sqrt{T})$ is proven, where $d$ is the dimension of the feature vectors, and $T$ is the total number of time steps. We further provide an algorithm as well as regret analysis for the case when the conservative reward is unknown. Experiments are conducted, and the results validate the effectiveness of our algorithm.
    Many processors, little time: MCMC for partitions via optimal transport couplings. (arXiv:2202.11258v1 [stat.ME])
    Markov chain Monte Carlo (MCMC) methods are often used in clustering since they guarantee asymptotically exact expectations in the infinite-time limit. In finite time, though, slow mixing often leads to poor performance. Modern computing environments offer massive parallelism, but naive implementations of parallel MCMC can exhibit substantial bias. In MCMC samplers of continuous random variables, Markov chain couplings can overcome bias. But these approaches depend crucially on paired chains meetings after a small number of transitions. We show that straightforward applications of existing coupling ideas to discrete clustering variables fail to meet quickly. This failure arises from the "label-switching problem": semantically equivalent cluster relabelings impede fast meeting of coupled chains. We instead consider chains as exploring the space of partitions rather than partitions' (arbitrary) labelings. Using a metric on the partition space, we formulate a practical algorithm using optimal transport couplings. Our theory confirms our method is accurate and efficient. In experiments ranging from clustering of genes or seeds to graph colorings, we show the benefits of our coupling in the highly parallel, time-limited regime.
    Differential privacy for symmetric log-concave mechanisms. (arXiv:2202.11393v1 [cs.CR])
    Adding random noise to database query results is an important tool for achieving privacy. A challenge is to minimize this noise while still meeting privacy requirements. Recently, a sufficient and necessary condition for $(\epsilon, \delta)$-differential privacy for Gaussian noise was published. This condition allows the computation of the minimum privacy-preserving scale for this distribution. We extend this work and provide a sufficient and necessary condition for $(\epsilon, \delta)$-differential privacy for all symmetric and log-concave noise densities. Our results allow fine-grained tailoring of the noise distribution to the dimensionality of the query result. We demonstrate that this can yield significantly lower mean squared errors than those incurred by the currently used Laplace and Gaussian mechanisms for the same $\epsilon$ and $\delta$.
    Towards Speaker Age Estimation with Label Distribution Learning. (arXiv:2202.11424v1 [cs.SD])
    Existing methods for speaker age estimation usually treat it as a multi-class classification or a regression problem. However, precise age identification remains a challenge due to label ambiguity, \emph{i.e.}, utterances from adjacent age of the same person are often indistinguishable. To address this, we utilize the ambiguous information among the age labels, convert each age label into a discrete label distribution and leverage the label distribution learning (LDL) method to fit the data. For each audio data sample, our method produces a age distribution of its speaker, and on top of the distribution we also perform two other tasks: age prediction and age uncertainty minimization. Therefore, our method naturally combines the age classification and regression approaches, which enhances the robustness of our method. We conduct experiments on the public NIST SRE08-10 dataset and a real-world dataset, which exhibit that our method outperforms baseline methods by a relatively large margin, yielding a 10\% reduction in terms of mean absolute error (MAE) on a real-world dataset.
    Sparse Orthogonal Variational Inference for Gaussian Processes. (arXiv:1910.10596v4 [stat.ML] UPDATED)
    We introduce a new interpretation of sparse variational approximations for Gaussian processes using inducing points, which can lead to more scalable algorithms than previous methods. It is based on decomposing a Gaussian process as a sum of two independent processes: one spanned by a finite basis of inducing points and the other capturing the remaining variation. We show that this formulation recovers existing approximations and at the same time allows to obtain tighter lower bounds on the marginal likelihood and new stochastic variational inference algorithms. We demonstrate the efficiency of these algorithms in several Gaussian process models ranging from standard regression to multi-class classification using (deep) convolutional Gaussian processes and report state-of-the-art results on CIFAR-10 among purely GP-based models.
    Meta-learning based Alternating Minimization Algorithm for Non-convex Optimization. (arXiv:2009.04899v6 [cs.LG] UPDATED)
    In this paper, we propose a novel solution for non-convex problems of multiple variables, especially for those typically solved by an alternating minimization (AM) strategy that splits the original optimization problem into a set of sub-problems corresponding to each variable, and then iteratively optimize each sub-problem using a fixed updating rule. However, due to the intrinsic non-convexity of the original optimization problem, the optimization can usually be trapped into spurious local minimum even when each sub-problem can be optimally solved at each iteration. Meanwhile, learning-based approaches, such as deep unfolding algorithms, are highly limited by the lack of labelled data and restricted explainability. To tackle these issues, we propose a meta-learning based alternating minimization (MLAM) method, which aims to minimize a partial of the global losses over iterations instead of carrying minimization on each sub-problem, and it tends to learn an adaptive strategy to replace the handcrafted counterpart resulting in advance on superior performance. Meanwhile, the proposed MLAM still maintains the original algorithmic principle, which contributes to a better interpretability. We evaluate the proposed method on two representative problems, namely, bi-linear inverse problem: matrix completion, and non-linear problem: Gaussian mixture models. The experimental results validate that our proposed approach outperforms AM-based methods in standard settings, and is able to achieve effective optimization in challenging cases while other comparing methods would typically fail.
    A Dimensionality Reduction Method for Finding Least Favorable Priors with a Focus on Bregman Divergence. (arXiv:2202.11598v1 [stat.ML])
    A common way of characterizing minimax estimators in point estimation is by moving the problem into the Bayesian estimation domain and finding a least favorable prior distribution. The Bayesian estimator induced by a least favorable prior, under mild conditions, is then known to be minimax. However, finding least favorable distributions can be challenging due to inherent optimization over the space of probability distributions, which is infinite-dimensional. This paper develops a dimensionality reduction method that allows us to move the optimization to a finite-dimensional setting with an explicit bound on the dimension. The benefit of this dimensionality reduction is that it permits the use of popular algorithms such as projected gradient ascent to find least favorable priors. Throughout the paper, in order to make progress on the problem, we restrict ourselves to Bayesian risks induced by a relatively large class of loss functions, namely Bregman divergences.
    Bayesian Model Selection, the Marginal Likelihood, and Generalization. (arXiv:2202.11678v1 [cs.LG])
    How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.
    How Many Data Are Needed for Robust Learning?. (arXiv:2202.11592v1 [cs.LG])
    We show that the sample complexity of robust interpolation problem could be exponential in the input dimensionality and discover a phase transition phenomenon when the data are in a unit ball. Robust interpolation refers to the problem of interpolating $n$ noisy training data in $\R^d$ by a Lipschitz function. Although this problem has been well understood when the covariates are drawn from an isoperimetry distribution, much remains unknown concerning its performance under generic or even the worst-case distributions. Our results are two-fold: 1) too many data hurt robustness; we provide a tight and universal Lipschitzness lower bound $\Omega(n^{1/d})$ of the interpolating function for arbitrary data distributions. Our result disproves potential existence of an $\mathcal{O}(1)$-Lipschitz function in the overparametrization scenario when $n=\exp(\omega(d))$. 2) Small data hurt robustness: $n=\exp(\Omega(d))$ is necessary for obtaining a good population error under certain distributions by any $\mathcal{O}(1)$-Lipschitz learning algorithm. Perhaps surprisingly, our results shed light on the curse of big data and the blessing of dimensionality for robustness, and discover an intriguing phenomenon of phase transition at $n=\exp(\Theta(d))$.
    Wide Mean-Field Bayesian Neural Networks Ignore the Data. (arXiv:2202.11670v1 [cs.LG])
    Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observered in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.
    Analysis of a Target-Based Actor-Critic Algorithm with Linear Function Approximation. (arXiv:2106.07472v2 [cs.LG] UPDATED)
    Actor-critic methods integrating target networks have exhibited a stupendous empirical success in deep reinforcement learning. However, a theoretical understanding of the use of target networks in actor-critic methods is largely missing in the literature. In this paper, we reduce this gap between theory and practice by proposing the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted reward setting. Our algorithm uses three different timescales: one for the actor and two for the critic. Instead of using the standard single timescale temporal difference (TD) learning algorithm as a critic, we use a two timescales target-based version of TD learning closely inspired from practical actor-critic algorithms implementing target networks. First, we establish asymptotic convergence results for both the critic and the actor under Markovian sampling. Then, we provide a finite-time analysis showing the impact of incorporating a target network into actor-critic methods.
    Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery. (arXiv:1908.01842v5 [cs.LG] UPDATED)
    Real world data often exhibit low-dimensional geometric structures, and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of H\"{o}lder functions on low-dimensional manifolds using deep ReLU networks. Suppose $n$ training data are sampled from a H\"{o}lder function in $\mathcal{H}^{s,\alpha}$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$, with sub-gaussian noise. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha)}{2(s+\alpha) + d}}\log^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures of data, and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.
    Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data. (arXiv:2006.10833v3 [cs.LG] UPDATED)
    On time-series data, most causal discovery methods fit a new model whenever they encounter samples from a new underlying causal graph. However, these samples often share relevant information which is lost when following this approach. Specifically, different samples may share the dynamics which describe the effects of their causal relations. We propose Amortized Causal Discovery, a novel framework that leverages such shared dynamics to learn to infer causal relations from time-series data. This enables us to train a single, amortized model that infers causal relations across samples with different underlying causal graphs, and thus leverages the shared dynamics information. We demonstrate experimentally that this approach, implemented as a variational model, leads to significant improvements in causal discovery performance, and show how it can be extended to perform well under added noise and hidden confounding.
    Amortised Likelihood-free Inference for Expensive Time-series Simulators with Signatured Ratio Estimation. (arXiv:2202.11585v1 [stat.ML])
    Simulation models of complex dynamics in the natural and social sciences commonly lack a tractable likelihood function, rendering traditional likelihood-based statistical inference impossible. Recent advances in machine learning have introduced novel algorithms for estimating otherwise intractable likelihood functions using a likelihood ratio trick based on binary classifiers. Consequently, efficient likelihood approximations can be obtained whenever good probabilistic classifiers can be constructed. We propose a kernel classifier for sequential data using path signatures based on the recently introduced signature kernel. We demonstrate that the representative power of signatures yields a highly performant classifier, even in the crucially important case where sample numbers are low. In such scenarios, our approach can outperform sophisticated neural networks for common posterior inference tasks.
    Traversing Time with Multi-Resolution Gaussian Process State-Space Models. (arXiv:2112.03230v2 [cs.LG] UPDATED)
    Gaussian Process state-space models capture complex temporal dependencies in a principled manner by placing a Gaussian Process prior on the transition function. These models have a natural interpretation as discretized stochastic differential equations, but inference for long sequences with fast and slow transitions is difficult. Fast transitions need tight discretizations whereas slow transitions require backpropagating the gradients over long subtrajectories. We propose a novel Gaussian process state-space architecture composed of multiple components, each trained on a different resolution, to model effects on different timescales. The combined model allows traversing time on adaptive scales, providing efficient inference for arbitrarily long sequences with complex dynamics. We benchmark our novel method on semi-synthetic data and on an engine modeling task. In both experiments, our approach compares favorably against its state-of-the-art alternatives that operate on a single time-scale only.
    Robust Geometric Metric Learning. (arXiv:2202.11550v1 [stat.ML])
    This paper proposes new algorithms for the metric learning problem. We start by noticing that several classical metric learning formulations from the literature can be viewed as modified covariance matrix estimation problems. Leveraging this point of view, a general approach, called Robust Geometric Metric Learning (RGML), is then studied. This method aims at simultaneously estimating the covariance matrix of each class while shrinking them towards their (unknown) barycenter. We focus on two specific costs functions: one associated with the Gaussian likelihood (RGML Gaussian), and one with Tyler's M -estimator (RGML Tyler). In both, the barycenter is defined with the Riemannian distance, which enjoys nice properties of geodesic convexity and affine invariance. The optimization is performed using the Riemannian geometry of symmetric positive definite matrices and its submanifold of unit determinant. Finally, the performance of RGML is asserted on real datasets. Strong performance is exhibited while being robust to mislabeled data.
    Can Pretext-Based Self-Supervised Learning Be Boosted by Downstream Data? A Theoretical Analysis. (arXiv:2103.03568v4 [cs.LG] UPDATED)
    Pretext-based self-supervised learning learns the semantic representation via a handcrafted pretext task over unlabeled data and then uses the learned representation for downstream tasks, which effectively reduces the sample complexity of downstream tasks under Conditional Independence (CI) condition. However, the downstream sample complexity gets much worse if the CI condition does not hold. One interesting question is whether we can make the CI condition hold by using downstream data to refine the unlabeled data to boost self-supervised learning. At first glance, one might think that seeing downstream data in advance would always boost the downstream performance. However, we show that it is not intuitively true and point out that in some cases, it hurts the final performance instead. In particular, we prove both model-free and model-dependent lower bounds of the number of downstream samples used for data refinement. Moreover, we conduct various experiments on both synthetic and real-world datasets to verify our theoretical results.
    Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction. (arXiv:2202.11559v1 [physics.comp-ph])
    Parameter reconstructions are indispensable in metrology. Here, on wants to explain $K$ experimental measurements by fitting to them a parameterized model of the measurement process. The model parameters are regularly determined by least-square methods, i.e., by minimizing the sum of the squared residuals between the $K$ model predictions and the $K$ experimental observations, $\chi^2$. The model functions often involve computationally demanding numerical simulations. Bayesian optimization methods are specifically suited for minimizing expensive model functions. However, in contrast to least-square methods such as the Levenberg-Marquardt algorithm, they only take the value of $\chi^2$ into account, and neglect the $K$ individual model outputs. We introduce a Bayesian target-vector optimization scheme that considers all $K$ contributions of the model function and that is specifically suited for parameter reconstruction problems which are often based on hundreds of observations. Its performance is compared to established methods for an optical metrology reconstruction problem and two synthetic least-squares problems. The proposed method outperforms established optimization methods. It also enables to determine accurate uncertainty estimates with very few observations of the actual model function by using Markov chain Monte Carlo sampling on a trained surrogate model.
    A Framework to Learn with Interpretation. (arXiv:2010.09345v4 [cs.LG] UPDATED)
    To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive model in terms of human-understandable high level attribute functions, with minimal loss of accuracy. This is achieved by a dedicated architecture and well chosen regularization penalties. We seek for a small-size dictionary of high level attribute functions that take as inputs the outputs of selected hidden layers and whose outputs feed a linear classifier. We impose strong conciseness on the activation of attributes with an entropy-based criterion while enforcing fidelity to both inputs and outputs of the predictive model. A detailed pipeline to visualize the learnt features is also developed. Moreover, besides generating interpretable models by design, our approach can be specialized to provide post-hoc interpretations for a pre-trained neural network. We validate our approach against several state-of-the-art methods on multiple datasets and show its efficacy on both kinds of tasks.
    Testing Granger Non-Causality in Panels with Cross-Sectional Dependencies. (arXiv:2202.11612v1 [stat.ME])
    This paper proposes a new approach for testing Granger non-causality on panel data. Instead of aggregating panel member statistics, we aggregate their corresponding p-values and show that the resulting p-value approximately bounds the type I error by the chosen significance level even if the panel members are dependent. We compare our approach against the most widely used Granger causality algorithm on panel data and show that our approach yields lower FDR at the same power for large sample sizes and panels with cross-sectional dependencies. Finally, we examine COVID-19 data about confirmed cases and deaths measured in countries/regions worldwide and show that our approach is able to discover the true causal relation between confirmed cases and deaths while state-of-the-art approaches fail.
    Curiosity-Driven Exploration via Latent Bayesian Surprise. (arXiv:2104.07495v2 [cs.LG] UPDATED)
    The human intrinsic desire to pursue knowledge, also known as curiosity, is considered essential in the process of skill acquisition. With the aid of artificial curiosity, we could equip current techniques for control, such as Reinforcement Learning, with more natural exploration capabilities. A promising approach in this respect has consisted of using Bayesian surprise on model parameters, i.e. a metric for the difference between prior and posterior beliefs, to favour exploration. In this contribution, we propose to apply Bayesian surprise in a latent space representing the agent's current understanding of the dynamics of the system, drastically reducing the computational costs. We extensively evaluate our method by measuring the agent's performance in terms of environment exploration, for continuous tasks, and looking at the game scores achieved, for video games. Our model is computationally cheap and compares positively with current state-of-the-art methods on several problems. We also investigate the effects caused by stochasticity in the environment, which is often a failure case for curiosity-driven agents. In this regime, the results suggest that our approach is resilient to stochastic transitions.
    Residual Bootstrap Exploration for Stochastic Linear Bandit. (arXiv:2202.11474v1 [stat.ML])
    We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. Such observation enables us to show that the proposed \texttt{LinReBoot} secure a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.
    Finding Safe Zones of policies Markov Decision Processes. (arXiv:2202.11593v1 [cs.LG])
    Given a policy, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general the problem is computationally hard. For this reason we concentrate on computing approximate SafeZones. Our main result is a bi-criteria approximation algorithm which gives a factor of almost $2$ approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity. We conclude the paper with an empirical evaluation of our algorithm.
    Parallel MCMC Without Embarrassing Failures. (arXiv:2202.11154v1 [stat.ML])
    Embarrassingly parallel Markov Chain Monte Carlo (MCMC) exploits parallel computing to scale Bayesian inference to large datasets by using a two-step approach. First, MCMC is run in parallel on (sub)posteriors defined on data partitions. Then, a server combines local results. While efficient, this framework is very sensitive to the quality of subposterior sampling. Common sampling problems such as missing modes or misrepresentation of low-density regions are amplified -- instead of being corrected -- in the combination phase, leading to catastrophic failures. In this work, we propose a novel combination strategy to mitigate this issue. Our strategy, Parallel Active Inference (PAI), leverages Gaussian Process (GP) surrogate modeling and active learning. After fitting GPs to subposteriors, PAI (i) shares information between GP surrogates to cover missing modes; and (ii) uses active sampling to individually refine subposterior approximations. We validate PAI in challenging benchmarks, including heavy-tailed and multi-modal posteriors and a real-world application to computational neuroscience. Empirical results show that PAI succeeds where previous methods catastrophically fail, with a small communication overhead.
    Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance. (arXiv:2202.11632v1 [stat.ML])
    We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded $(1+\kappa)$-th moment, for some $\kappa \in (0,1]$, we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly this algorithm does not require any explicit gradient clipping or normalization, which have been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning.
    On PAC-Bayesian reconstruction guarantees for VAEs. (arXiv:2202.11455v1 [cs.LG])
    Despite its wide use and empirical successes, the theoretical understanding and study of the behaviour and performance of the variational autoencoder (VAE) have only emerged in the past few years. We contribute to this recent line of work by analysing the VAE's reconstruction ability for unseen test data, leveraging arguments from the PAC-Bayes theory. We provide generalisation bounds on the theoretical reconstruction error, and provide insights on the regularisation effect of VAE objectives. We illustrate our theoretical results with supporting experiments on classical benchmark datasets.
    NetRCA: An Effective Network Fault Cause Localization Algorithm. (arXiv:2202.11269v1 [cs.LG])
    Localizing the root cause of network faults is crucial to network operation and maintenance. However, due to the complicated network architectures and wireless environments, as well as limited labeled data, accurately localizing the true root cause is challenging. In this paper, we propose a novel algorithm named NetRCA to deal with this problem. Firstly, we extract effective derived features from the original raw data by considering temporal, directional, attribution, and interaction characteristics. Secondly, we adopt multivariate time series similarity and label propagation to generate new training data from both labeled and unlabeled data to overcome the lack of labeled samples. Thirdly, we design an ensemble model which combines XGBoost, rule set learning, attribution model, and graph algorithm, to fully utilize all data information and enhance performance. Finally, experiments and analysis are conducted on the real-world dataset from ICASSP 2022 AIOps Challenge to demonstrate the superiority and effectiveness of our approach.
    SSSNET: Semi-Supervised Signed Network Clustering. (arXiv:2110.06623v3 [cs.SI] UPDATED)
    Node embeddings are a powerful tool in the analysis of networks; yet, their full potential for the important task of node clustering has not been fully exploited. In particular, most state-of-the-art methods generating node embeddings of signed networks focus on link sign prediction, and those that pertain to node clustering are usually not graph neural network (GNN) methods. Here, we introduce a novel probabilistic balanced normalized cut loss for training nodes in a GNN framework for semi-supervised signed network clustering, called SSSNET. The method is end-to-end in combining embedding generation and clustering without an intermediate step; it has node clustering as main focus, with an emphasis on polarization effects arising in networks. The main novelty of our approach is a new take on the role of social balance theory for signed network embeddings. The standard heuristic for justifying the criteria for the embeddings hinges on the assumption that "an enemy's enemy is a friend". Here, instead, a neutral stance is assumed on whether or not the enemy of an enemy is a friend. Experimental results on various data sets, including a synthetic signed stochastic block model, a polarized version of it, and real-world data at different scales, demonstrate that SSSNET can achieve comparable or better results than state-of-the-art spectral clustering methods, for a wide range of noise and sparsity levels. SSSNET complements existing methods through the possibility of including exogenous information, in the form of node-level features or labels.
    Learning Fast and Slow for Online Time Series Forecasting. (arXiv:2202.11672v1 [cs.LG])
    The fast adaptation capability of deep neural networks in non-stationary environments is critical for online time series forecasting. Successful solutions require handling changes to new and recurring patterns. However, training deep neural forecaster on the fly is notoriously challenging because of their limited ability to adapt to non-stationary environments and the catastrophic forgetting of old knowledge. In this work, inspired by the Complementary Learning Systems (CLS) theory, we propose Fast and Slow learning Networks (FSNet), a holistic framework for online time-series forecasting to simultaneously deal with abrupt changing and repeating patterns. Particularly, FSNet improves the slowly-learned backbone by dynamically balancing fast adaptation to recent changes and retrieving similar old knowledge. FSNet achieves this mechanism via an interaction between two complementary components of an adapter to monitor each layer's contribution to the lost, and an associative memory to support remembering, updating, and recalling repeating events. Extensive experiments on real and synthetic datasets validate FSNet's efficacy and robustness to both new and recurring patterns. Our code will be made publicly available.
    No-Regret Learning with Unbounded Losses: The Case of Logarithmic Pooling. (arXiv:2202.11219v1 [cs.LG])
    For each of $T$ time steps, $m$ experts report probability distributions over $n$ outcomes; we wish to learn to aggregate these forecasts in a way that attains a no-regret guarantee. We focus on the fundamental and practical aggregation method known as logarithmic pooling -- a weighted average of log odds -- which is in a certain sense the optimal choice of pooling method if one is interested in minimizing log loss (as we take to be our loss function). We consider the problem of learning the best set of parameters (i.e. expert weights) in an online adversarial setting. We assume (by necessity) that the adversarial choices of outcomes and forecasts are consistent, in the sense that experts report calibrated forecasts. Our main result is an algorithm based on online mirror descent that learns expert weights in a way that attains $O(\sqrt{T} \log T)$ expected regret as compared with the best weights in hindsight.
    Reward-Weighted Regression Converges to a Global Optimum. (arXiv:2107.09088v3 [stat.ML] UPDATED)
    Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.
    Globally Convergent Policy Search over Dynamic Filters for Output Estimation. (arXiv:2202.11659v1 [math.OC])
    We introduce the first direct policy search algorithm which provably converges to the globally optimal $\textit{dynamic}$ filter for the classical problem of predicting the outputs of a linear dynamical system, given noisy, partial observations. Despite the ubiquity of partial observability in practice, theoretical guarantees for direct policy search algorithms, one of the backbones of modern reinforcement learning, have proven difficult to achieve. This is primarily due to the degeneracies which arise when optimizing over filters that maintain internal state. In this paper, we provide a new perspective on this challenging problem based on the notion of $\textit{informativity}$, which intuitively requires that all components of a filter's internal state are representative of the true state of the underlying dynamical system. We show that informativity overcomes the aforementioned degeneracy. Specifically, we propose a $\textit{regularizer}$ which explicitly enforces informativity, and establish that gradient descent on this regularized objective - combined with a ``reconditioning step'' - converges to the globally optimal cost a $\mathcal{O}(1/T)$. Our analysis relies on several new results which may be of independent interest, including a new framework for analyzing non-convex gradient descent via convex reformulation, and novel bounds on the solution to linear Lyapunov equations in terms of (our quantitative measure of) informativity.
    A spectral-based analysis of the separation between two-layer neural networks and linear methods. (arXiv:2108.04964v2 [stat.ML] UPDATED)
    We propose a spectral-based approach to analyze how two-layer neural networks separate from linear methods in terms of approximating high-dimensional functions. We show that quantifying this separation can be reduced to estimating the Kolmogorov width of two-layer neural networks, and the latter can be further characterized by using the spectrum of an associated kernel. Different from previous work, our approach allows obtaining upper bounds, lower bounds, and identifying explicit hard functions in a united manner. We provide a systematic study of how the choice of activation functions affects the separation, in particular the dependence on the input dimension. Specifically, for nonsmooth activation functions, we extend known results to more activation functions with sharper bounds. As concrete examples, we prove that any single neuron can instantiate the separation between neural networks and random feature models. For smooth activation functions, one surprising finding is that the separation is negligible unless the norms of inner-layer weights are polynomially large with respect to the input dimension. By contrast, the separation for nonsmooth activation functions is independent of the norms of inner-layer weights.
    A new LDA formulation with covariates. (arXiv:2202.11527v1 [cs.IR])
    The Latent Dirichlet Allocation (LDA) model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation for the LDA model which incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straight-forward interpretation of the regression coefficients and the analysis of the quantity of cluster-specific elements in each sampling units (instead of the analysis being focused on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show how our algorithm is able to successfully retrieve the true parameter values and the ability to make predictions for the abundance matrix using the information given by the covariates. The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
    A Complete Criterion for Value of Information in Soluble Influence Diagrams. (arXiv:2202.11629v1 [cs.AI])
    Influence diagrams have recently been used to analyse the safety and fairness properties of AI systems. A key building block for this analysis is a graphical criterion for value of information (VoI). This paper establishes the first complete graphical criterion for VoI in influence diagrams with multiple decisions. Along the way, we establish two important techniques for proving properties of multi-decision influence diagrams: ID homomorphisms are structure-preserving transformations of influence diagrams, while a Tree of Systems is collection of paths that captures how information and control can flow in an influence diagram.
    Out-of-distribution Generalization via Partial Feature Decorrelation. (arXiv:2007.15241v4 [cs.LG] UPDATED)
    Most deep-learning-based image classification methods assume that all samples are generated under an independent and identically distributed (IID) setting. However, out-of-distribution (OOD) generalization is more common in practice, which means an agnostic context distribution shift between training and testing environments. To address this problem, we present a novel Partial Feature Decorrelation Learning (PFDL) algorithm, which jointly optimizes a feature decomposition network and the target image classification model. The feature decomposition network decomposes feature embeddings into the independent and the correlated parts such that the correlations between features will be highlighted. Then, the correlated features help learn a stable feature representation by decorrelating the highlighted correlations while optimizing the image classification model. We verify the correlation modeling ability of the feature decomposition network on a synthetic dataset. The experiments on real-world datasets demonstrate that our method can improve the backbone model's accuracy on OOD image classification datasets.

  • Open

    "Learning Causal Overhypotheses through Exploration in Children and Computational Models", Kosoy et al 2022
    submitted by /u/gwern [link] [comments]
    "QET: Selective Credit Assignment", Chelu et al 2022 {DM}
    submitted by /u/gwern [link] [comments]
    "VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning", Wang et al 2022 (supervised pretraining, then offline, then online)
    submitted by /u/gwern [link] [comments]
    RL Algorithms Implementation Study Group 2
    Hello everyone, hope all of you are having a great day. I am a Urela and I have a background in computer engineering. I was wondering if anyone would be interested in joining my RL Algorithms Implementation Study Group. I am interested in implementing some SOTA RL algorithms such as IMPALA, SAC, A3C, AlphaZero and I hope to find people with similar interests to work together. I wish to recreate the study group that Costa Huang did. To be transparent, I am final year undergrad student, I comfortable running the group but if someone more experienced is will to take over, I am happy with that. I am very comfortable with the maths and reading the papers, worst case i step down and someone takes over. Costa Huang's Group: https://www.youtube.com/channel/UCDdC6BIFRI0jvcwuhi3aI6w/videos https://www.reddit.com/r/reinforcementlearning/comments/dxebbp/rl_algorithms_implementation_study_group/ ​ Some notes: Costa Hung has no affiliations with this group.!!! If such a group exist, I am sorry to waste your time, please point me in the direction ​ You should probably be dedicating a solid 3 hours+ per week to gain something out of this. It's fine if you just want to join to see what is going on, but if you are willing to commit 3 hours+ per week with this group, please PM me suggest to I will create a discord link and if more than 3 people want to join submitted by /u/Urenmila [link] [comments]  ( 1 min )
    Optimal Hedge With Monte-Carlo
    For https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/mcCal/optimal-hedge-with-monte-carlo , how to use equation 53 to derive equations 54, 55 and 56 ? Besides, I do not understand how the theoretical formula comparison looks similar ? https://preview.redd.it/32fv9ax95pj81.png?width=961&format=png&auto=webp&s=2b6cf56072f977ea5c56f87fe202bf13f941ae45 https://preview.redd.it/we915j8b5pj81.png?width=959&format=png&auto=webp&s=91e8cf89bb2930e327cd28b52ab45c05fa77ce43 https://preview.redd.it/ph4uu7h7vrj81.png?width=959&format=png&auto=webp&s=58e3152812930e99e1b5e2fdc3af6a8783d0ec7b submitted by /u/promach [link] [comments]
    Deriving expression for optimal Q-function
    How to use equation 31 in https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/b8ZIT/mdp-formulation to derive equation 49 of https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/Nyz4f/backward-recursion-for-q-star ? https://preview.redd.it/qzfwhqkpsoj81.png?width=963&format=png&auto=webp&s=df35c34a9268636c5689e9f0a0855d6e43396a19 https://preview.redd.it/le4z7fpqsoj81.png?width=954&format=png&auto=webp&s=35fdff53e6da7ea954c4f06d98dd9798d85fd4d5 submitted by /u/promach [link] [comments]
  • Open

    [P] Hawai - An AI Model capable of friendly wellness support on everyday topics
    Hi We have a very friendly and supportive wellness bot called Hawai. You can share about everyday stuffs and get tips on feeling good and happy. Give it a shot! Spread the word, especially given that millions are affected and feel so isolated!!! You can help just by sharing in social media and among your friends!!! Thanks!!! Start a chat with Hawai - Health and wellbeing AI on Chai![ https://chai.ml/chat/share/\_bot\_d6f48541-d7a4-4f79-81fa-ec2c1b464b9e](https://chai.ml/chat/share/_bot_d6f48541-d7a4-4f79-81fa-ec2c1b464b9e) submitted by /u/SnooPears9870 [link] [comments]  ( 1 min )
    [D] Do you have a tool that let's business users run your models on a self-serve basis?
    I'm curious to hear the thoughts of the community. How many of your models are actually deployed in production/automated environments, and what proportion is used locally that you then send out as a report/csv/dashboard etc.? I see a lot of conversations here that talk about production pipelines and deployments but I work with teams that do tons of analysis, especially for smaller projects, that are done locally. DS teams end up working on business requests that involve downloading files, building some analysis and then sending results out as csvs or dashboards. Is that true for you as well? Are there tools you use to publish non production models so business users (that can't code) can use models on a self-serve basis? submitted by /u/dmart89 [link] [comments]  ( 1 min )
    [D] Treating temporal dimension as channel axis?
    For those of you who have worked with temporal data, have you ever trained a conv net where you treat the temporal axis as the channel axis for depth wise convolutions? In particular you would be extracting local correlations on the spatial information throughout the temporal sequence. ​ If you havent personally are you aware of any work that has done this? submitted by /u/AbjectDrink3276 [link] [comments]  ( 1 min )
    [Research] Looking for volunteers to evaluate multilingual dialogue models (chatbots)!
    Hi! Are you frustrated that most chatbots are only in English? Would you want your native language to be part of a conversational AI system as well? I certainly do! That’s why I’m doing my master’s thesis on multilingual open-domain dialogue systems. If you also feel the same and want to contribute to research on multilingual, intelligent chatbots, then you’re the perfect fit for my thesis project! I’m currently looking for people who can speak one of the following languages fluently: Arabic Bengali Finnish Japanese Korean The task is to post-edit some data in your language (after I used machine translation to translate it from English to the languages) or evaluate the models’ responses in your language. There are 30 short dialogue exchanges in the post editing task. Your task is to check the post machine translated text and correct the grammatical and/or fluency issues if needed. As for the system evaluation, you will chat with the chatbots and rate them on a scale of 5 about your view of their fluency, engagingness, naturalness, etc. I’m looking for 1 volunteer in each language for the post editing task and 2 volunteers in each language for the evaluation task. The tasks are pretty small so they won't take too much of your time. And it'd be fun to chat with the bots (I think) :) The post editing task preferably starts asap and the evaluation task will kick-start by the end of March! If you’re interested, please fill out this form :) I’ll of course credit you in the paper unless you would like to stay anonymous. This is a project I’m really passionate about, and I hope it’ll encourage more research on multilingual dialogue systems in the community! Please feel free to let me know if you’d like to know more! I really, really appreciate it. You can dm me or just leave a comment here. To sign up, please kindly fill out this google form. Thank you so much for reading this post patiently :) Hope you have a great day! submitted by /u/iAmagent606 [link] [comments]  ( 2 min )
    ML writing competition: an essay about ML work/research [P]
    https://thegradient.pub/gradient-prize/ submitted by /u/pahita [link] [comments]
    [D] Software engineers building apps with ML features
    Hey everyone - I know many of you here might be deep into ML but I'm curious about what you think when it comes to tools that make it easier for the typical software engineer to build smart solutions. My background is in computer vision so I'll use it as an example. APIs like Google Vision API and Amazon Rekognition exist, but they're very limited in their applications. What if there really was some higher-level way to build vision applications that's flexible, and works? Imagine something along the lines of having motion detectors, color detectors, object detectors, and other things that worked out of the box. Some could ML-based, other could be traditional computer vision based. Curious to hear your thoughts and whether you've seen anything out there, that's more focused on build solutions rather than focusing on ML as the main function. submitted by /u/happybirthday290 [link] [comments]  ( 2 min )
    [D] Deep Learning (summer) school
    Hey ! I'm looking to register to a (summer) school on ML I came across Deep Learn school. It seems to run 4 times a year which seems a bit sketchy at the first glance. Does anyone already gone to one of this or has any nice summer school propositions ? submitted by /u/eXpensia [link] [comments]  ( 1 min )
    [D] Which are some good applications of machine learning and deep learning on big geospatial data?
    I have a competition going on at my college to harness the power of ML and DL for big geospatial data analysis. I got a few ideas in my mind but I'm looking for some inspiration from others. From my side, I have been researching on using GAN to create super-resolution images from low res satellite images for our self-developed geographic information system (it's like google maps but on a low budget and local scale). So, any ideas or thoughts are welcome. Thanks for your time. submitted by /u/Chrex_007 [link] [comments]  ( 1 min )
    [D] ACL 2022 Acceptances
    When are the ACL 2022 decisions expected to be out? submitted by /u/East-Beginning9987 [link] [comments]  ( 1 min )
    [D] What's hot for Machine Learning Research in 2022?
    Which of the sub-fields/approaches, application areas are expected to gain much attention (pun unintended) this year in the academia? PS: Please don't shy away from suggesting anything that you think or know could be the trending research topic in ML, it is quite likely that what you know can be relatively unknown to many of us here :) submitted by /u/ureepamuree [link] [comments]  ( 3 min )
    Preprocessing unbounded data [D]
    Does anyone know (or have an article) for how to preprocess data without imposing a lower or upper bound on the data? I want to train a GAN that takes inputs of shape (None, 512, 256, 2) and outputs (None, 512, 256, 3). Normally, people use this formula to normalize their data with, however this method makes the assumption that the data has a lower and upper bounds. My data has no bounds, and their values could theoretically range from positive and negative infinity. To make matters worse, in each of the last axis in my data (axis (:, :, :, i)), my data has a different mean and standard deviation, which means each axis needs to be normalized independently. submitted by /u/suoarski [link] [comments]  ( 1 min )
    [R] Observational Studies: Commentaries on Breimen’s Two Cultures paper
    Celebrating the 20th anniversary of Leo Breimen's "Two Cultures" paper, the journal Observational Studies got together a wide variety of scholars to discuss it's legacy. Best part? It's all open access! Enjoy: https://muse.jhu.edu/issue/45147 submitted by /u/bikeskata [link] [comments]
    [R] A Modern Self-Referential Weight Matrix That Learns To Modify Itself
    submitted by /u/EducationalCicada [link] [comments]
  • Open

    Deep-learning technique predicts clinical treatment outcomes
    A new methodology simulates counterfactual, time-varying, and dynamic treatment strategies, allowing doctors to choose the best course of action.  ( 7 min )
    More sensitive X-ray imaging
    Improvements in the material that converts X-rays into light, for medical or industrial images, could allow a tenfold signal enhancement.  ( 6 min )
  • Open

    What AI can you recommend me for writing a text about a topic, and I can tell the AI that I want exactly the word "cat (etc.)" on this place and it should fit in the text?
    submitted by /u/xXLisa28Xx [link] [comments]  ( 1 min )
    Artificial Intelligence in the News
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Fun! Found a new web tool that uses generative ML to "paint with words" and edit images based on your typed input
    submitted by /u/roxanneonreddit [link] [comments]
    Cleerly’s AI scans comparable to gold-standard angiography in detecting heart disease: study
    submitted by /u/DaStock_Doctor [link] [comments]
    Designing an intelligent low-cost Webcam for home security
    Hi all, I want to design a system that can use low cost camera to detect intruders illegally entering the premises. I would like to make a CNN using video footage of home invasions to train the model. Does anybody on this community know where I'll be able to get this data/ video repository of home invasions. Any help will be appreciated. submitted by /u/DaveTheAutist [link] [comments]  ( 1 min )
    LinkedIn’s job-matching AI was biased. The company’s solution? More AI. ZipRecruiter, CareerBuilder, LinkedIn—most of the world’s biggest job search sites use AI to match people with job openings. But the algorithms don’t always play fair.
    submitted by /u/LisaMck041 [link] [comments]  ( 1 min )
    In a Latest Machine Learning Research, Amazon Researchers Propose an End-To-End Noise-Tolerant Embedding Learning Framework, ‘PGE’, to Jointly Leverage Both Text Information and Graph Structure in PG to Learn Embeddings for Error Detection
    With the rapid rise of the internet, e-commerce websites like Amazon, eBay, and Walmart have become essential avenues for online buying and business transactions. Product knowledge graphs (PGs) have garnered significant interest in recent years as an effective approach to organizing product-related information, enabling several real-world applications such as product search and recommendations. A knowledge graph (KG) that represents product attribute values is referred to as a PG. It’s made up of data from a product catalog. Each product in a PG has numerous attributes, including the product brand, product category, and additional information about product qualities like flavor and ingredients. Unlike traditional KGs, where most triples take the form (head entity, relation, tail entity), the majority of triples in a PG take the form (product, attribute, attribute value), with the attribute value being a short text, such as (“Brand A Tortilla Chips Spicy Queso, 6 – 2 oz bags”, flavor, “Spicy Queso”). Attribute triples are a type of triple. Individual shops contribute the vast majority of product catalog data. These self-reported statistics are always riddled with inconsistencies, erroneous values, and confusing values. When such errors are absorbed by a PG, they cause the downstream applications to perform poorly. Manual validation is not possible in a PG due to a large number of products. A solution for automatic validation is desperately needed. CONTINUE READING Paper: https://www.amazon.science/publications/pge-robust-product-graph-embedding-learning-for-error-detection https://preview.redd.it/zus1321a9pj81.png?width=1112&format=png&auto=webp&s=fd99754bd739831d4052e20211d45d63211ff4c5 submitted by /u/techsucker [link] [comments]  ( 1 min )
    Yann LeCun on a vision to make AI systems learn and reason like animals and humans
    submitted by /u/nick7566 [link] [comments]
    How long till we see a movie with a script written solely by artificial intelligence?
    submitted by /u/Advanced_Teaching_16 [link] [comments]  ( 1 min )
  • Open

    What distinguishes individual units?
    Forgive me as it’s probably a simple question, but what stops each unit ending up with the same gradient and bias in a layer? I don’t see anything inherently different about eg the first and second units in a layer, yes they have different constants at the end of testing but what property makes those constants different? Thanks submitted by /u/slashchunks [link] [comments]  ( 1 min )
    Almost no one knows how easily you can optimize your AI models
    The situation is fairly simple. Your model could run 10 times faster by adding a few lines to your code, but you weren't aware of it. Let me expand on that. AI applications are multiplying like mushrooms, which is awesome As a result, more and more people are turning to the dark side, joining the AI world, as I did The problem? Developers focus only on AI, cleaning up datasets and training their models. Almost no one has a background in hardware, compilers, computing, cloud, etc The result? Developers spend a lot of hours improving the accuracy and performance of their software, and all their hard work risks being undone by the wrong choice of hardware-software coupling This problem bothered me for a long time, so with a couple of buddies at Nebuly (all ex MIT, ETH and EPFL), we put a lot of energy into an open-source library called nebullvm to make DL compiler technology accessible to any developer, even for those who know nothing about hardware, as I did. How does it work? It speeds up your DL models by ~5-20x by testing the best DL compilers out there and selecting the optimal one to best couple your AI model with your machine (GPU, CPU, etc.). All this in just a few lines of code. The library is open source and you can find it here https://github.com/nebuly-ai/nebullvm. Please leave a star on GitHub for the hard work in building the library :) It's a simple act for you, a big smile for us. Thank you, and don't hesitate to contribute to the library! And let me know in the comments if you have questions! And thank you all!! submitted by /u/emilec___ [link] [comments]  ( 1 min )
  • Open

    Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits. (arXiv:2106.14866v2 [stat.ML] UPDATED)
    We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy, and thereby suffer from an identifiability issue. In contrast, we propose to leverage the demonstrator's behavior en route to optimality, and in particular, the exploration phase, for reward estimation. We begin by establishing a general information-theoretic lower bound under this paradigm that applies to any demonstrator algorithm, which characterizes a fundamental tradeoff between reward estimation and the amount of exploration of the demonstrator. Then, we develop simple and efficient reward estimators for upper-confidence-based demonstrator algorithms that attain the optimal tradeoff, showing in particular that consistent reward estimation -- free of identifiability issues -- is possible under our paradigm. Extensive simulations on both synthetic and semi-synthetic data corroborate our theoretical results.  ( 2 min )
    n-CPS: Generalising Cross Pseudo Supervision to n Networks for Semi-Supervised Semantic Segmentation. (arXiv:2112.07528v3 [cs.CV] UPDATED)
    We present n-CPS - a generalisation of the recent state-of-the-art cross pseudo supervision (CPS) approach for the task of semi-supervised semantic segmentation. In n-CPS, there are n simultaneously trained subnetworks that learn from each other through one-hot encoding perturbation and consistency regularisation. We also show that ensembling techniques applied to subnetworks outputs can significantly improve the performance. To the best of our knowledge, n-CPS paired with CutMix outperforms CPS and sets the new state-of-the-art for Pascal VOC 2012 with (1/16, 1/8, 1/4, and 1/2 supervised regimes) and Cityscapes (1/16 supervised).  ( 2 min )
    Automated question generation and question answering from Turkish texts using text-to-text transformers. (arXiv:2111.06476v3 [cs.LG] UPDATED)
    While exam-style questions are a fundamental educational tool serving a variety of purposes, manual construction of questions is a complex process that requires training, experience and resources. Automatic question generation (QG) techniques can be utilized to satisfy the need for a continuous supply of new questions by streamlining their generation. However, compared to automatic question answering (QA), QG is a more challenging task. In this work, we fine-tune a multilingual T5 (mT5) transformer in a multi-task setting for QA, QG and answer extraction tasks using Turkish QA datasets. To the best of our knowledge, this is the first academic work that performs automated text-to-text question generation from Turkish texts. Experimental evaluations show that the proposed multi-task setting achieves state-of-the-art Turkish question answering and question generation performance on TQuADv1, TQuADv2 datasets and XQuAD Turkish split. The source code and the pre-trained models are available at https://github.com/obss/turkish-question-generation.  ( 2 min )
    Mixture-of-Variational-Experts for Continual Learning. (arXiv:2110.12667v3 [cs.LG] UPDATED)
    One weakness of machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning (CL) paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We discuss this principle from a Bayesian perspective and show its connections to previous approaches to CL. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method in continual supervised learning and in continual reinforcement learning.  ( 2 min )
    Wastewater Pipe Condition Rating Model Using K- Nearest Neighbors. (arXiv:2202.11049v1 [cs.LG])
    Risk-based assessment in pipe condition mainly focuses on prioritizing the most critical assets by evaluating the risk of pipe failure. This paper's goal is to classify a comprehensive pipe rating model which is obtained based on a series of pipe physical, external, and hydraulic characteristics that are identified for the proposed methodology. The traditional manual method of assessing sewage structural conditions takes a long time. By building an automated process using K-Nearest Neighbors (K-NN), this study presents an effective technique to automate the identification of the pipe defect rating using the pipe repair data. First, we performed the Shapiro Wilks Test for 1240 data from the Dept. of Engineering & Environmental Services, Shreveport, Louisiana Phase 3 with 12 variables to determine if factors could be incorporated in the final rating. We then developed a K-Nearest Neighbors model to classify the final rating from the statistically significant factors identified in Shapiro Wilks Test. This classification process allows recognizing the worst condition of wastewater pipes that need to be replaced immediately. This comprehensive model is built according to the industry-accepted and used guidelines to estimate the overall condition. Finally, for validation purposes, the proposed model is applied to a small portion of a US wastewater collection system in Shreveport, Louisiana. Keywords: Pipe rating, Shapiro Wilks Test, K-Nearest Neighbors (KNN), Failure, Risk analysis  ( 2 min )
    Improving Classification Model Performance on Chest X-Rays through Lung Segmentation. (arXiv:2202.10971v1 [eess.IV])
    Chest radiography is an effective screening tool for diagnosing pulmonary diseases. In computer-aided diagnosis, extracting the relevant region of interest, i.e., isolating the lung region of each radiography image, can be an essential step towards improved performance in diagnosing pulmonary disorders. Methods: In this work, we propose a deep learning approach to enhance abnormal chest x-ray (CXR) identification performance through segmentations. Our approach is designed in a cascaded manner and incorporates two modules: a deep neural network with criss-cross attention modules (XLSor) for localizing lung region in CXR images and a CXR classification model with a backbone of a self-supervised momentum contrast (MoCo) model pre-trained on large-scale CXR data sets. The proposed pipeline is evaluated on Shenzhen Hospital (SH) data set for the segmentation module, and COVIDx data set for both segmentation and classification modules. Novel statistical analysis is conducted in addition to regular evaluation metrics for the segmentation module. Furthermore, the results of the optimized approach are analyzed with gradient-weighted class activation mapping (Grad-CAM) to investigate the rationale behind the classification decisions and to interpret its choices. Results and Conclusion: Different data sets, methods, and scenarios for each module of the proposed pipeline are examined for designing an optimized approach, which has achieved an accuracy of 0.946 in distinguishing abnormal CXR images (i.e., Pneumonia and COVID-19) from normal ones. Numerical and visual validations suggest that applying automated segmentation as a pre-processing step for classification improves the generalization capability and the performance of the classification models.  ( 2 min )
    DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression. (arXiv:2202.10584v1 [cs.LG])
    Data reduction in storage systems is becoming increasingly important as an effective solution to minimize the management cost of a data center. To maximize data-reduction efficiency, existing post-deduplication delta-compression techniques perform delta compression along with traditional data deduplication and lossless compression. Unfortunately, we observe that existing techniques achieve significantly lower data-reduction ratios than the optimal due to their limited accuracy in identifying similar data blocks. In this paper, we propose DeepSketch, a new reference search technique for post-deduplication delta compression that leverages the learning-to-hash method to achieve higher accuracy in reference search for delta compression, thereby improving data-reduction efficiency. DeepSketch uses a deep neural network to extract a data block's sketch, i.e., to create an approximate data signature of the block that can preserve similarity with other blocks. Our evaluation using eleven real-world workloads shows that DeepSketch improves the data-reduction ratio by up to 33% (21% on average) over a state-of-the-art post-deduplication delta-compression technique.  ( 2 min )
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments in Ride-sourcing Platforms. (arXiv:2202.10887v1 [stat.ME])
    Policy evaluation based on A/B testing has attracted considerable interest in digital marketing, but such evaluation in ride-sourcing platforms (e.g., Uber and Didi) is not well studied primarily due to the complex structure of their temporal and/or spatial dependent experiments. Motivated by policy evaluation in ride-sourcing platforms, the aim of this paper is to establish causal relationship between platform's policies and outcomes of interest under a switchback design. We propose a novel potential outcome framework based on a temporal varying coefficient decision process (VCDP) model to capture the dynamic treatment effects in temporal dependent experiments. We further characterize the average treatment effect by decomposing it as the sum of direct effect (DE) and indirect effect (IE). We develop estimation and inference procedures for both DE and IE. Furthermore, we propose a spatio-temporal VCDP to deal with spatiotemporal dependent experiments. For both VCDP models, we establish the statistical properties (e.g., weak convergence and asymptotic power) of our estimation and inference procedures. We conduct extensive simulations to investigate the finite-sample performance of the proposed estimation and inference procedures. We examine how our VCDP models can help improve policy evaluation for various dispatching and dispositioning policies in Didi.  ( 2 min )
    Causal inference for observational longitudinal studies using deep survival models. (arXiv:2101.10643v11 [stat.ML] UPDATED)
    Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history. To tackle this longitudinal treatment effect estimation problem, we have developed a causal dynamic survival (CDS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of past, time-dependent covariates and treatments. Using simulated survival datasets, the CDS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In two large clinical cohort studies, CDS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. CDS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions.  ( 2 min )
    Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales. (arXiv:2106.06418v2 [cs.CV] UPDATED)
    The ability to handle large scale variations is crucial for many real world visual tasks. A straightforward approach for handling scale in a deep network is to process an image at several scales simultaneously in a set of scale channels. Scale invariance can then, in principle, be achieved by using weight sharing between the scale channels together with max or average pooling over the outputs from the scale channels. The ability of such scale channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. In this paper, we present a systematic study of this methodology by implementing different types of scale channel networks and evaluating their ability to generalise to previously unseen scales. We develop a formalism for analysing the covariance and invariance properties of scale channel networks, and explore how different design choices, unique to scaling transformations, affect the overall performance of scale channel networks. We first show that two previously proposed scale channel network designs do not generalise well to scales not present in the training set. We explain theoretically and demonstrate experimentally why generalisation fails in these cases. We then propose a new type of foveated scale channel architecture}, where the scale channels process increasingly larger parts of the image with decreasing resolution. This new type of scale channel network is shown to generalise extremely well, provided sufficient image resolution and the absence of boundary effects. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8, also when training on single scale training data, and do also give improved performance when learning from datasets with large scale variations in the small sample regime.  ( 3 min )
    A Cram\'er Distance perspective on Quantile Regression based Distributional Reinforcement Learning. (arXiv:2110.00535v2 [stat.ML] UPDATED)
    Distributional reinforcement learning (DRL) extends the value-based approach by approximating the full distribution over future returns instead of the mean only, providing a richer signal that leads to improved performances. Quantile Regression (QR) based methods like QR-DQN project arbitrary distributions into a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance. However, due to biases in the gradients, the quantile regression loss is used instead for training, guaranteeing the same minimizer and enjoying unbiased gradients. Non-crossing constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work is in the setting of fixed quantile levels and is twofold. First, we prove that the Cram\'er distance yields a projection that coincides with the 1-Wasserstein one and that, under non-crossing constraints, the squared Cram\'er and the quantile regression losses yield collinear gradients, shedding light on the connection between these important elements of DRL. Second, we propose a low complexity algorithm to compute the Cram\'er distance.  ( 2 min )
    Subtyping brain diseases from imaging data. (arXiv:2202.10945v1 [cs.LG])
    The imaging community has increasingly adopted machine learning (ML) methods to provide individualized imaging signatures related to disease diagnosis, prognosis, and response to treatment. Clinical neuroscience and cancer imaging have been two areas in which ML has offered particular promise. However, many neurologic and neuropsychiatric diseases, as well as cancer, are often heterogeneous in terms of their clinical manifestations, neuroanatomical patterns or genetic underpinnings. Therefore, in such cases, seeking a single disease signature might be ineffectual in delivering individualized precision diagnostics. The current chapter focuses on ML methods, especially semi-supervised clustering, that seek disease subtypes using imaging data. Work from Alzheimer Disease and its prodromal stages, psychosis, depression, autism, and brain cancer are discussed. Our goal is to provide the readers with a broad overview in terms of methodology and clinical applications.  ( 2 min )
    Stochastic Extragradient: General Analysis and Improved Rates. (arXiv:2111.08611v3 [math.OC] UPDATED)
    The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. However, several important questions regarding the convergence properties of SEG are still open, including the sampling of stochastic gradients, mini-batching, convergence guarantees for the monotone finite-sum variational inequalities with possibly non-monotone terms, and others. To address these questions, in this paper, we develop a novel theoretical framework that allows us to analyze several variants of SEG in a unified manner. Besides standard setups, like Same-Sample SEG under Lipschitzness and monotonicity or Independent-Samples SEG under uniformly bounded variance, our approach allows us to analyze variants of SEG that were never explicitly considered in the literature before. Notably, we analyze SEG with arbitrary sampling which includes importance sampling and various mini-batching strategies as special cases. Our rates for the new variants of SEG outperform the current state-of-the-art convergence guarantees and rely on less restrictive assumptions.  ( 2 min )
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v2 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.  ( 2 min )
    Towards Fine-grained Visual Representations by Combining Contrastive Learning with Image Reconstruction and Attention-weighted Pooling. (arXiv:2104.04323v2 [cs.CV] UPDATED)
    This paper presents Contrastive Reconstruction, ConRec - a self-supervised learning algorithm that obtains image representations by jointly optimizing a contrastive and a self-reconstruction loss. We showcase that state-of-the-art contrastive learning methods (e.g. SimCLR) have shortcomings to capture fine-grained visual features in their representations. ConRec extends the SimCLR framework by adding (1) a self-reconstruction task and (2) an attention mechanism within the contrastive learning task. This is accomplished by applying a simple encoder-decoder architecture with two heads. We show that both extensions contribute towards an improved vector representation for images with fine-grained visual features. Combining those concepts, ConRec outperforms SimCLR and SimCLR with Attention-Pooling on fine-grained classification datasets.  ( 2 min )
    Practical Schemes for Finding Near-Stationary Points of Convex Finite-Sums. (arXiv:2105.12062v2 [math.OC] UPDATED)
    In convex optimization, the problem of finding near-stationary points has not been adequately studied yet, unlike other optimality measures such as the function value. Even in the deterministic case, the optimal method (OGM-G, due to Kim and Fessler (2021)) has just been discovered recently. In this work, we conduct a systematic study of algorithmic techniques for finding near-stationary points of convex finite-sums. Our main contributions are several algorithmic discoveries: (1) we discover a memory-saving variant of OGM-G based on the performance estimation problem approach (Drori and Teboulle, 2014); (2) we design a new accelerated SVRG variant that can simultaneously achieve fast rates for minimizing both the gradient norm and function value; (3) we propose an adaptively regularized accelerated SVRG variant, which does not require the knowledge of some unknown initial constants and achieves near-optimal complexities. We put an emphasis on the simplicity and practicality of the new schemes, which could facilitate future work.  ( 2 min )
    Distilled Neural Networks for Efficient Learning to Rank. (arXiv:2202.10728v1 [cs.LG])
    Recent studies in Learning to Rank have shown the possibility to effectively distill a neural network from an ensemble of regression trees. This result leads neural networks to become a natural competitor of tree-based ensembles on the ranking task. Nevertheless, ensembles of regression trees outperform neural models both in terms of efficiency and effectiveness, particularly when scoring on CPU. In this paper, we propose an approach for speeding up neural scoring time by applying a combination of Distillation, Pruning and Fast Matrix multiplication. We employ knowledge distillation to learn shallow neural networks from an ensemble of regression trees. Then, we exploit an efficiency-oriented pruning technique that performs a sparsification of the most computationally-intensive layers of the neural network that is then scored with optimized sparse matrix multiplication. Moreover, by studying both dense and sparse high performance matrix multiplication, we develop a scoring time prediction model which helps in devising neural network architectures that match the desired efficiency requirements. Comprehensive experiments on two public learning-to-rank datasets show that neural networks produced with our novel approach are competitive at any point of the effectiveness-efficiency trade-off when compared with tree-based ensembles, providing up to 4x scoring time speed-up without affecting the ranking quality.  ( 2 min )
    RAILS: A Robust Adversarial Immune-inspired Learning System. (arXiv:2107.02840v2 [cs.NE] UPDATED)
    Adversarial attacks against deep neural networks (DNNs) are continuously evolving, requiring increasingly powerful defense strategies. We develop a novel adversarial defense framework inspired by the adaptive immune system: the Robust Adversarial Immune-inspired Learning System (RAILS). Initializing a population of exemplars that is balanced across classes, RAILS starts from a uniform label distribution that encourages diversity and uses an evolutionary optimization process to adaptively adjust the predictive label distribution in a manner that emulates the way the natural immune system recognizes novel pathogens. RAILS' evolutionary optimization process explicitly captures the tradeoff between robustness (diversity) and accuracy (specificity) of the network, and represents a new immune-inspired perspective on adversarial learning. The benefits of RAILS are empirically demonstrated under eight types of adversarial attacks on a DNN adversarial image classifier for several benchmark datasets, including: MNIST; SVHN; CIFAR-10; and CIFAR-10. We find that PGD is the most damaging attack strategy and that for this attack RAILS is significantly more robust than other methods, achieving improvements in adversarial robustness by $\geq 5.62\%, 12.5\%$, $10.32\%$, and $8.39\%$, on these respective datasets, without appreciable loss of classification accuracy. Codes for the results in this paper are available at https://github.com/wangren09/RAILS.  ( 2 min )
    Sobolev Transport: A Scalable Metric for Probability Measures with Graph Metrics. (arXiv:2202.10723v1 [cs.LG])
    Optimal transport (OT) is a popular measure to compare probability distributions. However, OT suffers a few drawbacks such as (i) a high complexity for computation, (ii) indefiniteness which limits its applicability to kernel machines. In this work, we consider probability measures supported on a graph metric space and propose a novel Sobolev transport metric. We show that the Sobolev transport metric yields a closed-form formula for fast computation and it is negative definite. We show that the space of probability measures endowed with this transport distance is isometric to a bounded convex set in a Euclidean space with a weighted $\ell_p$ distance. We further exploit the negative definiteness of the Sobolev transport to design positive-definite kernels, and evaluate their performances against other baselines in document classification with word embeddings and in topological data analysis.  ( 2 min )
    EIGNN: Efficient Infinite-Depth Graph Neural Networks. (arXiv:2202.10720v1 [cs.LG])
    Graph neural networks (GNNs) are widely used for modelling graph-structured data in numerous applications. However, with their inherently finite aggregation layers, existing GNN models may not be able to effectively capture long-range dependencies in the underlying graphs. Motivated by this limitation, we propose a GNN model with infinite depth, which we call Efficient Infinite-Depth Graph Neural Networks (EIGNN), to efficiently capture very long-range dependencies. We theoretically derive a closed-form solution of EIGNN which makes training an infinite-depth GNN model tractable. We then further show that we can achieve more efficient computation for training EIGNN by using eigendecomposition. The empirical results of comprehensive experiments on synthetic and real-world datasets show that EIGNN has a better ability to capture long-range dependencies than recent baselines, and consistently achieves state-of-the-art performance. Furthermore, we show that our model is also more robust against both noise and adversarial perturbations on node features.  ( 2 min )
    LOTR: Face Landmark Localization Using Localization Transformer. (arXiv:2109.10057v3 [cs.CV] UPDATED)
    This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach leveraging a Transformer network to better utilize the spatial information in the feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly predicts the landmark coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any post-processing steps. This paper also introduces the smooth-Wing loss function, which addresses the gradient discontinuity of the Wing loss, leading to better convergence than standard loss functions such as L1, L2, and Wing loss. Experimental results on the JD landmark dataset provided by the First Grand Challenge of 106-Point Facial Landmark Localization indicate the superiority of LOTR over the existing methods on the leaderboard and two recent heatmap-based approaches. On the WFLW dataset, the proposed LOTR framework demonstrates promising results compared with several state-of-the-art methods. Additionally, we report the improvement in state-of-the-art face recognition performance when using our proposed LOTRs for face alignment.  ( 2 min )
    Minimax Regret for Partial Monitoring: Infinite Outcomes and Rustichini's Regret. (arXiv:2202.10997v1 [math.OC])
    We show that a version of the generalised information ratio of Lattimore and Gyorgy (2020) determines the asymptotic minimax regret for all finite-action partial monitoring games provided that (a) the standard definition of regret is used but the latent space where the adversary plays is potentially infinite; or (b) the regret introduced by Rustichini (1999) is used and the latent space is finite. Our results are complemented by a number of examples. For any $p \in [1/2,1]$ there exists an infinite partial monitoring game for which the minimax regret over $n$ rounds is $n^p$ up to subpolynomial factors and there exist finite games for which the minimax Rustichini regret is $n^{4/7}$ up to subpolynomial factors.  ( 2 min )
    Graph Neural Network for Traffic Forecasting: A Survey. (arXiv:2101.11174v4 [cs.LG] UPDATED)
    Traffic forecasting is important for the success of intelligent transportation systems. Deep learning models, including convolution neural networks and recurrent neural networks, have been extensively applied in traffic forecasting problems to model spatial and temporal dependencies. In recent years, to model the graph structures in transportation systems as well as contextual information, graph neural networks have been introduced and have achieved state-of-the-art performance in a series of traffic forecasting problems. In this survey, we review the rapidly growing body of research using different graph neural networks, e.g. graph convolutional and graph attention networks, in various traffic forecasting problems, e.g. road traffic flow and speed forecasting, passenger flow forecasting in urban rail transit systems, and demand forecasting in ride-hailing platforms. We also present a comprehensive list of open data and source resources for each problem and identify future research directions. To the best of our knowledge, this paper is the first comprehensive survey that explores the application of graph neural networks for traffic forecasting problems. We have also created a public GitHub repository where the latest papers, open data, and source resources will be updated.  ( 2 min )
    Approximate gradient ascent methods for distortion risk measures. (arXiv:2202.11046v1 [cs.LG])
    We propose approximate gradient ascent algorithms for risk-sensitive reinforcement learning control problem in on-policy as well as off-policy settings. We consider episodic Markov decision processes, and model the risk using distortion risk measure (DRM) of the cumulative discounted reward. Our algorithms estimate the DRM using order statistics of the cumulative rewards, and calculate approximate gradients from the DRM estimates using a smoothed functional-based gradient estimation scheme. We derive non-asymptotic bounds that establish the convergence of our proposed algorithms to an approximate stationary point of the DRM objective.  ( 2 min )
    Adversarial Defense by Latent Style Transformations. (arXiv:2006.09701v2 [cs.CV] UPDATED)
    Machine learning models have demonstrated vulnerability to adversarial attacks, more specifically misclassification of adversarial examples. In this paper, we investigate an attack-agnostic defense against adversarial attacks on high-resolution images by detecting suspicious inputs. The intuition behind our approach is that the essential characteristics of a normal image are generally consistent with non-essential style transformations, e.g., slightly changing the facial expression of human portraits. In contrast, adversarial examples are generally sensitive to such transformations. In our approach to detect adversarial instances, we propose an in\underline{V}ertible \underline{A}utoencoder based on the \underline{S}tyleGAN2 generator via \underline{A}dversarial training (VASA) to inverse images to disentangled latent codes that reveal hierarchical styles. We then build a set of edited copies with non-essential style transformations by performing latent shifting and reconstruction, based on the correspondences between latent codes and style transformations. The classification-based consistency of these edited copies is used to distinguish adversarial instances.  ( 2 min )
    Pyxis: An Open-Source Performance Dataset of Sparse Accelerators. (arXiv:2110.04280v2 [cs.LG] UPDATED)
    Specialized accelerators provide gains of performance and efficiency in specific domains of applications. Sparse data structures or/and representations exist in a wide range of applications. However, it is challenging to design accelerators for sparse applications because no architecture or performance-level analytic models are able to fully capture the spectrum of the sparse data. Accelerator researchers rely on real execution to get precise feedback for their designs. In this work, we present PYXIS, a performance dataset for specialized accelerators on sparse data. PYXIS collects accelerator designs and real execution performance statistics. Currently, there are 73.8 K instances in PYXIS. PYXIS is open-source, and we are constantly growing PYXIS with new accelerator designs and performance statistics. PYXIS can benefit researchers in the fields of accelerator, architecture, performance, algorithm, and many related topics.
    Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning. (arXiv:2112.15110v2 [cs.SD] UPDATED)
    Could we automatically derive the score of a piano accompaniment based on the audio of a pop song? This is the audio-to-symbolic arrangement problem we tackle in this paper. A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition (so that the generation "sounds like" the audio and meanwhile maintains musicality). To this end, we contribute a cross-modal representation-learning model, which 1) extracts chord and melodic information from the audio, and 2) learns texture representation from both audio and a corrupted ground truth arrangement. We further introduce a tailored training strategy that gradually shifts the source of texture information from corrupted score to audio. In the end, the score-based texture posterior is reduced to a standard normal distribution, and only audio is needed for inference. Experiments show that our model captures major audio information and outperforms baselines in generation quality.
    GRecX: An Efficient and Unified Benchmark for GNN-based Recommendation. (arXiv:2111.10342v3 [cs.IR] UPDATED)
    In this paper, we present GRecX, an open-source TensorFlow framework for benchmarking GNN-based recommendation models in an efficient and unified way. GRecX consists of core libraries for building GNN-based recommendation benchmarks, as well as the implementations of popular GNN-based recommendation models. The core libraries provide essential components for building efficient and unified benchmarks, including FastMetrics (efficient metrics computation libraries), VectorSearch (efficient similarity search libraries for dense vectors), BatchEval (efficient mini-batch evaluation libraries), and DataManager (unified dataset management libraries). Especially, to provide a unified benchmark for the fair comparison of different complex GNN-based recommendation models, we design a new metric GRMF-X and integrate it into the FastMetrics component. Based on a TensorFlow GNN library tf_geometric, GRecX carefully implements a variety of popular GNN-based recommendation models. We carefully implement these baseline models to reproduce the performance reported in the literature, and our implementations are usually more efficient and friendly. In conclusion, GRecX enables uses to train and benchmark GNN-based recommendation baselines in an efficient and unified way. We conduct experiments with GRecX, and the experimental results show that GRecX allows us to train and benchmark GNN-based recommendation baselines in an efficient and unified way. The source code of GRecX is available at https://github.com/maenzhier/GRecX.
    Gaussian Imagination in Bandit Learning. (arXiv:2201.01902v3 [cs.LG] UPDATED)
    Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We study the performance of an agent that attains a bounded information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function when applied instead to a Bernoulli bandit. Relative to an information-theoretic bound on the Bayesian regret the agent would incur when interacting with the Gaussian bandit, we bound the increase in regret when the agent interacts with the Bernoulli bandit. If the Gaussian prior distribution and likelihood function are sufficiently diffuse, this increase grows at a rate which is at most linear in the square-root of the time horizon, and thus the per-timestep increase vanishes. Our results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
    PFGE: Parsimonious Fast Geometric Ensembling of DNNs. (arXiv:2202.06658v2 [cs.LG] UPDATED)
    Ensemble methods have been widely used to improve the performance of machine learning methods in terms of generalization and uncertainty calibration, while they struggle to use in deep learning systems, as training an ensemble of deep neural networks (DNNs) and then deploying them for online prediction incur an extremely higher computational overhead of model training and test-time predictions. Recently, several advanced techniques, such as fast geometric ensembling (FGE) and snapshot ensemble, have been proposed. These methods can train the model ensembles in the same time as a single model, thus getting around the hurdle of training time. However, their overhead of model recording and test-time computations remains much higher than their single model based counterparts. Here we propose a parsimonious FGE (PFGE) that employs a lightweight ensemble of higher-performing DNNs generated by several successively-performed procedures of stochastic weight averaging. Experimental results across different advanced DNN architectures on different datasets, namely CIFAR-$\{$10,100$\}$ and Imagenet, demonstrate its performance. Results show that, compared with state-of-the-art methods, PFGE achieves better generalization performance and satisfactory calibration capability, while the overhead of model recording and test-time predictions is significantly reduced.
    Don't Touch What Matters: Task-Aware Lipschitz Data Augmentation for Visual Reinforcement Learning. (arXiv:2202.09982v2 [cs.CV] UPDATED)
    One of the key challenges in visual Reinforcement Learning (RL) is to learn policies that can generalize to unseen environments. Recently, data augmentation techniques aiming at enhancing data diversity have demonstrated proven performance in improving the generalization ability of learned policies. However, due to the sensitivity of RL training, naively applying data augmentation, which transforms each pixel in a task-agnostic manner, may suffer from instability and damage the sample efficiency, thus further exacerbating the generalization performance. At the heart of this phenomenon is the diverged action distribution and high-variance value estimation in the face of augmented images. To alleviate this issue, we propose Task-aware Lipschitz Data Augmentation (TLDA) for visual RL, which explicitly identifies the task-correlated pixels with large Lipschitz constants, and only augments the task-irrelevant pixels. To verify the effectiveness of TLDA, we conduct extensive experiments on DeepMind Control suite, CARLA and DeepMind Manipulation tasks, showing that TLDA improves both sample efficiency in training time and generalization in test time. It outperforms previous state-of-the-art methods across the 3 different visual control benchmarks.
    Pseudo Numerical Methods for Diffusion Models on Manifolds. (arXiv:2202.09778v1 [cs.CV] CROSS LISTED)
    Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules. Our implementation is available at https://github.com/luping-liu/PNDM.
    Hyper Attention Recurrent Neural Network: Tackling Temporal Covariate Shift in Time Series Analysis. (arXiv:2202.10808v1 [cs.LG])
    Analyzing long time series with RNNs often suffers from infeasible training. Segmentation is therefore commonly used in data pre-processing. However, in non-stationary time series, there exists often distribution shift among different segments. RNN is easily swamped in the dilemma of fitting bias in these segments due to the lack of global information, leading to poor generalization, known as Temporal Covariate Shift (TCS) problem, which is only addressed by a recently proposed RNN-based model. One of the assumptions in TCS is that the distribution of all divided intervals under the same segment are identical. This assumption, however, may not be true on high-frequency time series, such as traffic flow, that also have large stochasticity. Besides, macro information across long periods isn't adequately considered in the latest RNN-based methods. To address the above issues, we propose Hyper Attention Recurrent Neural Network (HARNN) for the modeling of temporal patterns containing both micro and macro information. An HARNN consists of a meta layer for parameter generation and an attention-enabled main layer for inference. High-frequency segments are transformed into low-frequency segments and fed into the meta layers, while the first main layer consumes the same high-frequency segments as conventional methods. In this way, each low-frequency segment in the meta inputs generates a unique main layer, enabling the integration of both macro information and micro information for inference. This forces all main layers to predict the same target which fully harnesses the common knowledge in varied distributions when capturing temporal patterns. Evaluations on multiple benchmarks demonstrated that our model outperforms a couple of RNN-based methods on a federation of key metrics.  ( 2 min )
    Differentially Private Federated Learning on Heterogeneous Data. (arXiv:2111.09278v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a paradigm for large-scale distributed learning which faces two key challenges: (i) efficient training from highly heterogeneous user data, and (ii) protecting the privacy of participating users. In this work, we propose a novel FL approach (DP-SCAFFOLD) to tackle these two challenges together by incorporating Differential Privacy (DP) constraints into the popular SCAFFOLD algorithm. We focus on the challenging setting where users communicate with a "honest-but-curious" server without any trusted intermediary, which requires to ensure privacy not only towards a third-party with access to the final model but also towards the server who observes all user communications. Using advanced results from DP theory, we establish the convergence of our algorithm for convex and non-convex objectives. Our analysis clearly highlights the privacy-utility trade-off under data heterogeneity, and demonstrates the superiority of DP-SCAFFOLD over the state-of-the-art algorithm DP-FedAvg when the number of local updates and the level of heterogeneity grow. Our numerical results confirm our analysis and show that DP-SCAFFOLD provides significant gains in practice.
    How I Learned to Stop Worrying and Love Retraining. (arXiv:2111.00843v2 [cs.LG] UPDATED)
    Many Neural Network Pruning approaches consist of several iterative training and pruning steps, seemingly losing a significant amount of their performance after pruning and then recovering it in the subsequent retraining phase. Recent works of Renda et al. (2020) and Le & Hua (2021) demonstrate the significance of the learning rate schedule during the retraining phase and propose specific heuristics for choosing such a schedule for IMP (Han et al., 2015). We place these findings in the context of the results of Li et al. (2020) regarding the training of models within a fixed training budget and demonstrate that, consequently, the retraining phase can be massively shortened using a simple linear learning rate schedule. Going a step further, we propose similarly imposing a budget on the initial dense training phase and show that the resulting simple and efficient method is capable of outperforming significantly more complex or heavily parameterized state-of-the-art approaches that attempt to sparsify the network during training. These findings not only advance our understanding of the retraining phase, but more broadly question the belief that one should aim to avoid the need for retraining and reduce the negative effects of 'hard' pruning by incorporating the sparsification process into the standard training.
    Generating Synthetic Mobility Networks with Generative Adversarial Networks. (arXiv:2202.11028v1 [cs.LG])
    The increasingly crucial role of human displacements in complex societal phenomena, such as traffic congestion, segregation, and the diffusion of epidemics, is attracting the interest of scientists from several disciplines. In this article, we address mobility network generation, i.e., generating a city's entire mobility network, a weighted directed graph in which nodes are geographic locations and weighted edges represent people's movements between those locations, thus describing the entire mobility set flows within a city. Our solution is MoGAN, a model based on Generative Adversarial Networks (GANs) to generate realistic mobility networks. We conduct extensive experiments on public datasets of bike and taxi rides to show that MoGAN outperforms the classical Gravity and Radiation models regarding the realism of the generated networks. Our model can be used for data augmentation and performing simulations and what-if analysis.  ( 2 min )
    Choquet-Based Fuzzy Rough Sets. (arXiv:2202.10872v1 [cs.LG])
    Fuzzy rough set theory can be used as a tool for dealing with inconsistent data when there is a gradual notion of indiscernibility between objects. It does this by providing lower and upper approximations of concepts. In classical fuzzy rough sets, the lower and upper approximations are determined using the minimum and maximum operators, respectively. This is undesirable for machine learning applications, since it makes these approximations sensitive to outlying samples. To mitigate this problem, ordered weighted average (OWA) based fuzzy rough sets were introduced. In this paper, we show how the OWA-based approach can be interpreted intuitively in terms of vague quantification, and then generalize it to Choquet-based fuzzy rough sets (CFRS). This generalization maintains desirable theoretical properties, such as duality and monotonicity. Furthermore, it provides more flexibility for machine learning applications. In particular, we show that it enables the seamless integration of outlier detection algorithms, to enhance the robustness of machine learning algorithms based on fuzzy rough sets.  ( 2 min )
    Wavebender GAN: An architecture for phonetically meaningful speech manipulation. (arXiv:2202.10973v1 [eess.AS])
    Deep learning has revolutionised synthetic speech quality. However, it has thus far delivered little value to the speech science community. The new methods do not meet the controllability demands that practitioners in this area require e.g.: in listening tests with manipulated speech stimuli. Instead, control of different speech properties in such stimuli is achieved by using legacy signal-processing methods. This limits the range, accuracy, and speech quality of the manipulations. Also, audible artefacts have a negative impact on the methodological validity of results in speech perception studies. This work introduces a system capable of manipulating speech properties through learning rather than design. The architecture learns to control arbitrary speech properties and leverages progress in neural vocoders to obtain realistic output. Experiments with copy synthesis and manipulation of a small set of core speech features (pitch, formants, and voice quality measures) illustrate the promise of the approach for producing speech stimuli that have accurate control and high perceptual quality.  ( 2 min )
    Enabling Reproducibility and Meta-learning Through a Lifelong Database of Experiments (LDE). (arXiv:2202.10979v1 [cs.LG])
    Artificial Intelligence (AI) development is inherently iterative and experimental. Over the course of normal development, especially with the advent of automated AI, hundreds or thousands of experiments are generated and are often lost or never examined again. There is a lost opportunity to document these experiments and learn from them at scale, but the complexity of tracking and reproducing these experiments is often prohibitive to data scientists. We present the Lifelong Database of Experiments (LDE) that automatically extracts and stores linked metadata from experiment artifacts and provides features to reproduce these artifacts and perform meta-learning across them. We store context from multiple stages of the AI development lifecycle including datasets, pipelines, how each is configured, and training runs with information about their runtime environment. The standardized nature of the stored metadata allows for querying and aggregation, especially in terms of ranking artifacts by performance metrics. We exhibit the capabilities of the LDE by reproducing an existing meta-learning study and storing the reproduced metadata in our system. Then, we perform two experiments on this metadata: 1) examining the reproducibility and variability of the performance metrics and 2) implementing a number of meta-learning algorithms on top of the data and examining how variability in experimental results impacts recommendation performance. The experimental results suggest significant variation in performance, especially depending on dataset configurations; this variation carries over when meta-learning is built on top of the results, with performance improving when using aggregated results. This suggests that a system that automatically collects and aggregates results such as the LDE not only assists in implementing meta-learning but may also improve its performance.  ( 2 min )
    Adaptive Cut Selection in Mixed-Integer Linear Programming. (arXiv:2202.10962v1 [math.OC])
    Cut selection is a subroutine used in all modern mixed-integer linear programming solvers with the goal of selecting a subset of generated cuts that induce optimal solver performance. These solvers have millions of parameter combinations, and so are excellent candidates for parameter tuning. Cut selection scoring rules are usually weighted sums of different measurements, where the weights are parameters. We present a parametric family of mixed-integer linear programs together with infinitely many family-wide valid cuts. Some of these cuts can induce integer optimal solutions directly after being applied, while others fail to do so even if an infinite amount are applied. We show for a specific cut selection rule, that any finite grid search of the parameter space will always miss all parameter values, which select integer optimal inducing cuts in an infinite amount of our problems. We propose a variation on the design of existing graph convolutional neural networks, adapting them to learn cut selection rule parameters. We present a reinforcement learning framework for selecting cuts, and train our design using said framework over MIPLIB 2017. Our framework and design show that adaptive cut selection does substantially improve performance over a diverse set of instances, but that finding a single function describing such a rule is difficult. Code for reproducing all experiments is available at https://github.com/Opt-Mucca/Adaptive-Cutsel-MILP.  ( 2 min )
    The Winning Solution to the iFLYTEK Challenge 2021 Cultivated Land Extraction from High-Resolution Remote Sensing Image. (arXiv:2202.10974v1 [cs.CV])
    Extracting cultivated land accurately from high-resolution remote images is a basic task for precision agriculture. This report introduces our solution to the iFLYTEK challenge 2021 cultivated land extraction from high-resolution remote sensing image. The challenge requires segmenting cultivated land objects in very high-resolution multispectral remote sensing images. We established a highly effective and efficient pipeline to solve this problem. We first divided the original images into small tiles and separately performed instance segmentation on each tile. We explored several instance segmentation algorithms that work well on natural images and developed a set of effective methods that are applicable to remote sensing images. Then we merged the prediction results of all small tiles into seamless, continuous segmentation results through our proposed overlap-tile fusion strategy. We achieved the first place among 486 teams in the challenge.  ( 2 min )
    How and When Random Feedback Works: A Case Study of Low-Rank Matrix Factorization. (arXiv:2111.08706v2 [cs.NE] UPDATED)
    The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure -- low-rank matrix factorization. In this problem, given a matrix $Y_{n\times m}$, the goal is to find a low rank factorization $Z_{n \times r}W_{r \times m}$ that minimizes the error $\|ZW-Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r\ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their error $\|ZW-Y\|_F$ is approximately equal. As a corollary, these results also hold for training two-layer linear neural networks when the training input is isotropic, and the output is a linear function of the input.
    Simple, Fast, and Flexible Framework for Matrix Completion with Infinite Width Neural Networks. (arXiv:2108.00131v2 [cs.LG] UPDATED)
    Matrix completion problems arise in many applications including recommendation systems, computer vision, and genomics. Increasingly larger neural networks have been successful in many of these applications, but at considerable computational costs. Remarkably, taking the width of a neural network to infinity allows for improved computational performance. In this work, we develop an infinite width neural network framework for matrix completion that is simple, fast, and flexible. Simplicity and speed come from the connection between the infinite width limit of neural networks and kernels known as neural tangent kernels (NTK). In particular, we derive the NTK for fully connected and convolutional neural networks for matrix completion. The flexibility stems from a feature prior, which allows encoding relationships between coordinates of the target matrix, akin to semi-supervised learning. The effectiveness of our framework is demonstrated through competitive results for virtual drug screening and image inpainting/reconstruction. We also provide an implementation in Python to make our framework accessible on standard hardware to a broad audience.
    Multiple Importance Sampling ELBO and Deep Ensembles of Variational Approximations. (arXiv:2202.10951v1 [cs.LG])
    In variational inference (VI), the marginal log-likelihood is estimated using the standard evidence lower bound (ELBO), or improved versions as the importance weighted ELBO (IWELBO). We propose the multiple importance sampling ELBO (MISELBO), a \textit{versatile} yet \textit{simple} framework. MISELBO is applicable in both amortized and classical VI, and it uses ensembles, e.g., deep ensembles, of independently inferred variational approximations. As far as we are aware, the concept of deep ensembles in amortized VI has not previously been established. We prove that MISELBO provides a tighter bound than the average of standard ELBOs, and demonstrate empirically that it gives tighter bounds than the average of IWELBOs. MISELBO is evaluated in density-estimation experiments that include MNIST and several real-data phylogenetic tree inference problems. First, on the MNIST dataset, MISELBO boosts the density-estimation performances of a state-of-the-art model, nouveau VAE. Second, in the phylogenetic tree inference setting, our framework enhances a state-of-the-art VI algorithm that uses normalizing flows. On top of the technical benefits of MISELBO, it allows to unveil connections between VI and recent advances in the importance sampling literature, paving the way for further methodological advances. We provide our code at \url{https://github.com/Lagergren-Lab/MISELBO}.  ( 2 min )
    Disentangling causal effects for hierarchical reinforcement learning. (arXiv:2010.01351v2 [cs.AI] UPDATED)
    Exploration and credit assignment under sparse rewards are still challenging problems. We argue that these challenges arise in part due to the intrinsic rigidity of operating at the level of actions. Actions can precisely define how to perform an activity but are ill-suited to describe what activity to perform. Instead, causal effects are inherently composable and temporally abstract, making them ideal for descriptive tasks. By leveraging a hierarchy of causal effects, this study aims to expedite the learning of task-specific behavior and aid exploration. Borrowing counterfactual and normality measures from causal literature, we disentangle controllable effects from effects caused by other dynamics of the environment. We propose CEHRL, a hierarchical method that models the distribution of controllable effects using a Variational Autoencoder. This distribution is used by a high-level policy to 1) explore the environment via random effect exploration so that novel effects are continuously discovered and learned, and to 2) learn task-specific behavior by prioritizing the effects that maximize a given reward function. In comparison to exploring with random actions, experimental results show that random effect exploration is a more efficient mechanism and that by assigning credit to few effects rather than many actions, CEHRL learns tasks more rapidly.
    SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions. (arXiv:2202.10451v1 [cs.LG])
    Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML), by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques, generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work we propose an AutoML technique SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy realized as a three-stage program synthesis approach, that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to constitute a pipeline. In the second stage, this is then refined into a small pool of viable concrete pipelines using syntactic constraints derived from the corpus and the machine-learned model. Dynamically evaluating these few pipelines, in the third stage, provides the best solution. We instantiate SapientML as part of a fully automated tool-chain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and uses the learned models to then synthesize pipelines for new predictive tasks. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, and against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks while the second best tool fails to even produce a pipeline on 9 of the instances.
    VU-BERT: A Unified framework for Visual Dialog. (arXiv:2202.10787v1 [cs.CL])
    The visual dialog task attempts to train an agent to answer multi-turn questions given an image, which requires the deep understanding of interactions between the image and dialog history. Existing researches tend to employ the modality-specific modules to model the interactions, which might be troublesome to use. To fill in this gap, we propose a unified framework for image-text joint embedding, named VU-BERT, and apply patch projection to obtain vision embedding firstly in visual dialog tasks to simplify the model. The model is trained over two tasks: masked language modeling and next utterance retrieval. These tasks help in learning visual concepts, utterances dependence, and the relationships between these two modalities. Finally, our VU-BERT achieves competitive performance (0.7287 NDCG scores) on VisDial v1.0 Datasets.
    Submodlib: A Submodular Optimization Library. (arXiv:2202.10680v1 [cs.LG])
    Submodular functions are a special class of set functions which naturally model the notion of representativeness, diversity, coverage etc. and have been shown to be computationally very efficient. A lot of past work has applied submodular optimization to find optimal subsets in various contexts. Some examples include data summarization for efficient human consumption, finding effective smaller subsets of training data to reduce the model development time (training, hyper parameter tuning), finding effective subsets of unlabeled data to reduce the labeling costs, etc. A recent work has also leveraged submodular functions to propose submodular information measures which have been found to be very useful in solving the problems of guided subset selection and guided summarization. In this work, we present Submodlib which is an open-source, easy-to-use, efficient and scalable Python library for submodular optimization with a C++ optimization engine. Submodlib finds its application in summarization, data subset selection, hyper parameter tuning, efficient training and more. Through a rich API, it offers a great deal of flexibility in the way it can be used.
    Quantum Differential Privacy: An Information Theory Perspective. (arXiv:2202.10717v1 [quant-ph])
    Differential privacy has been an exceptionally successful concept when it comes to providing provable security guarantees for classical computations. More recently, the concept was generalized to quantum computations. While classical computations are essentially noiseless and differential privacy is often achieved by artificially adding noise, near-term quantum computers are inherently noisy and it was observed that this leads to natural differential privacy as a feature. In this work we discuss quantum differential privacy in an information theoretic framework by casting it as a quantum divergence. A main advantage of this approach is that differential privacy becomes a property solely based on the output states of the computation, without the need to check it for every measurement. This leads to simpler proofs and generalized statements of its properties as well as several new bounds for both, general and specific, noise models. In particular, these include common representations of quantum circuits and quantum machine learning concepts. Here, we focus on the difference in the amount of noise required to achieve certain levels of differential privacy versus the amount that would make any computation useless. Finally, we also generalize the classical concepts of local differential privacy, R\'enyi differential privacy and the hypothesis testing interpretation to the quantum setting, providing several new properties and insights.
    Classical versus Quantum: comparing Tensor Network-based Quantum Circuits on LHC data. (arXiv:2202.10471v1 [quant-ph])
    Tensor Networks (TN) are approximations of high-dimensional tensors designed to represent locally entangled quantum many-body systems efficiently. This study provides a comprehensive comparison between classical TNs and TN-inspired quantum circuits in the context of Machine Learning on highly complex, simulated LHC data. We show that classical TNs require exponentially large bond dimensions and higher Hilbert-space mapping to perform comparably to their quantum counterparts. While such an expansion in the dimensionality allows better performance, we observe that, with increased dimensionality, classical TNs lead to a highly flat loss landscape, rendering the usage of gradient-based optimization methods highly challenging. Furthermore, by employing quantitative metrics, such as the Fisher information and effective dimensions, we show that classical TNs require a more extensive training sample to represent the data as efficiently as TN-inspired quantum circuits. We also engage with the idea of hybrid classical-quantum TNs and show possible architectures to employ a larger phase-space from the data. We offer our results using three main TN ansatz: Tree Tensor Networks, Matrix Product States, and Multi-scale Entanglement Renormalisation Ansatz.
    Hierarchical Deep Generative Models for Design Under Free-Form Geometric Uncertainty. (arXiv:2202.10558v1 [cs.CE])
    Deep generative models have demonstrated effectiveness in learning compact and expressive design representations that significantly improve geometric design optimization. However, these models do not consider the uncertainty introduced by manufacturing or fabrication. Past work that quantifies such uncertainty often makes simplifying assumptions on geometric variations, while the "real-world", "free-form" uncertainty and its impact on design performance are difficult to quantify due to the high dimensionality. To address this issue, we propose a Generative Adversarial Network-based Design under Uncertainty Framework (GAN-DUF), which contains a deep generative model that simultaneously learns a compact representation of nominal (ideal) designs and the conditional distribution of fabricated designs given any nominal design. This opens up new possibilities of 1)~building a universal uncertainty quantification model compatible with both shape and topological designs, 2)~modeling free-form geometric uncertainties without the need to make any assumptions on the distribution of geometric variability, and 3)~allowing fast prediction of uncertainties for new nominal designs. We can combine the proposed deep generative model with robust design optimization or reliability-based design optimization for design under uncertainty. We demonstrated the framework on two real-world engineering design examples and showed its capability of finding the solution that possesses better performances after fabrication.
    A Globally Convergent Evolutionary Strategy for Stochastic Constrained Optimization with Applications to Reinforcement Learning. (arXiv:2202.10464v1 [cs.NE])
    Evolutionary strategies have recently been shown to achieve competing levels of performance for complex optimization problems in reinforcement learning. In such problems, one often needs to optimize an objective function subject to a set of constraints, including for instance constraints on the entropy of a policy or to restrict the possible set of actions or states accessible to an agent. Convergence guarantees for evolutionary strategies to optimize stochastic constrained problems are however lacking in the literature. In this work, we address this problem by designing a novel optimization algorithm with a sufficient decrease mechanism that ensures convergence and that is based only on estimates of the functions. We demonstrate the applicability of this algorithm on two types of experiments: i) a control task for maximizing rewards and ii) maximizing rewards subject to a non-relaxable set of constraints.
    EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data Reshaping for Online Adaptation or Personalization. (arXiv:2202.10935v1 [cs.LG])
    Conventionally, DNN models are trained once in the cloud and deployed in edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. However, there are many cases that require the models to adapt to new environments, domains, or new users. In order to realize such domain adaption or personalization, the models on devices need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel that can achieve end-to-end training on resource-limited low-power edge-level FPGAs. It is challenging to implement on-device training on resource-limited FPGAs due to the low efficiency caused by different memory access patterns among forward, backward propagation, and weight update. Therefore, we developed a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model is established to automatically schedule computation and memory resources to achieve high energy efficiency on edge FPGAs. The experimental results show that our design achieves 46.99 GFLOPS and 6.09GFLOPS/W in terms of throughput and energy efficiency, respectively.
    Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks. (arXiv:2202.10571v1 [cs.CV])
    In the deep learning era, long video generation of high-quality still remains challenging due to the spatio-temporal complexity and continuity of videos. Existing prior works have attempted to model video distribution by representing videos as 3D grids of RGB values, which impedes the scale of generated videos and neglects continuous dynamics. In this paper, we found that the recent emerging paradigm of implicit neural representations (INRs) that encodes a continuous signal into a parameterized neural network effectively mitigates the issue. By utilizing INRs of video, we propose dynamics-aware implicit generative adversarial network (DIGAN), a novel generative adversarial network for video generation. Specifically, we introduce (a) an INR-based video generator that improves the motion dynamics by manipulating the space and time coordinates differently and (b) a motion discriminator that efficiently identifies the unnatural motions without observing the entire long frame sequences. We demonstrate the superiority of DIGAN under various datasets, along with multiple intriguing properties, e.g., long video synthesis, video extrapolation, and non-autoregressive video generation. For example, DIGAN improves the previous state-of-the-art FVD score on UCF-101 by 30.7% and can be trained on 128 frame videos of 128x128 resolution, 80 frames longer than the 48 frames of the previous state-of-the-art method.
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v2 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming algorithm that uses a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian, and performs stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the "liminf" of the KKT residuals converges to zero almost surely. Our algorithm and analysis further develop the work of Na et al. (2021) by allowing nonlinear inequality constraints without requiring the strict complementary condition. We demonstrate the performance of the algorithm on a subset of nonlinear problems collected in CUTEst test set.
    A Novel Convergence Analysis for Algorithms of the Adam Family and Beyond. (arXiv:2104.14840v4 [math.OC] UPDATED)
    Why does the original analysis of Adam fail, but it still converges very well in practice on a broad range of problems? There are still some mysteries about Adam that have not been unraveled. This paper provides a novel non-convex analysis of Adam and its many variants to uncover some of these mysteries. Our analysis exhibits that an increasing or large enough "momentum" parameter for the first-order moment used in practice is sufficient to ensure Adam and its many variants converge under a mild boundness condition on the adaptive scaling factor of the step size. In contrast, the original problematic analysis of Adam uses a momentum parameter that decreases to zero, which is the key reason that makes it diverge on some problems. To the best of our knowledge, this is the first time the gap between analysis and practice is bridged. Our analysis also exhibits more insights for practical implementations of Adam, e.g., increasing the momentum parameter in a stage-wise manner in accordance with stagewise decreasing step size would help improve the convergence. Our analysis of the Adam family is modular such that it can be (has been) extended to solving other optimization problems, e.g., compositional, min-max and bi-level problems. As an interesting yet non-trivial use case, we present an extension for solving non-convex min-max optimization in order to address a gap in the literature that either requires a large batch or has double loops. Our empirical studies corroborate the theory and also demonstrate the effectiveness in solving min-max problems.
    TENT: Tensorized Encoder Transformer for Temperature Forecasting. (arXiv:2106.14742v2 [cs.LG] UPDATED)
    Reliable weather forecasting is of great importance in science, business, and society. The best performing data-driven models for weather prediction tasks rely on recurrent or convolutional neural networks, where some of which incorporate attention mechanisms. In this work, we introduce a novel model based on Transformer architecture for weather forecasting. The proposed Tensorial Encoder Transformer (TENT) model is equipped with tensorial attention and thus it exploits the spatiotemporal structure of weather data by processing it in multidimensional tensorial format. We show that compared to the classical encoder transformer, 3D convolutional neural networks, LSTM, and Convolutional LSTM, the proposed TENT model can better learn the underlying complex pattern of the weather data for the studied temperature prediction task. Experiments on two real-life weather datasets are performed. The datasets consist of historical measurements from weather stations in the USA, Canada and Europe. The first dataset contains hourly measurements of weather attributes for 30 cities in the USA and Canada from October 2012 to November 2017. The second dataset contains daily measurements of weather attributes of 18 cities across Europe from May 2005 to April 2020. Two attention scores are introduced based on the obtained tonsorial attention and are visualized in order to shed light on the decision-making process of our model and provide insight knowledge on the most important cities for the target cities.
    CoBERL: Contrastive BERT for Reinforcement Learning. (arXiv:2107.05431v2 [cs.LG] UPDATED)
    Many reinforcement learning (RL) agents require a large amount of experience to solve tasks. We propose Contrastive BERT for RL (CoBERL), an agent that combines a new contrastive loss and a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency. CoBERL enables efficient, robust learning from pixels across a wide range of domains. We use bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need of hand engineered data augmentations. We find that CoBERL consistently improves performance across the full Atari suite, a set of control tasks and a challenging 3D environment.
    Local SGD Optimizes Overparameterized Neural Networks in Polynomial Time. (arXiv:2107.10868v2 [cs.LG] UPDATED)
    In this paper we prove that Local (S)GD (or FedAvg) can optimize deep neural networks with Rectified Linear Unit (ReLU) activation function in polynomial time. Despite the established convergence theory of Local SGD on optimizing general smooth functions in communication-efficient distributed optimization, its convergence on non-smooth ReLU networks still eludes full theoretical understanding. The key property used in many Local SGD analysis on smooth function is gradient Lipschitzness, so that the gradient on local models will not drift far away from that on averaged model. However, this decent property does not hold in networks with non-smooth ReLU activation function. We show that, even though ReLU network does not admit gradient Lipschitzness property, the difference between gradients on local models and average model will not change too much, under the dynamics of Local SGD. We validate our theoretical results via extensive experiments. This work is the first to show the convergence of Local SGD on non-smooth functions, and will shed lights on the optimization theory of federated training of deep neural networks.
    Adversarial Tracking Control via Strongly Adaptive Online Learning with Memory. (arXiv:2102.01623v3 [cs.LG] UPDATED)
    We consider the problem of tracking an adversarial state sequence in a linear dynamical system subject to adversarial disturbances and loss functions, generalizing earlier settings in the literature. To this end, we develop three techniques, each of independent interest. First, we propose a comparator-adaptive algorithm for online linear optimization with movement cost. Without tuning, it nearly matches the performance of the optimally tuned gradient descent in hindsight. Next, considering a related problem called online learning with memory, we construct a novel strongly adaptive algorithm that uses our first contribution as a building block. Finally, we present the first reduction from adversarial tracking control to strongly adaptive online learning with memory. Summarizing these individual techniques, we obtain an adversarial tracking controller with a strong performance guarantee even when the reference trajectory has a large range of movement.
    Continuous Treatment Recommendation with Deep Survival Dose Response Function. (arXiv:2108.10453v2 [stat.ML] UPDATED)
    We propose a general formulation for continuous treatment recommendation problems in settings with clinical survival data, which we call the Deep Survival Dose Response Function (DeepSDRF). That is, we consider the problem of learning the conditional average dose response (CADR) function solely from historical data in which unobserved factors (confounders) affect both observed treatment and time-to-event outcomes. The estimated treatment effect from DeepSDRF enables us to develop recommender algorithms with explanatory insights. We compared two recommender approaches based on random search and reinforcement learning and found similar performance in terms of patient outcome. We tested the DeepSDRF and the corresponding recommender on extensive simulation studies and two empirical databases: 1) the Clinical Practice Research Datalink (CPRD) and 2) the eICU Research Institute (eRI) database. To the best of our knowledge, this is the first time that confounders are taken into consideration for addressing the continuous treatment effect with observational data in a medical context.
    Strong replica symmetry for high-dimensional disordered log-concave Gibbs measures. (arXiv:2009.12939v3 [math.PR] UPDATED)
    We consider a generic class of log-concave, possibly random, (Gibbs) measures. We prove the concentration of an infinite family of order parameters called multioverlaps. Because they completely parametrise the quenched Gibbs measure of the system, this implies a simple representation of the asymptotic Gibbs measures, as well as the decoupling of the variables in a strong sense. These results may prove themselves useful in several contexts. In particular in machine learning and high-dimensional inference, log-concave measures appear in convex empirical risk minimisation, maximum a-posteriori inference or M-estimation. We believe that they may be applicable in establishing some type of "replica symmetric formulas" for the free energy, inference or generalisation error in such settings.
    Computing Multiple Image Reconstructions with a Single Hypernetwork. (arXiv:2202.11009v1 [cs.CV])
    Deep learning based techniques achieve state-of-the-art results in a wide range of image reconstruction tasks like compressed sensing. These methods almost always have hyperparameters, such as the weight coefficients that balance the different terms in the optimized loss function. The typical approach is to train the model for a hyperparameter setting determined with some empirical or theoretical justification. Thus, at inference time, the model can only compute reconstructions corresponding to the pre-determined hyperparameter values. In this work, we present a hypernetwork based approach, called HyperRecon, to train reconstruction models that are agnostic to hyperparameter settings. At inference time, HyperRecon can efficiently produce diverse reconstructions, which would each correspond to different hyperparameter values. In this framework, the user is empowered to select the most useful output(s) based on their own judgement. We demonstrate our method in compressed sensing, super-resolution and denoising tasks, using two large-scale and publicly-available MRI datasets. Our code is available at https://github.com/alanqrwang/hyperrecon.
    Counterfactual Phenotyping with Censored Time-to-Events. (arXiv:2202.11089v1 [cs.LG])
    Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Causal reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and helps determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.
    ReorientBot: Learning Object Reorientation for Specific-Posed Placement. (arXiv:2202.11092v1 [cs.RO])
    Robots need the capability of placing objects in arbitrary, specific poses to rearrange the world and achieve various valuable tasks. Object reorientation plays a crucial role in this as objects may not initially be oriented such that the robot can grasp and then immediately place them in a specific goal pose. In this work, we present a vision-based manipulation system, ReorientBot, which consists of 1) visual scene understanding with pose estimation and volumetric reconstruction using an onboard RGB-D camera; 2) learned waypoint selection for successful and efficient motion generation for reorientation; 3) traditional motion planning to generate a collision-free trajectory from the selected waypoints. We evaluate our method using the YCB objects in both simulation and the real world, achieving 93% overall success, 81% improvement in success rate, and 22% improvement in execution time compared to a heuristic approach. We demonstrate extended multi-object rearrangement showing the general capability of the system.
    MSTGD:A Memory Stochastic sTratified Gradient Descent Method with an Exponential Convergence Rate. (arXiv:2202.10923v1 [stat.ML])
    The fluctuation effect of gradient expectation and variance caused by parameter update between consecutive iterations is neglected or confusing by current mainstream gradient optimization algorithms.Using this fluctuation effect, combined with the stratified sampling strategy, this paper designs a novel \underline{M}emory \underline{S}tochastic s\underline{T}ratified Gradient Descend(\underline{MST}GD) algorithm with an exponential convergence rate. Specifically, MSTGD uses two strategies for variance reduction: the first strategy is to perform variance reduction according to the proportion p of used historical gradient, which is estimated from the mean and variance of sample gradients before and after iteration, and the other strategy is stratified sampling by category. The statistic \ $\bar{G}_{mst}$\ designed under these two strategies can be adaptively unbiased, and its variance decays at a geometric rate. This enables MSTGD based on $\bar{G}_{mst}$ to obtain an exponential convergence rate of the form $\lambda^{2(k-k_0)}$($\lambda\in (0,1)$,k is the number of iteration steps,$\lambda$ is a variable related to proportion p).Unlike most other algorithms that claim to achieve an exponential convergence rate, the convergence rate is independent of parameters such as dataset size N, batch size n, etc., and can be achieved at a constant step size.Theoretical and experimental results show the effectiveness of MSTGD
    Letters of the Alphabet: Discovering Natural Feature Sets. (arXiv:2202.10934v1 [cs.LG])
    Deep learning networks find intricate features in large datasets using the backpropagation algorithm. This algorithm repeatedly adjusts the network connections.' weights and examining the "hidden" nodes behavior between the input and output layer provides better insight into how neural networks create feature representations. Experiments built on each other show that activity differences computed within a layer can guide learning. A simple neural network is used, which includes a data set comprised of the alphabet letters, where each letter forms 81 input nodes comprised of 0 and 1s and a single hidden layer and an output layer. The first experiment explains how the hidden layers in this simple neural network represent the input data's features. The second experiment attempts to reverse-engineer the neural network to find the alphabet's natural feature sets. As the network interprets features, we can understand how it derives the natural feature sets for a given data. This understanding is essential to delve deeper into deep generative models, such as Boltzmann machines. Deep generative models are a class of unsupervised deep learning algorithms. The primary function of deep generative models is to find the natural feature sets for a given data set.
    Convex Loss Functions for Contextual Pricing with Observational Posted-Price Data. (arXiv:2202.10944v1 [cs.LG])
    We study an off-policy contextual pricing problem where the seller has access to samples of prices which customers were previously offered, whether they purchased at that price, and auxiliary features describing the customer and/or item being sold. This is in contrast to the well-studied setting in which samples of the customer's valuation (willingness to pay) are observed. In our setting, the observed data is influenced by the historic pricing policy, and we do not know how customers would have responded to alternative prices. We introduce suitable loss functions for this pricing setting which can be directly optimized to find an effective pricing policy with expected revenue guarantees without the need for estimation of an intermediate demand function. We focus on convex loss functions. This is particularly relevant when linear pricing policies are desired for interpretability reasons, resulting in a tractable convex revenue optimization problem. We further propose generalized hinge and quantile pricing loss functions, which price at a multiplicative factor of the conditional expected value or a particular quantile of the valuation distribution when optimized, despite the valuation data not being observed. We prove expected revenue bounds for these pricing policies respectively when the valuation distribution is log-concave, and provide generalization bounds for the finite sample case. Finally, we conduct simulations on both synthetic and real-world data to demonstrate that this approach is competitive with, and in some settings outperforms, state-of-the-art methods in contextual pricing.
    A framework for spatial heat risk assessment using a generalized similarity measure. (arXiv:2202.10963v1 [cs.LG])
    In this study, we develop a novel framework to assess health risks due to heat hazards across various localities (zip codes) across the state of Maryland with the help of two commonly used indicators i.e. exposure and vulnerability. Our approach quantifies each of the two aforementioned indicators by developing their corresponding feature vectors and subsequently computes indicator-specific reference vectors that signify a high risk environment by clustering the data points at the tail-end of an empirical risk spectrum. The proposed framework circumvents the information-theoretic entropy based aggregation methods whose usage varies with different views of entropy that are subjective in nature and more importantly generalizes the notion of risk-valuation using cosine similarity with unknown reference points.
    StickyLand: Breaking the Linear Presentation of Computational Notebooks. (arXiv:2202.11086v1 [cs.HC])
    How can we better organize code in computational notebooks? Notebooks have become a popular tool among data scientists, as they seamlessly weave text and code together, supporting users to rapidly iterate and document code experiments. However, it is often challenging to organize code in notebooks, partially because there is a mismatch between the linear presentation of code and the non-linear process of exploratory data analysis. We present StickyLand, a notebook extension for empowering users to freely organize their code in non-linear ways. With sticky cells that are always shown on the screen, users can quickly access their notes, instantly observe experiment results, and easily build interactive dashboards that support complex visual analytics. Case studies highlight how our tool can enhance notebook users's productivity and identify opportunities for future notebook designs. StickyLand is available at https://github.com/xiaohk/stickyland.
    Temporal Subtyping of Alzheimer's Disease Using Medical Conditions Preceding Alzheimer's Disease Onset in Electronic Health Records. (arXiv:2202.10991v1 [cs.LG])
    Subtyping of Alzheimer's disease (AD) can facilitate diagnosis, treatment, prognosis and disease management. It can also support the testing of new prevention and treatment strategies through clinical trials. In this study, we employed spectral clustering to cluster 29,922 AD patients in the OneFlorida Data Trust using their longitudinal EHR data of diagnosis and conditions into four subtypes. These subtypes exhibit different patterns of progression of other conditions prior to the first AD diagnosis. In addition, according to the results of various statistical tests, these subtypes are also significantly different with respect to demographics, mortality, and prescription medications after the AD diagnosis. This study could potentially facilitate early detection and personalized treatment of AD as well as data-driven generalizability assessment of clinical trials for AD.
    POLE: Polarized Embedding for Signed Networks. (arXiv:2110.09899v3 [cs.SI] UPDATED)
    From the 2016 U.S. presidential election to the 2021 Capitol riots to the spread of misinformation related to COVID-19, many have blamed social media for today's deeply divided society. Recent advances in machine learning for signed networks hold the promise to guide small interventions with the goal of reducing polarization in social media. However, existing models are especially ineffective in predicting conflicts (or negative links) among users. This is due to a strong correlation between link signs and the network structure, where negative links between polarized communities are too sparse to be predicted even by state-of-the-art approaches. To address this problem, we first design a partition-agnostic polarization measure for signed graphs based on the signed random-walk and show that many real-world graphs are highly polarized. Then, we propose POLE (POLarized Embedding for signed networks), a signed embedding method for polarized graphs that captures both topological and signed similarities jointly via signed autocovariance. Through extensive experiments, we show that POLE significantly outperforms state-of-the-art methods in signed link prediction, particularly for negative links with gains of up to one order of magnitude.
    Probing GNN Explainers: A Rigorous Theoretical and Empirical Analysis of GNN Explanation Methods. (arXiv:2106.09078v2 [cs.LG] UPDATED)
    As Graph Neural Networks (GNNs) are increasingly being employed in critical real-world applications, several methods have been proposed in recent literature to explain the predictions of these models. However, there has been little to no work on systematically analyzing the reliability of these methods. Here, we introduce the first-ever theoretical analysis of the reliability of state-of-the-art GNN explanation methods. More specifically, we theoretically analyze the behavior of various state-of-the-art GNN explanation methods with respect to several desirable properties (e.g., faithfulness, stability, and fairness preservation) and establish upper bounds on the violation of these properties. We also empirically validate our theoretical results using extensive experimentation with nine real-world graph datasets. Our empirical results further shed light on several interesting insights about the behavior of state-of-the-art GNN explanation methods.
    Harmless interpolation in regression and classification with structured features. (arXiv:2111.05198v2 [stat.ML] UPDATED)
    Overparametrized neural networks tend to perfectly fit noisy training data yet generalize well on test data. Inspired by this empirical observation, recent work has sought to understand this phenomenon of benign overfitting or harmless interpolation in the much simpler linear model. Previous theoretical work critically assumes that either the data features are statistically independent or the input data is high-dimensional; this precludes general nonparametric settings with structured feature maps. In this paper, we present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space. A key contribution is that our framework describes precise sufficient conditions on the data Gram matrix under which harmless interpolation occurs. Our results recover prior independent-features results (with a much simpler analysis), but they furthermore show that harmless interpolation can occur in more general settings such as features that are a bounded orthonormal system. Furthermore, our results show an asymptotic separation between classification and regression performance in a manner that was previously only shown for Gaussian features.
    Efficient Wind Speed Nowcasting with GPU-Accelerated Nearest Neighbors Algorithm. (arXiv:2112.10408v2 [cs.LG] UPDATED)
    This paper proposes a simple yet efficient high-altitude wind nowcasting pipeline. It processes efficiently a vast amount of live data recorded by airplanes over the whole airspace and reconstructs the wind field with good accuracy. It creates a unique context for each point in the dataset and then extrapolates from it. As creating such context is computationally intensive, this paper proposes a novel algorithm that reduces the time and memory cost by efficiently fetching nearest neighbors in a data set whose elements are organized along smooth trajectories that can be approximated with piece-wise linear structures. We introduce an efficient and exact strategy implemented through algebraic tensorial operations, which is well-suited to modern GPU-based computing infrastructure. This method employs a scalable Euclidean metric and allows masking data points along one dimension. When applied, this method is more efficient than plain Euclidean k-NN and other well-known data selection methods such as KDTrees and provides a several-fold speedup. We provide an implementation in PyTorch and a novel data set to allow the replication of empirical results.
    Recognizing Concepts and Recognizing Musical Themes. A Quantum Semantic Analysis. (arXiv:2202.10941v1 [cs.LG])
    How are abstract concepts and musical themes recognized on the basis of some previous experience? It is interesting to compare the different behaviors of human and of artificial intelligences with respect to this problem. Generally, a human mind that abstracts a concept (say, table) from a given set of known examples creates a table-Gestalt: a kind of vague and out of focus image that does not fully correspond to a particular table with well determined features. A similar situation arises in the case of musical themes. Can the construction of a gestaltic pattern, which is so natural for human minds, be taught to an intelligent machine? This problem can be successfully discussed in the framework of a quantum approach to pattern recognition and to machine learning. The basic idea is replacing classical data sets with quantum data sets, where either objects or musical themes can be formally represented as pieces of quantum information, involving the uncertainties and the ambiguities that characterize the quantum world. In this framework, the intuitive concept of Gestalt can be simulated by the mathematical concept of positive centroid of a given quantum data set. Accordingly, the crucial problem "how can we classify a new object or a new musical theme (we have listened to) on the basis of a previous experience?" can be dealt with in terms of some special quantum similarity-relations. Although recognition procedures are different for human and for artificial intelligences, there is a common method of "facing the problems" that seems to work in both cases.
    Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future. (arXiv:2106.04420v2 [cs.LG] UPDATED)
    In real-time forecasting in public health, data collection is a non-trivial and demanding task. Often after initially released, it undergoes several revisions later (maybe due to human or technical constraints) - as a result, it may take weeks until the data reaches to a stable value. This so-called 'backfill' phenomenon and its effect on model performance has been barely studied in the prior literature. In this paper, we introduce the multi-variate backfill problem using COVID-19 as the motivating example. We construct a detailed dataset composed of relevant signals over the past year of the pandemic. We then systematically characterize several patterns in backfill dynamics and leverage our observations for formulating a novel problem and neural framework Back2Future that aims to refines a given model's predictions in real-time. Our extensive experiments demonstrate that our method refines the performance of top models for COVID-19 forecasting, in contrast to non-trivial baselines, yielding 18% improvement over baselines, enabling us obtain a new SOTA performance. In addition, we show that our model improves model evaluation too; hence policy-makers can better understand the true accuracy of forecasting models in real-time.
    Learning Competitive Equilibria in Exchange Economies with Bandit Feedback. (arXiv:2106.06616v2 [cs.LG] UPDATED)
    The sharing of scarce resources among multiple rational agents is one of the classical problems in economics. In exchange economies, which are used to model such situations, agents begin with an initial endowment of resources and exchange them in a way that is mutually beneficial until they reach a competitive equilibrium (CE). The allocations at a CE are Pareto efficient and fair. Consequently, they are used widely in designing mechanisms for fair division. However, computing CEs requires the knowledge of agent preferences which are unknown in several applications of interest. In this work, we explore a new online learning mechanism, which, on each round, allocates resources to the agents and collects stochastic feedback on their experience in using that allocation. Its goal is to learn the agent utilities via this feedback and imitate the allocations at a CE in the long run. We quantify CE behavior via two losses and propose a randomized algorithm which achieves sublinear loss under a parametric class of utilities. Empirically, we demonstrate the effectiveness of this mechanism through numerical simulations.
    Better Private Algorithms for Correlation Clustering. (arXiv:2202.10747v1 [cs.LG])
    In machine learning, correlation clustering is an important problem whose goal is to partition the individuals into groups that correlate with their pairwise similarities as much as possible. In this work, we revisit the correlation clustering under the differential privacy constraints. Particularly, we improve previous results and achieve an $\Tilde{O}(n^{1.5})$ additive error compared to the optimal cost in expectation on general graphs. As for unweighted complete graphs, we improve the results further and propose a more involved algorithm which achieves $\Tilde{O}(n \sqrt{\Delta^*})$ additive error, where $\Delta^*$ is the maximum degrees of positive edges among all nodes.
    Why Fair Labels Can Yield Unfair Predictions: Graphical Conditions for Introduced Unfairness. (arXiv:2202.10816v1 [cs.LG])
    In addition to reproducing discriminatory relationships in the training data, machine learning systems can also introduce or amplify discriminatory effects. We refer to this as introduced unfairness, and investigate the conditions under which it may arise. To this end, we propose introduced total variation as a measure of introduced unfairness, and establish graphical conditions under which it may be incentivised to occur. These criteria imply that adding the sensitive attribute as a feature removes the incentive for introduced variation under well-behaved loss functions. Additionally, taking a causal perspective, introduced path-specific effects shed light on the issue of when specific paths should be considered fair.
    Neural Ensemble Search for Uncertainty Estimation and Dataset Shift. (arXiv:2006.08573v3 [cs.LG] UPDATED)
    Ensembles of neural networks achieve superior performance compared to stand-alone networks in terms of accuracy, uncertainty calibration and robustness to dataset shift. \emph{Deep ensembles}, a state-of-the-art method for uncertainty estimation, only ensemble random initializations of a \emph{fixed} architecture. Instead, we propose two methods for automatically constructing ensembles with \emph{varying} architectures, which implicitly trade-off individual architectures' strengths against the ensemble's diversity and exploit architectural variation as a source of diversity. On a variety of classification tasks and modern architecture search spaces, we show that the resulting ensembles outperform deep ensembles not only in terms of accuracy but also uncertainty calibration and robustness to dataset shift. Our further analysis and ablation studies provide evidence of higher ensemble diversity due to architectural variation, resulting in ensembles that can outperform deep ensembles, even when having weaker average base learners. To foster reproducibility, our code is available: \url{https://github.com/automl/nes}
    Efficient and Differentiable Conformal Prediction with General Function Classes. (arXiv:2202.11091v1 [cs.LG])
    Quantifying the data uncertainty in learning tasks is often done by learning a prediction interval or prediction set of the label given the input. Two commonly desired properties for learned prediction sets are \emph{valid coverage} and \emph{good efficiency} (such as low length or low cardinality). Conformal prediction is a powerful technique for learning prediction sets with valid coverage, yet by default its conformalization step only learns a single parameter, and does not optimize the efficiency over more expressive function classes. In this paper, we propose a generalization of conformal prediction to multiple learnable parameters, by considering the constrained empirical risk minimization (ERM) problem of finding the most efficient prediction set subject to valid empirical coverage. This meta-algorithm generalizes existing conformal prediction algorithms, and we show that it achieves approximate valid population coverage and near-optimal efficiency within class, whenever the function class in the conformalization step is low-capacity in a certain sense. Next, this ERM problem is challenging to optimize as it involves a non-differentiable coverage constraint. We develop a gradient-based algorithm for it by approximating the original constrained ERM using differentiable surrogate losses and Lagrangians. Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly over existing approaches in several applications such as prediction intervals with improved length, minimum-volume prediction sets for multi-output regression, and label prediction sets for image classification.
    No-Regret Learning in Partially-Informed Auctions. (arXiv:2202.10606v1 [cs.LG])
    Auctions with partially-revealed information about items are broadly employed in real-world applications, but the underlying mechanisms have limited theoretical support. In this work, we study a machine learning formulation of these types of mechanisms, presenting algorithms that are no-regret from the buyer's perspective. Specifically, a buyer who wishes to maximize his utility interacts repeatedly with a platform over a series of $T$ rounds. In each round, a new item is drawn from an unknown distribution and the platform publishes a price together with incomplete, "masked" information about the item. The buyer then decides whether to purchase the item. We formalize this problem as an online learning task where the goal is to have low regret with respect to a myopic oracle that has perfect knowledge of the distribution over items and the seller's masking function. When the distribution over items is known to the buyer and the mask is a SimHash function mapping $\mathbb{R}^d$ to $\{0,1\}^{\ell}$, our algorithm has regret $\tilde {\mathcal{O}}((Td\ell)^{\frac{1}{2}})$. In a fully agnostic setting when the mask is an arbitrary function mapping to a set of size $n$, our algorithm has regret $\tilde {\mathcal{O}}(T^{\frac{3}{4}}n^{\frac{1}{2}})$. Finally, when the prices are stochastic, the algorithm has regret $\tilde {\mathcal{O}}((Tn)^{\frac{1}{2}})$.
    Gradient Based Activations for Accurate Bias-Free Learning. (arXiv:2202.10943v1 [cs.LG])
    Bias mitigation in machine learning models is imperative, yet challenging. While several approaches have been proposed, one view towards mitigating bias is through adversarial learning. A discriminator is used to identify the bias attributes such as gender, age or race in question. This discriminator is used adversarially to ensure that it cannot distinguish the bias attributes. The main drawback in such a model is that it directly introduces a trade-off with accuracy as the features that the discriminator deems to be sensitive for discrimination of bias could be correlated with classification. In this work we solve the problem. We show that a biased discriminator can actually be used to improve this bias-accuracy tradeoff. Specifically, this is achieved by using a feature masking approach using the discriminator's gradients. We ensure that the features favoured for the bias discrimination are de-emphasized and the unbiased features are enhanced during classification. We show that this simple approach works well to reduce bias as well as improve accuracy significantly. We evaluate the proposed model on standard benchmarks. We improve the accuracy of the adversarial methods while maintaining or even improving the unbiasness and also outperform several other recent methods.
    SDFDiff: Differentiable Rendering of Signed Distance Fields for 3D Shape Optimization. (arXiv:1912.07109v2 [cs.CV] UPDATED)
    We propose SDFDiff, a novel approach for image-based shape optimization using differentiable rendering of 3D shapes represented by signed distance functions (SDFs). Compared to other representations, SDFs have the advantage that they can represent shapes with arbitrary topology, and that they guarantee watertight surfaces. We apply our approach to the problem of multi-view 3D reconstruction, where we achieve high reconstruction quality and can capture complex topology of 3D objects. In addition, we employ a multi-resolution strategy to obtain a robust optimization algorithm. We further demonstrate that our SDF-based differentiable renderer can be integrated with deep learning models, which opens up options for learning approaches on 3D objects without 3D supervision. In particular, we apply our method to single-view 3D reconstruction and achieve state-of-the-art results.
    Permutation-equivariant and Proximity-aware Graph Neural Networks with Stochastic Message Passing. (arXiv:2009.02562v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are emerging machine learning models on graphs. Permutation-equivariance and proximity-awareness are two important properties highly desirable for GNNs. Both properties are needed to tackle some challenging graph problems, such as finding communities and leaders. In this paper, we first analytically show that the existing GNNs, mostly based on the message-passing mechanism, cannot simultaneously preserve the two properties. Then, we propose Stochastic Message Passing (SMP) model, a general and simple GNN to maintain both proximity-awareness and permutation-equivariance. In order to preserve node proximities, we augment the existing GNNs with stochastic node representations. We theoretically prove that the mechanism can enable GNNs to preserve node proximities, and at the same time, maintain permutation-equivariance with certain parametrization. We report extensive experimental results on ten datasets and demonstrate the effectiveness and efficiency of SMP for various typical graph mining tasks, including graph reconstruction, node classification, and link prediction.
    A Survey of Vision-Language Pre-Trained Models. (arXiv:2202.10936v1 [cs.CV])
    As Transformer evolved, pre-trained models have advanced at a breakneck pace in recent years. They have dominated the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to the field of Vision-and-Language (V-L) learning and improve the performance on downstream tasks becomes a focus of multimodal learning. In this paper, we review the recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts to single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs in modeling the interaction between text and image representations. We further present widely-used pre-training tasks, after which we introduce some common downstream tasks. We finally conclude this paper and present some promising research directions. Our survey aims to provide multimodal researchers a synthesis and pointer to related research.
    Message passing all the way up. (arXiv:2202.11097v1 [cs.LG])
    The message passing framework is the foundation of the immense success enjoyed by graph neural networks (GNNs) in recent years. In spite of its elegance, there exist many problems it provably cannot solve over given input graphs. This has led to a surge of research on going "beyond message passing", building GNNs which do not suffer from those limitations -- a term which has become ubiquitous in regular discourse. However, have those methods truly moved beyond message passing? In this position paper, I argue about the dangers of using this term -- especially when teaching graph representation learning to newcomers. I show that any function of interest we want to compute over graphs can, in all likelihood, be expressed using pairwise message passing -- just over a potentially modified graph, and argue how most practical implementations subtly do this kind of trick anyway. Hoping to initiate a productive discussion, I propose replacing "beyond message passing" with a more tame term, "augmented message passing".
    An end-to-end predict-then-optimize clustering method for intelligent assignment problems in express systems. (arXiv:2202.10937v1 [cs.LG])
    Express systems play important roles in modern major cities. Couriers serving for the express system pick up packages in certain areas of interest (AOI) during a specific time. However, future pick-up requests vary significantly with time. While the assignment results are generally static without changing with time. Using the historical pick-up request number to conduct AOI assignment (or pick-up request assignment) for couriers is thus unreasonable. Moreover, even we can first predict future pick-up requests and then use the prediction results to conduct the assignments, this kind of two-stage method is also impractical and trivial, and exists some drawbacks, such as the best prediction results might not ensure the best clustering results. To solve these problems, we put forward an intelligent end-to-end predict-then-optimize clustering method to simultaneously predict the future pick-up requests of AOIs and assign AOIs to couriers by clustering. At first, we propose a deep learning-based prediction model to predict order numbers on AOIs. Then a differential constrained K-means clustering method is introduced to cluster AOIs based on the prediction results. We finally propose a one-stage end-to-end predict-then-optimize clustering method to assign AOIs to couriers reasonably, dynamically, and intelligently. Results show that this kind of one-stage predict-then-optimize method is beneficial to improve the performance of optimization results, namely the clustering results. This study can provide critical experiences for predict-and-optimize related tasks and intelligent assignment problems in express systems.
    Provably convergent quasistatic dynamics for mean-field two-player zero-sum games. (arXiv:2202.10947v1 [cs.GT])
    In this paper, we study the problem of finding mixed Nash equilibrium for mean-field two-player zero-sum games. Solving this problem requires optimizing over two probability distributions. We consider a quasistatic Wasserstein gradient flow dynamics in which one probability distribution follows the Wasserstein gradient flow, while the other one is always at the equilibrium. Theoretical analysis are conducted on this dynamics, showing its convergence to the mixed Nash equilibrium under mild conditions. Inspired by the continuous dynamics of probability distributions, we derive a quasistatic Langevin gradient descent method with inner-outer iterations, and test the method on different problems, including training mixture of GANs.
    Multi-Source Unsupervised Domain Adaptation via Pseudo Target Domain. (arXiv:2202.10725v1 [cs.LG])
    Multi-source domain adaptation (MDA) aims to transfer knowledge from multiple source domains to an unlabeled target domain. MDA is a challenging task due to the severe domain shift, which not only exists between target and source but also exists among diverse sources. Prior studies on MDA either estimate a mixed distribution of source domains or combine multiple single-source models, but few of them delve into the relevant information among diverse source domains. For this reason, we propose a novel MDA approach, termed Pseudo Target for MDA (PTMDA). Specifically, PTMDA maps each group of source and target domains into a group-specific subspace using adversarial learning with a metric constraint, and constructs a series of pseudo target domains correspondingly. Then we align the remainder source domains with the pseudo target domain in the subspace efficiently, which allows to exploit additional structured source information through the training on pseudo target domain and improves the performance on the real target domain. Besides, to improve the transferability of deep neural networks (DNNs), we replace the traditional batch normalization layer with an effective matching normalization layer, which enforces alignments in latent layers of DNNs and thus gains further promotion. We give theoretical analysis showing that PTMDA as a whole can reduce the target error bound and leads to a better approximation of the target risk in MDA settings. Extensive experiments demonstrate PTMDA's effectiveness on MDA tasks, as it outperforms state-of-the-art methods in most experimental settings.
    Transformation Coding: Simple Objectives for Equivariant Representations. (arXiv:2202.10930v1 [cs.LG])
    We present a simple non-generative approach to deep representation learning that seeks equivariant deep embedding through simple objectives. In contrast to existing equivariant networks, our transformation coding approach does not constrain the choice of the feed-forward layer or the architecture and allows for an unknown group action on the input space. We introduce several such transformation coding objectives for different Lie groups such as the Euclidean, Orthogonal and the Unitary groups. When using product groups, the representation is decomposed and disentangled. We show that the presence of additional information on different transformations improves disentanglement in transformation coding. We evaluate the representations learnt by transformation coding both qualitatively and quantitatively on downstream tasks, including reinforcement learning.
    Thinking the Fusion Strategy of Multi-reference Face Reenactment. (arXiv:2202.10758v1 [cs.CV])
    In recent advances of deep generative models, face reenactment -manipulating and controlling human face, including their head movement-has drawn much attention for its wide range of applicability. Despite its strong expressiveness, it is inevitable that the models fail to reconstruct or accurately generate unseen side of the face of a given single reference image. Most of existing methods alleviate this problem by learning appearances of human faces from large amount of data and generate realistic texture at inference time. Rather than completely relying on what generative models learn, we show that simple extension by using multiple reference images significantly improves generation quality. We show this by 1) conducting the reconstruction task on publicly available dataset, 2) conducting facial motion transfer on our original dataset which consists of multi-person's head movement video sequences, and 3) using a newly proposed evaluation metric to validate that our method achieves better quantitative results.
    Learning Dynamics and Structure of Complex Systems Using Graph Neural Networks. (arXiv:2202.10996v1 [cs.LG])
    Many complex systems are composed of interacting parts, and the underlying laws are usually simple and universal. While graph neural networks provide a useful relational inductive bias for modeling such systems, generalization to new system instances of the same type is less studied. In this work we trained graph neural networks to fit time series from an example nonlinear dynamical system, the belief propagation algorithm. We found simple interpretations of the learned representation and model components, and they are consistent with core properties of the probabilistic inference algorithm. We successfully identified a `graph translator' between the statistical interactions in belief propagation and parameters of the corresponding trained network, and showed that it enables two types of novel generalization: to recover the underlying structure of a new system instance based solely on time series observations, or to construct a new network from this structure directly. Our results demonstrated a path towards understanding both dynamics and structure of a complex system and how such understanding can be used for generalization.
    Convolutional Neural Network Modelling for MODIS Land Surface Temperature Super-Resolution. (arXiv:2202.10753v1 [cs.CV])
    Nowadays, thermal infrared satellite remote sensors enable to extract very interesting information at large scale, in particular Land Surface Temperature (LST). However such data are limited in spatial and/or temporal resolutions which prevents from an analysis at fine scales. For example, MODIS satellite provides daily acquisitions with 1Km spatial resolutions which is not sufficient to deal with highly heterogeneous environments as agricultural parcels. Therefore, image super-resolution is a crucial task to better exploit MODIS LSTs. This issue is tackled in this paper. We introduce a deep learning-based algorithm, named Multi-residual U-Net, for super-resolution of MODIS LST single-images. Our proposed network is a modified version of U-Net architecture, which aims at super-resolving the input LST image from 1Km to 250m per pixel. The results show that our Multi-residual U-Net outperforms other state-of-the-art methods.
    A Deep Learning Approach to Predicting Ventilator Parameters for Mechanically Ventilated Septic Patients. (arXiv:2202.10921v1 [q-bio.QM])
    We develop a deep learning approach to predicting a set of ventilator parameters for a mechanically ventilated septic patient using a long and short term memory (LSTM) recurrent neural network (RNN) model. We focus on short-term predictions of a set of ventilator parameters for the septic patient in emergency intensive care unit (EICU). The short-term predictability of the model provides attending physicians with early warnings to make timely adjustment to the treatment of the patient in the EICU. The patient specific deep learning model can be trained on any given critically ill patient, making it an intelligent aide for physicians to use in emergent medical situations.
    PyTorch Geometric Signed Directed: A Survey and Software on Graph Neural Networks for Signed and Directed Graphs. (arXiv:2202.10793v1 [cs.LG])
    Signed networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many signed networks are directed, there is a lack of survey papers and software packages on graph neural networks (GNNs) specially designed for directed networks. In this paper, we present PyTorch Geometric Signed Directed, a survey and software on GNNs for signed and directed networks. We review typical tasks, loss functions and evaluation metrics in the analysis of signed and directed networks, discuss data used in related experiments, and provide an overview of methods proposed. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. The software is presented in a modular fashion, so that signed and directed networks can also be treated separately. As an extension library for PyTorch Geometric, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. Our code is publicly available at \url{https://github.com/SherylHYX/pytorch_geometric_signed_directed}.
    NU HLT at CMCL 2022 Shared Task: Multilingual and Crosslingual Prediction of Human Reading Behavior in Universal Language Space. (arXiv:2202.10855v1 [cs.CL])
    In this paper, we present a unified model that works for both multilingual and crosslingual prediction of reading times of words in various languages. The secret behind the success of this model is in the preprocessing step where all words are transformed to their universal language representation via the International Phonetic Alphabet (IPA). To the best of our knowledge, this is the first study to favorable exploit this phonological property of language for the two tasks. Various feature types were extracted covering basic frequencies, n-grams, information theoretic, and psycholinguistically-motivated predictors for model training. A finetuned Random Forest model obtained best performance for both tasks with 3.8031 and 3.9065 MAE scores for mean first fixation duration (FFDAve) and mean total reading time (TRTAve) respectively.
    Explicit Regularization via Regularizer Mirror Descent. (arXiv:2202.10788v1 [cs.LG])
    Despite perfectly interpolating the training data, deep neural networks (DNNs) can often generalize fairly well, in part due to the "implicit regularization" induced by the learning algorithm. Nonetheless, various forms of regularization, such as "explicit regularization" (via weight decay), are often used to avoid overfitting, especially when the data is corrupted. There are several challenges with explicit regularization, most notably unclear convergence properties. Inspired by convergence properties of stochastic mirror descent (SMD) algorithms, we propose a new method for training DNNs with regularization, called regularizer mirror descent (RMD). In highly overparameterized DNNs, SMD simultaneously interpolates the training data and minimizes a certain potential function of the weights. RMD starts with a standard cost which is the sum of the training loss and a convex regularizer of the weights. Reinterpreting this cost as the potential of an "augmented" overparameterized network and applying SMD yields RMD. As a result, RMD inherits the properties of SMD and provably converges to a point "close" to the minimizer of this cost. RMD is computationally comparable to stochastic gradient descent (SGD) and weight decay, and is parallelizable in the same manner. Our experimental results on training sets with various levels of corruption suggest that the generalization performance of RMD is remarkably robust and significantly better than both SGD and weight decay, which implicitly and explicitly regularize the $\ell_2$ norm of the weights. RMD can also be used to regularize the weights to a desired weight vector, which is particularly relevant for continual learning.
    Remaining Useful Life Prediction Using Temporal Deep Degradation Network for Complex Machinery with Attention-based Feature Extraction. (arXiv:2202.10916v1 [cs.LG])
    The precise estimate of remaining useful life (RUL) is vital for the prognostic analysis and predictive maintenance that can significantly reduce failure rate and maintenance costs. The degradation-related features extracted from the sensor streaming data with neural networks can dramatically improve the accuracy of the RUL prediction. The Temporal deep degradation network (TDDN) model is proposed to make the RUL prediction with the degradation-related features given by the one-dimensional convolutional neural network (1D CNN) feature extraction and attention mechanism. 1D CNN is used to extract the temporal features from the streaming sensor data. Temporal features have monotonic degradation trends from the fluctuating raw sensor streaming data. Attention mechanism can improve the RUL prediction performance by capturing the fault characteristics and the degradation development with the attention weights. The performance of the TDDN model is evaluated on the public C-MAPSS dataset and compared with the existing methods. The results show that the TDDN model can achieve the best RUL prediction accuracy in complex conditions compared to current machine learning models. The degradation-related features extracted from the high-dimension sensor streaming data demonstrate the clear degradation trajectories and degradation stages that enable TDDN to predict the turbofan-engine RUL accurately and efficiently.
    A Novel Anomaly Detection Method for Multimodal WSN Data Flow via a Dynamic Graph Neural Network. (arXiv:2202.10454v1 [cs.LG])
    Anomaly detection is widely used to distinguish system anomalies by analyzing the temporal and spatial features of wireless sensor network (WSN) data streams; it is one of critical technique that ensures the reliability of WSNs. Currently, graph neural networks (GNNs) have become popular state-of-the-art methods for conducting anomaly detection on WSN data streams. However, the existing anomaly detection methods based on GNNs do not consider the temporal and spatial features of WSN data streams simultaneously, such as multi-node, multi-modal and multi-time features, seriously impacting their effectiveness. In this paper, a novel anomaly detection model is proposed for multimodal WSN data flows, where three GNNs are used to separately extract the temporal features of WSN data flows, the correlation features between different modes and the spatial features between sensor node positions. Specifically, first, the temporal features and modal correlation features extracted from each sensor node are fused into one vector representation, which is further aggregated with the spatial features, i.e., the spatial position relationships of the nodes; finally, the current time-series data of WSN nodes are predicted, and abnormal states are identified according to the fusion features. The simulation results obtained on a public dataset show that the proposed approach is able to significantly improve upon the existing methods in terms of its robustness, and its F1 score reaches 0.90, which is 14.2% higher than that of the graph convolution network (GCN) with long short-term memory (LSTM).
    Improving Systematic Generalization Through Modularity and Augmentation. (arXiv:2202.10745v1 [cs.AI])
    Systematic generalization is the ability to combine known parts into novel meaning; an important aspect of efficient human learning, but a weakness of neural network learning. In this work, we investigate how two well-known modeling principles -- modularity and data augmentation -- affect systematic generalization of neural networks in grounded language learning. We analyze how large the vocabulary needs to be to achieve systematic generalization and how similar the augmented data needs to be to the problem at hand. Our findings show that even in the controlled setting of a synthetic benchmark, achieving systematic generalization remains very difficult. After training on an augmented dataset with almost forty times more adverbs than the original problem, a non-modular baseline is not able to systematically generalize to a novel combination of a known verb and adverb. When separating the task into cognitive processes like perception and navigation, a modular neural network is able to utilize the augmented data and generalize more systematically, achieving 70% and 40% exact match increase over state-of-the-art on two gSCAN tests that have not previously been improved. We hope that this work gives insight into the drivers of systematic generalization, and what we still need to improve for neural networks to learn more like humans do.
    UncertaINR: Uncertainty Quantification of End-to-End Implicit Neural Representations for Computed Tomography. (arXiv:2202.10847v1 [eess.IV])
    Implicit neural representations (INRs) have achieved impressive results for scene reconstruction and computer graphics, where their performance has primarily been assessed on reconstruction accuracy. However, in medical imaging, where the reconstruction problem is underdetermined and model predictions inform high-stakes diagnoses, uncertainty quantification of INR inference is critical. To that end, we study UncertaINR: a Bayesian reformulation of INR-based image reconstruction, for computed tomography (CT). We test several Bayesian deep learning implementations of UncertaINR and find that they achieve well-calibrated uncertainty, while retaining accuracy competitive with other classical, INR-based, and CNN-based reconstruction techniques. In contrast to the best-performing prior approaches, UncertaINR does not require a large training dataset, but only a handful of validation images.
    Stochastic Causal Programming for Bounding Treatment Effects. (arXiv:2202.10806v1 [stat.ML])
    Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem, bounding treatment effects from multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. We consider a framework where observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.
    Partial Identification with Noisy Covariates: A Robust Optimization Approach. (arXiv:2202.10665v1 [cs.LG])
    Causal inference from observational datasets often relies on measuring and adjusting for covariates. In practice, measurements of the covariates can often be noisy and/or biased, or only measurements of their proxies may be available. Directly adjusting for these imperfect measurements of the covariates can lead to biased causal estimates. Moreover, without additional assumptions, the causal effects are not point-identifiable due to the noise in these measurements. To this end, we study the partial identification of causal effects given noisy covariates, under a user-specified assumption on the noise level. The key observation is that we can formulate the identification of the average treatment effects (ATE) as a robust optimization problem. This formulation leads to an efficient robust optimization algorithm that bounds the ATE with noisy covariates. We show that this robust optimization approach can extend a wide range of causal adjustment methods to perform partial identification, including backdoor adjustment, inverse propensity score weighting, double machine learning, and front door adjustment. Across synthetic and real datasets, we find that this approach provides ATE bounds with a higher coverage probability than existing methods.
    Learning Infomax and Domain-Independent Representations for Causal Effect Inference with Real-World Data. (arXiv:2202.10885v1 [stat.ML])
    The foremost challenge to causal inference with real-world data is to handle the imbalance in the covariates with respect to different treatment options, caused by treatment selection bias. To address this issue, recent literature has explored domain-invariant representation learning based on different domain divergence metrics (e.g., Wasserstein distance, maximum mean discrepancy, position-dependent metric, and domain overlap). In this paper, we reveal the weaknesses of these strategies, i.e., they lead to the loss of predictive information when enforcing the domain invariance; and the treatment effect estimation performance is unstable, which heavily relies on the characteristics of the domain distributions and the choice of domain divergence metrics. Motivated by information theory, we propose to learn the Infomax and Domain-Independent Representations to solve the above puzzles. Our method utilizes the mutual information between the global feature representations and individual feature representations, and the mutual information between feature representations and treatment assignment predictions, in order to maximally capture the common predictive information for both treatment and control groups. Moreover, our method filters out the influence of instrumental and irrelevant variables, and thus it effectively increases the predictive ability of potential outcomes. Experimental results on both the synthetic and real-world datasets show that our method achieves state-of-the-art performance on causal effect inference. Moreover, our method exhibits reliable prediction performances when facing data with different characteristics of data distributions, complicated variable types, and severe covariate imbalance.
    Speciesist bias in AI -- How AI applications perpetuate discrimination and unfair outcomes against animals. (arXiv:2202.10848v1 [cs.LG])
    Massive efforts are made to reduce biases in both data and algorithms in order to render AI applications fair. These efforts are propelled by various high-profile cases where biased algorithmic decision-making caused harm to women, people of color, minorities, etc. However, the AI fairness field still succumbs to a blind spot, namely its insensitivity to discrimination against animals. This paper is the first to describe the 'speciesist bias' and investigate it in several different AI systems. Speciesist biases are learned and solidified by AI applications when they are trained on datasets in which speciesist patterns prevail. These patterns can be found in image recognition systems, large language models, and recommender systems. Therefore, AI technologies currently play a significant role in perpetuating and normalizing violence against animals. This can only be changed when AI fairness frameworks widen their scope and include mitigation measures for speciesist biases. This paper addresses the AI community in this regard and stresses the influence AI systems can have on either increasing or reducing the violence that is inflicted on animals, and especially on farmed animals.
    Confident Neural Network Regression with Bootstrapped Deep Ensembles. (arXiv:2202.10903v1 [stat.ML])
    With the rise of the popularity and usage of neural networks, trustworthy uncertainty estimation is becoming increasingly essential. In this paper we present a computationally cheap extension of Deep Ensembles for a regression setting called Bootstrapped Deep Ensembles that explicitly takes the effect of finite data into account using a modified version of the parametric bootstrap. We demonstrate through a simulation study that our method has comparable or better prediction intervals and superior confidence intervals compared to Deep Ensembles and other state-of-the-art methods. As an added bonus, our method is better capable of detecting overfitting than standard Deep Ensembles.
    Transporters with Visual Foresight for Solving Unseen Rearrangement Tasks. (arXiv:2202.10765v1 [cs.RO])
    Rearrangement tasks have been identified as a crucial challenge for intelligent robotic manipulation, but few methods allow for precise construction of unseen structures. We propose a visual foresight model for pick-and-place manipulation which is able to learn efficiently. In addition, we develop a multi-modal action proposal module which builds on Goal-Conditioned Transporter Networks, a state-of-the-art imitation learning method. Our method, Transporters with Visual Foresight (TVF), enables task planning from image data and is able to achieve multi-task learning and zero-shot generalization to unseen tasks with only a handful of expert demonstrations. TVF is able to improve the performance of a state-of-the-art imitation learning method on both training and unseen tasks in simulation and real robot experiments. In particular, the average success rate on unseen tasks improves from 55.0% to 77.9% in simulation experiments and from 30% to 63.3% in real robot experiments when given only tens of expert demonstrations. More details can be found on our project website: https://chirikjianlab.github.io/tvf/
    Seeing is Living? Rethinking the Security of Facial Liveness Verification in the Deepfake Era. (arXiv:2202.10673v1 [cs.CR])
    Facial Liveness Verification (FLV) is widely used for identity authentication in many security-sensitive domains and offered as Platform-as-a-Service (PaaS) by leading cloud vendors. Yet, with the rapid advances in synthetic media techniques (e.g., deepfake), the security of FLV is facing unprecedented challenges, about which little is known thus far. To bridge this gap, in this paper, we conduct the first systematic study on the security of FLV in real-world settings. Specifically, we present LiveBugger, a new deepfake-powered attack framework that enables customizable, automated security evaluation of FLV. Leveraging LiveBugger, we perform a comprehensive empirical assessment of representative FLV platforms, leading to a set of interesting findings. For instance, most FLV APIs do not use anti-deepfake detection; even for those with such defenses, their effectiveness is concerning (e.g., it may detect high-quality synthesized videos but fail to detect low-quality ones). We then conduct an in-depth analysis of the factors impacting the attack performance of LiveBugger: a) the bias (e.g., gender or race) in FLV can be exploited to select victims; b) adversarial training makes deepfake more effective to bypass FLV; c) the input quality has a varying influence on different deepfake techniques to bypass FLV. Based on these findings, we propose a customized, two-stage approach that can boost the attack success rate by up to 70%. Further, we run proof-of-concept attacks on several representative applications of FLV (i.e., the clients of FLV APIs) to illustrate the practical implications: due to the vulnerability of the APIs, many downstream applications are vulnerable to deepfake. Finally, we discuss potential countermeasures to improve the security of FLV. Our findings have been confirmed by the corresponding vendors.
    Convergence of online $k$-means. (arXiv:2202.10640v1 [cs.LG])
    We prove asymptotic convergence for a general class of $k$-means algorithms performed over streaming data from a distribution: the centers asymptotically converge to the set of stationary points of the $k$-means cost function. To do so, we show that online $k$-means over a distribution can be interpreted as stochastic gradient descent with a stochastic learning rate schedule. Then, we prove convergence by extending techniques used in optimization literature to handle settings where center-specific learning rates may depend on the past trajectory of the centers.
    Behaviour-Diverse Automatic Penetration Testing: A Curiosity-Driven Multi-Objective Deep Reinforcement Learning Approach. (arXiv:2202.10630v1 [cs.LG])
    Penetration Testing plays a critical role in evaluating the security of a target network by emulating real active adversaries. Deep Reinforcement Learning (RL) is seen as a promising solution to automating the process of penetration tests by reducing human effort and improving reliability. Existing RL solutions focus on finding a specific attack path to impact the target hosts. However, in reality, a diverse range of attack variations are needed to provide comprehensive assessments of the target network's security level. Hence, the attack agents must consider multiple objectives when penetrating the network. Nevertheless, this challenge is not adequately addressed in the existing literature. To this end, we formulate the automatic penetration testing in the Multi-Objective Reinforcement Learning (MORL) framework and propose a Chebyshev decomposition critic to find diverse adversary strategies that balance different objectives in the penetration test. Additionally, the number of available actions increases with the agent consistently probing the target network, making the training process intractable in many practical situations. Thus, we introduce a coverage-based masking mechanism that reduces attention on previously selected actions to help the agent adapt to future exploration. Experimental evaluation on a range of scenarios demonstrates the superiority of our proposed approach when compared to adapted algorithms in terms of multi-objective learning and performance efficiency.
    Decentralized Safe Multi-agent Stochastic Optimal Control using Deep FBSDEs and ADMM. (arXiv:2202.10658v1 [cs.MA])
    In this work, we propose a novel safe and scalable decentralized solution for multi-agent control in the presence of stochastic disturbances. Safety is mathematically encoded using stochastic control barrier functions and safe controls are computed by solving quadratic programs. Decentralization is achieved by augmenting to each agent's optimization variables, copy variables, for its neighbors. This allows us to decouple the centralized multi-agent optimization problem. However, to ensure safety, neighboring agents must agree on "what is safe for both of us" and this creates a need for consensus. To enable safe consensus solutions, we incorporate an ADMM-based approach. Specifically, we propose a Merged CADMM-OSQP implicit neural network layer, that solves a mini-batch of both, local quadratic programs as well as the overall consensus problem, as a single optimization problem. This layer is embedded within a Deep FBSDEs network architecture at every time step, to facilitate end-to-end differentiable, safe and decentralized stochastic optimal control. The efficacy of the proposed approach is demonstrated on several challenging multi-robot tasks in simulation. By imposing requirements on safety specified by collision avoidance constraints, the safe operation of all agents is ensured during the entire training process. We also demonstrate superior scalability in terms of computational and memory savings as compared to a centralized approach.
    On the Effectiveness of Adversarial Training against Backdoor Attacks. (arXiv:2202.10627v1 [cs.LG])
    DNNs' demand for massive data forces practitioners to collect data from the Internet without careful check due to the unacceptable cost, which brings potential risks of backdoor attacks. A backdoored model always predicts a target class in the presence of a predefined trigger pattern, which can be easily realized via poisoning a small amount of data. In general, adversarial training is believed to defend against backdoor attacks since it helps models to keep their prediction unchanged even if we perturb the input image (as long as within a feasible range). Unfortunately, few previous studies succeed in doing so. To explore whether adversarial training could defend against backdoor attacks or not, we conduct extensive experiments across different threat models and perturbation budgets, and find the threat model in adversarial training matters. For instance, adversarial training with spatial adversarial examples provides notable robustness against commonly-used patch-based backdoor attacks. We further propose a hybrid strategy which provides satisfactory robustness across different backdoor attacks.
    T-METASET: Task-Aware Generation of Metamaterial Datasets by Diversity-Based Active Learning. (arXiv:2202.10565v1 [cs.CE])
    Inspired by the recent success of deep learning in diverse domains, data-driven metamaterials design has emerged as a compelling design paradigm to unlock the potential of multiscale architecture. However, existing model-centric approaches lack principled methodologies dedicated to high-quality data generation. Resorting to space-filling design in shape descriptor space, existing metamaterial datasets suffer from property distributions that are either highly imbalanced or at odds with design tasks of interest. To this end, we propose t-METASET: an intelligent data acquisition framework for task-aware dataset generation. We seek a solution to a commonplace yet frequently overlooked scenario at early design stages: when a massive ($~\sim O(10^4)$) shape library has been prepared with no properties evaluated. The key idea is to exploit a data-driven shape descriptor learned from generative models, fit a sparse regressor as the start-up agent, and leverage diversity-related metrics to drive data acquisition to areas that help designers fulfill design goals. We validate the proposed framework in three hypothetical deployment scenarios, which encompass general use, task-aware use, and tailorable use. Two large-scale shape-only mechanical metamaterial datasets are used as test datasets. The results demonstrate that t-METASET can incrementally grow task-aware datasets. Applicable to general design representations, t-METASET can boost future advancements of not only metamaterials but data-driven design in other domains.
    Data-Driven Traffic Assignment: A Novel Approach for Learning Traffic Flow Patterns Using a Graph Convolutional Neural Network. (arXiv:2202.10508v1 [cs.LG])
    We present a novel data-driven approach of learning traffic flow patterns of a transportation network given that many instances of origin to destination (OD) travel demand and link flows of the network are available. Instead of estimating traffic flow patterns assuming certain user behavior (e.g., user equilibrium or system optimal), here we explore the idea of learning those flow patterns directly from the data. To implement this idea, we have formulated the traffic-assignment problem as a data-driven learning problem and developed a neural network-based framework known as Graph Convolutional Neural Network (GCNN) to solve it. The proposed framework represents the transportation network and OD demand in an efficient way and utilizes the diffusion process of multiple OD demands from nodes to links. We validate the solutions of the model against analytical solutions generated from running static user equilibrium-based traffic assignments over Sioux Falls and East Massachusetts networks. The validation result shows that the implemented GCNN model can learn the flow patterns very well with less than 2% mean absolute difference between the actual and estimated link flows for both networks under varying congested conditions. When the training of the model is complete, it can instantly determine the traffic flows of a large-scale network. Hence this approach can overcome the challenges of deploying traffic assignment models over large-scale networks and open new directions of research in data-driven network modeling.
    Knowledge-informed Molecular Learning: A Survey on Paradigm Transfer. (arXiv:2202.10587v1 [cs.LG])
    Machine learning, especially deep learning, has greatly advanced molecular studies in the biochemical domain. Most typically, modeling for most molecular tasks have converged to several paradigms. For example, we usually adopt the prediction paradigm to solve tasks of molecular property prediction. To improve the generation and interpretability of purely data-driven models, researchers have incorporated biochemical domain knowledge into these models for molecular studies. This knowledge incorporation has led to a rising trend of paradigm transfer, which is solving one molecular learning task by reformulating it as another one. In this paper, we present a literature review towards knowledge-informed molecular learning in perspective of paradigm transfer, where we categorize the paradigms, review their methods and analyze how domain knowledge contributes. Furthermore, we summarize the trends and point out interesting future directions for molecular learning.
    Adversarial Attacks on Speech Recognition Systems for Mission-Critical Applications: A Survey. (arXiv:2202.10594v1 [cs.SD])
    A Machine-Critical Application is a system that is fundamentally necessary to the success of specific and sensitive operations such as search and recovery, rescue, military, and emergency management actions. Recent advances in Machine Learning, Natural Language Processing, voice recognition, and speech processing technologies have naturally allowed the development and deployment of speech-based conversational interfaces to interact with various machine-critical applications. While these conversational interfaces have allowed users to give voice commands to carry out strategic and critical activities, their robustness to adversarial attacks remains uncertain and unclear. Indeed, Adversarial Artificial Intelligence (AI) which refers to a set of techniques that attempt to fool machine learning models with deceptive data, is a growing threat in the AI and machine learning research community, in particular for machine-critical applications. The most common reason of adversarial attacks is to cause a malfunction in a machine learning model. An adversarial attack might entail presenting a model with inaccurate or fabricated samples as it's training data, or introducing maliciously designed data to deceive an already trained model. While focusing on speech recognition for machine-critical applications, in this paper, we first review existing speech recognition techniques, then, we investigate the effectiveness of adversarial attacks and defenses against these systems, before outlining research challenges, defense recommendations, and future work. This paper is expected to serve researchers and practitioners as a reference to help them in understanding the challenges, position themselves and, ultimately, help them to improve existing models of speech recognition for mission-critical applications. Keywords: Mission-Critical Applications, Adversarial AI, Speech Recognition Systems.
    Benchmarking missing-values approaches for predictive models on health databases. (arXiv:2202.10580v1 [cs.LG])
    BACKGROUND: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative -- rather than generative -- modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics. RESULTS: Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain imaging one, a health survey and two intensive care ones. Using gradient-boosted trees, we compare native support for missing values with simple and state-of-the-art imputation prior to learning. We investigate prediction accuracy and computational time. For prediction after imputation, we find that adding an indicator to express which values have been imputed is important, suggesting that the data are missing not at random. Elaborate missing values imputation can improve prediction compared to simple strategies but requires longer computational time on large data. Learning trees that model missing values-with missing incorporated attribute-leads to robust, fast, and well-performing predictive modeling. CONCLUSIONS: Native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost. When using imputation, it is important to add indicator columns expressing which values have been imputed.
    Graph Lifelong Learning: A Survey. (arXiv:2202.10688v1 [cs.LG])
    Graph learning substantially contributes to solving artificial intelligence (AI) tasks in various graph-related domains such as social networks, biological networks, recommender systems, and computer vision. However, despite its unprecedented prevalence, addressing the dynamic evolution of graph data over time remains a challenge. In many real-world applications, graph data continuously evolves. Current graph learning methods that assume graph representation is complete before the training process begins are not applicable in this setting. This challenge in graph learning motivates the development of a continuous learning process called graph lifelong learning to accommodate the future and refine the previous knowledge in graph data. Unlike existing survey papers that focus on either lifelong learning or graph learning separately, this survey paper covers the motivations, potentials, state-of-the-art approaches (that are well categorized), and open issues of graph lifelong learning. We expect extensive research and development interest in this emerging field.
    Targeting occupant feedback using digital twins: Adaptive spatial-temporal thermal preference sampling to optimize personal comfort models. (arXiv:2202.10707v1 [cs.LG])
    Collecting intensive longitudinal thermal preference data from building occupants is emerging as an innovative means of characterizing the performance of buildings and the people who use them. These techniques have occupants giving subjective feedback using smartphones or smartwatches frequently over the course of days or weeks. The intention is that the data will be collected with high spatial and temporal diversity to best characterize a building and the occupant's preferences. But in reality, leaving the occupant to respond in an ad-hoc or fixed interval way creates unneeded survey fatigue and redundant data. This paper outlines a scenario-based (virtual experiment) method for optimizing data sampling using a smartwatch to achieve comparable accuracy in a personal thermal preference model with less data. This method uses BIM-extracted spatial data, and Graph Neural Network (GNN) based modeling to find regions of similar comfort preference to identify the best scenarios for triggering the occupant to give feedback. This method is compared to two baseline scenarios based on the spatial context of specific spaces and 4 x 4 m grid squares in the building using a theoretical implementation on two field-collected data sets. The results show that the proposed Build2Vec method is 18-23% more in the overall sampling quality than the spaces-based and the square-grid-based sampling methods. The Build2Vec method also has similar performance to the baselines when removing redundant occupant feedback points but with better scalability potential.
    Online Caching with Optimistic Learning. (arXiv:2202.10590v1 [cs.NI])
    The design of effective online caching policies is an increasingly important problem for content distribution networks, online social networks and edge computing services, among other areas. This paper proposes a new algorithmic toolbox for tackling this problem through the lens of optimistic online learning. We build upon the Follow-the-Regularized-Leader (FTRL) framework which is developed further here to include predictions for the file requests, and we design online caching algorithms for bipartite networks with fixed-size caches or elastic leased caches subject to time-average budget constraints. The predictions are provided by a content recommendation system that influences the users viewing activity, and hence can naturally reduce the caching network's uncertainty about future requests. We prove that the proposed optimistic learning caching policies can achieve sub-zero performance loss (regret) for perfect predictions, and maintain the best achievable regret bound $O(\sqrt T)$ even for arbitrary-bad predictions. The performance of the proposed algorithms is evaluated with detailed trace-driven numerical tests.
    CROMOSim: A Deep Learning-based Cross-modality Inertial Measurement Simulator. (arXiv:2202.10562v1 [eess.SP])
    With the prevalence of wearable devices, inertial measurement unit (IMU) data has been utilized in monitoring and assessment of human mobility such as human activity recognition (HAR). Training deep neural network (DNN) models for these tasks require a large amount of labeled data, which are hard to acquire in uncontrolled environments. To mitigate the data scarcity problem, we design CROMOSim, a cross-modality sensor simulator that simulates high fidelity virtual IMU sensor data from motion capture systems or monocular RGB cameras. It utilizes a skinned multi-person linear model (SMPL) for 3D body pose and shape representations, to enable simulation from arbitrary on-body positions. A DNN model is trained to learn the functional mapping from imperfect trajectory estimations in a 3D SMPL body tri-mesh due to measurement noise, calibration errors, occlusion and other modeling artifacts, to IMU data. We evaluate the fidelity of CROMOSim simulated data and its utility in data augmentation on various HAR datasets. Extensive experiment results show that the proposed model achieves a 6.7% improvement over baseline methods in a HAR task.
    Transition Matrix Representation of Trees with Transposed Convolutions. (arXiv:2202.10677v1 [cs.LG])
    How can we effectively find the best structures in tree models? Tree models have been favored over complex black box models in domains where interpretability is crucial for making irreversible decisions. However, searching for a tree structure that gives the best balance between the performance and the interpretability remains a challenging task. In this paper, we propose TART (Transition Matrix Representation with Transposed Convolutions), our novel generalized tree representation for optimal structural search. TART represents a tree model with a series of transposed convolutions that boost the speed of inference by avoiding the creation of transition matrices. As a result, TART allows one to search for the best tree structure with a few design parameters, achieving higher classification accuracy than those of baseline models in feature-based datasets.
    Connecting Optimization and Generalization via Gradient Flow Path Length. (arXiv:2202.10670v1 [stat.ML])
    Optimization and generalization are two essential aspects of machine learning. In this paper, we propose a framework to connect optimization with generalization by analyzing the generalization error based on the length of optimization trajectory under the gradient flow algorithm after convergence. Through our approach, we show that, with a proper initialization, gradient flow converges following a short path with an explicit length estimate. Such an estimate induces a length-based generalization bound, showing that short optimization paths after convergence are associated with good generalization, which also matches our numerical results. Our framework can be applied to broad settings. For example, we use it to obtain generalization estimates on three distinct machine learning models: underdetermined $\ell_p$ linear regression, kernel regression, and overparameterized two-layer ReLU neural networks.
    Effective Training Strategies for Deep-learning-based Precipitation Nowcasting and Estimation. (arXiv:2202.10555v1 [cs.CV])
    Deep learning has been successfully applied to precipitation nowcasting. In this work, we propose a pre-training scheme and a new loss function for improving deep-learning-based nowcasting. First, we adapt U-Net, a widely-used deep-learning model, for the two problems of interest here: precipitation nowcasting and precipitation estimation from radar images. We formulate the former as a classification problem with three precipitation intervals and the latter as a regression problem. For these tasks, we propose to pre-train the model to predict radar images in the near future without requiring ground-truth precipitation, and we also propose the use of a new loss function for fine-tuning to mitigate the class imbalance problem. We demonstrate the effectiveness of our approach using radar images and precipitation datasets collected from South Korea over seven years. It is highlighted that our pre-training scheme and new loss function improve the critical success index (CSI) of nowcasting of heavy rainfall (at least 10 mm/hr) by up to 95.7% and 43.6%, respectively, at a 5-hr lead time. We also demonstrate that our approach reduces the precipitation estimation error by up to 10.7%, compared to the conventional approach, for light rainfall (between 1 and 10 mm/hr). Lastly, we report the sensitivity of our approach to different resolutions and a detailed analysis of four cases of heavy rainfall.
    Gaussian Processes and Statistical Decision-making in Non-Euclidean Spaces. (arXiv:2202.10613v1 [stat.ML])
    Bayesian learning using Gaussian processes provides a foundational framework for making decisions in a manner that balances what is known with what could be learned by gathering data. In this dissertation, we develop techniques for broadening the applicability of Gaussian processes. This is done in two ways. Firstly, we develop pathwise conditioning techniques for Gaussian processes, which allow one to express posterior random functions as prior random functions plus a dependent update term. We introduce a wide class of efficient approximations built from this viewpoint, which can be randomly sampled once in advance, and evaluated at arbitrary locations without any subsequent stochasticity. This key property improves efficiency and makes it simpler to deploy Gaussian process models in decision-making settings. Secondly, we develop a collection of Gaussian process models over non-Euclidean spaces, including Riemannian manifolds and graphs. We derive fully constructive expressions for the covariance kernels of scalar-valued Gaussian processes on Riemannian manifolds and graphs. Building on these ideas, we describe a formalism for defining vector-valued Gaussian processes on Riemannian manifolds. The introduced techniques allow all of these models to be trained using standard computational methods. In total, these contributions make Gaussian processes easier to work with and allow them to be used within a wider class of domains in an effective and principled manner. This, in turn, makes it possible to potentially apply Gaussian processes to novel decision-making settings.
    MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned. (arXiv:2202.10583v1 [cs.LG])
    Reinforcement learning competitions advance the field by providing appropriate scope and support to develop solutions toward a specific problem. To promote the development of more broadly applicable methods, organizers need to enforce the use of general techniques, the use of sample-efficient methods, and the reproducibility of the results. While beneficial for the research community, these restrictions come at a cost -- increased difficulty. If the barrier for entry is too high, many potential participants are demoralized. With this in mind, we hosted the third edition of the MineRL ObtainDiamond competition, MineRL Diamond 2021, with a separate track in which we permitted any solution to promote the participation of newcomers. With this track and more extensive tutorials and support, we saw an increased number of submissions. The participants of this easier track were able to obtain a diamond, and the participants of the harder track progressed the generalizable solutions in the same task.
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v1 [cs.LG])
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.
    Knowledge Base Question Answering by Case-based Reasoning over Subgraphs. (arXiv:2202.10610v1 [cs.CL])
    Question answering (QA) over real-world knowledge bases (KBs) is challenging because of the diverse (essentially unbounded) types of reasoning patterns needed. However, we hypothesize in a large KB, reasoning patterns required to answer a query type reoccur for various entities in their respective subgraph neighborhoods. Leveraging this structural similarity between local neighborhoods of different subgraphs, we introduce a semiparametric model with (i) a nonparametric component that for each query, dynamically retrieves other similar $k$-nearest neighbor (KNN) training queries along with query-specific subgraphs and (ii) a parametric component that is trained to identify the (latent) reasoning patterns from the subgraphs of KNN queries and then apply it to the subgraph of the target query. We also propose a novel algorithm to select a query-specific compact subgraph from within the massive knowledge graph (KG), allowing us to scale to full Freebase KG containing billions of edges. We show that our model answers queries requiring complex reasoning patterns more effectively than existing KG completion algorithms. The proposed model outperforms or performs competitively with state-of-the-art models on several KBQA benchmarks.
    Strategic Ranking. (arXiv:2109.08240v2 [cs.GT] UPDATED)
    Strategic classification studies the design of a classifier robust to the manipulation of input by strategic individuals. However, the existing literature does not consider the effect of competition among individuals as induced by the algorithm design. Motivated by constrained allocation settings such as college admissions, we introduce strategic ranking, in which the (designed) individual reward depends on an applicant's post-effort rank in a measurement of interest. Our results illustrate how competition among applicants affects the resulting equilibria and model insights. We analyze how various ranking reward designs, belonging to a family of step functions, trade off applicant, school, and societal utility, as well as how ranking design counters inequities arising from disparate access to resources. In particular, we find that randomization in the reward design can mitigate two measures of disparate impact, welfare gap and access.
    Modality Fusion Network and Personalized Attention in Momentary Stress Detection in the Wild. (arXiv:2107.09510v3 [eess.SP] UPDATED)
    Multimodal wearable physiological data in daily life have been used to estimate self-reported stress labels. However, missing data modalities in data collection makes it challenging to leverage all the collected samples. Besides, heterogeneous sensor data and labels among individuals add challenges in building robust stress detection models. In this paper, we proposed a modality fusion network (MFN) to train models and infer self-reported binary stress labels under both complete and incomplete modality conditions. In addition, we applied personalized attention (PA) strategy to leverage personalized representation along with the generalized one-size-fits-all model. We evaluated our methods on a multimodal wearable sensor dataset (N=41) including galvanic skin response (GSR) and electrocardiogram (ECG). Compared to the baseline method using the samples with complete modalities, the performance of the MFN improved by 1.6% in f1-scores. On the other hand, the proposed PA strategy showed a 2.3% higher stress detection f1-score and approximately up to 70% reduction in personalized model parameter size (9.1 MB) compared to the previous state-of-the-art transfer learning strategy (29.3 MB).
    Nearest neighbor search with compact codes: A decoder perspective. (arXiv:2112.09568v2 [cs.CV] UPDATED)
    Modern approaches for fast retrieval of similar vectors on billion-scaled datasets rely on compressed-domain approaches such as binary sketches or product quantization. These methods minimize a certain loss, typically the mean squared error or other objective functions tailored to the retrieval problem. In this paper, we re-interpret popular methods such as binary hashing or product quantizers as auto-encoders, and point out that they implicitly make suboptimal assumptions on the form of the decoder. We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes, which translates to a better performance in nearest neighbor search. Our method significantly improves over binary hashing methods or product quantization on popular benchmarks.
    Click-Through Rate Prediction in Online Advertising: A Literature Review. (arXiv:2202.10462v1 [cs.IR])
    Predicting the probability that a user will click on a specific advertisement has been a prevalent issue in online advertising, attracting much research attention in the past decades. As a hot research frontier driven by industrial needs, recent years have witnessed more and more novel learning models employed to improve advertising CTR prediction. Although extant research provides necessary details on algorithmic design for addressing a variety of specific problems in advertising CTR prediction, the methodological evolution and connections between modeling frameworks are precluded. However, to the best of our knowledge, there are few comprehensive surveys on this topic. We make a systematic literature review on state-of-the-art and latest CTR prediction research, with a special focus on modeling frameworks. Specifically, we give a classification of state-of-the-art CTR prediction models in the extant literature, within which basic modeling frameworks and their extensions, advantages and disadvantages, and performance assessment for CTR prediction are presented. Moreover, we summarize CTR prediction models with respect to the complexity and the order of feature interactions, and performance comparisons on various datasets. Furthermore, we identify current research trends, main challenges and potential future directions worthy of further explorations. This review is expected to provide fundamental knowledge and efficient entry points for IS and marketing scholars who want to engage in this area.
    Batched Dueling Bandits. (arXiv:2202.10660v1 [cs.LG])
    The $K$-armed dueling bandit problem, where the feedback is in the form of noisy pairwise comparisons, has been widely studied. Previous works have only focused on the sequential setting where the policy adapts after every comparison. However, in many applications such as search ranking and recommendation systems, it is preferable to perform comparisons in a limited number of parallel batches. We study the batched $K$-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and stochastic triangle inequality. For both settings, we obtain algorithms with a smooth trade-off between the number of batches and regret. Our regret bounds match the best known sequential regret bounds (up to poly-logarithmic factors), using only a logarithmic number of batches. We complement our regret analysis with a nearly-matching lower bound. Finally, we also validate our theoretical results via experiments on synthetic and real data.
    Non-Volatile Memory Accelerated Posterior Estimation. (arXiv:2202.10522v1 [cs.LG])
    Bayesian inference allows machine learning models to express uncertainty. Current machine learning models use only a single learnable parameter combination when making predictions, and as a result are highly overconfident when their predictions are wrong. To use more learnable parameter combinations efficiently, these samples must be drawn from the posterior distribution. Unfortunately computing the posterior directly is infeasible, so often researchers approximate it with a well known distribution such as a Gaussian. In this paper, we show that through the use of high-capacity persistent storage, models whose posterior distribution was too big to approximate are now feasible, leading to improved predictions in downstream tasks.
    FLEA: Provably Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v3 [cs.LG] UPDATED)
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that - given enough data - FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half.
    Simultaneous Transport Evolution for Minimax Equilibria on Measures. (arXiv:2202.06460v2 [cs.LG] UPDATED)
    Min-max optimization problems arise in several key machine learning setups, including adversarial learning and generative modeling. In their general form, in absence of convexity/concavity assumptions, finding pure equilibria of the underlying two-player zero-sum game is computationally hard [Daskalakis et al., 2021]. In this work we focus instead in finding mixed equilibria, and consider the associated lifted problem in the space of probability measures. By adding entropic regularization, our main result establishes global convergence towards the global equilibrium by using simultaneous gradient ascent-descent with respect to the Wasserstein metric -- a dynamics that admits efficient particle discretization in high-dimensions, as opposed to entropic mirror descent. We complement this positive result with a related entropy-regularized loss which is not bilinear but still convex-concave in the Wasserstein geometry, and for which simultaneous dynamics do not converge yet timescale separation does. Taken together, these results showcase the benign geometry of bilinear games in the space of measures, enabling particle dynamics with global qualitative convergence guarantees.
    Reconstruction on Trees and Low-Degree Polynomials. (arXiv:2109.06915v2 [math.PR] UPDATED)
    The study of Markov processes and broadcasting on trees has deep connections to a variety of areas including statistical physics, graphical models, phylogenetic reconstruction, MCMC algorithms, and community detection in random graphs. Notably, the celebrated Belief Propagation (BP) algorithm achieves optimal performance for the reconstruction problem of predicting the value of the Markov process at the root of the tree from its values at the leaves. Recently, the analysis of low-degree polynomials has emerged as a valuable tool for predicting computational-to-statistical gaps. In this work, we investigate the performance of low-degree polynomials for the reconstruction problem. Perhaps surprisingly, we show that there are simple tree models of fixed arity $d$ and growing depth $\ell$ (so $N = 2^{\ell \log_2(d)}$ leaves) where (1) nontrivial reconstruction of the root value is possible with a simple polynomial time algorithm and with robustness to noise, but not with any polynomial of degree $2^{c \ell} = N^{c/\log_2(d)}$ for $c > 0$ a constant, and (2) when the tree is unknown and given multiple samples with correlated root assignments, nontrivial reconstruction of the root value is possible with a simple, noise-robust, and computationally efficient SQ algorithm but not with any polynomial of degree $2^{c \ell}$. These results clarify limitations of low-degree polynomials vs. polynomial time algorithms for Bayesian estimation problems. They also complement recent work of Moitra, Mossel, and Sandon who studied the circuit complexity of Belief Propagation. As a consequence of our main result, we show that for some $c' > 0$, $\exp(2^{c'\ell}) = \exp(N^{c'/\log_2(d)})$ many samples are needed for RBF kernel regression to obtain nontrivial correlation with the true regression function (BP). We pose related open questions about low-degree polynomials and the Kesten-Stigum threshold.
    Correlation Robust Influence Maximization. (arXiv:2010.14620v2 [cs.SI] UPDATED)
    We propose a distributionally robust model for the influence maximization problem. Unlike the classic independent cascade model \citep{kempe2003maximizing}, this model's diffusion process is adversarially adapted to the choice of seed set. Hence, instead of optimizing under the assumption that all influence relationships in the network are independent, we seek a seed set whose expected influence under the worst correlation, i.e. the "worst-case, expected influence", is maximized. We show that this worst-case influence can be efficiently computed, and though the optimization is NP-hard, a ($1 - 1/e$) approximation guarantee holds. We also analyze the structure to the adversary's choice of diffusion process, and contrast with established models. Beyond the key computational advantages, we also highlight the extent to which the independence assumption may cost optimality, and provide insights from numerical experiments comparing the adversarial and independent cascade model.
    Enhancing operations management through smart sensors: measuring and improving well-being, interaction and performance of logistics workers. (arXiv:2112.08213v2 [physics.soc-ph] UPDATED)
    Purpose The purpose of the research is to conduct an exploratory investigation of the material handling activities of an Italian logistics hub. Wearable sensors and other smart tools were used for collecting human and environmental features during working activities. These factors were correlated with workers' performance and well-being. Design/methodology/approach Human and environmental factors play an important role in operations management activities since they significantly influence employees' performance, well-being and safety. Surprisingly, empirical studies about the impact of such aspects on logistics operations are still very limited. Trying to fill this gap, the research empirically explores human and environmental factors affecting the performance of logistics workers exploiting smart tools. Findings Results suggest that human attitudes, interactions, emotions and environmental conditions remarkably influence workers' performance and well-being, however, showing different relationships depending on individual characteristics of each worker. Practical implications The authors' research opens up new avenues for profiling employees and adopting an individualized human resource management, providing managers with an operational system capable to potentially check and improve workers' well-being and performance. Originality/value The originality of the study comes from the in-depth exploration of human and environmental factors using body-worn sensors during work activities, by recording individual, collaborative and environmental data in real-time. To the best of the authors' knowledge, the current paper is the first time that such a detailed analysis has been carried out in real-world logistics operations.
    Online Learning for Orchestration of Inference in Multi-User End-Edge-Cloud Networks. (arXiv:2202.10541v1 [cs.LG])
    Deep-learning-based intelligent services have become prevalent in cyber-physical applications including smart cities and health-care. Deploying deep-learning-based intelligence near the end-user enhances privacy protection, responsiveness, and reliability. Resource-constrained end-devices must be carefully managed in order to meet the latency and energy requirements of computationally-intensive deep learning services. Collaborative end-edge-cloud computing for deep learning provides a range of performance and efficiency that can address application requirements through computation offloading. The decision to offload computation is a communication-computation co-optimization problem that varies with both system parameters (e.g., network condition) and workload characteristics (e.g., inputs). On the other hand, deep learning model optimization provides another source of tradeoff between latency and model accuracy. An end-to-end decision-making solution that considers such computation-communication problem is required to synergistically find the optimal offloading policy and model for deep learning services. To this end, we propose a reinforcement-learning-based computation offloading solution that learns optimal offloading policy considering deep learning model selection techniques to minimize response time while providing sufficient accuracy. We demonstrate the effectiveness of our solution for edge devices in an end-edge-cloud system and evaluate with a real-setup implementation using multiple AWS and ARM core configurations. Our solution provides 35% speedup in the average response time compared to the state-of-the-art with less than 0.9% accuracy reduction, demonstrating the promise of our online learning framework for orchestrating DL inference in end-edge-cloud systems.
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v1 [stat.ML])
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on a pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, simulated and real datasets obtained from ridesharing companies.
    Contrastive-mixup learning for improved speaker verification. (arXiv:2202.10672v1 [eess.AS])
    This paper proposes a novel formulation of prototypical loss with mixup for speaker verification. Mixup is a simple yet efficient data augmentation technique that fabricates a weighted combination of random data point and label pairs for deep neural network training. Mixup has attracted increasing attention due to its ability to improve robustness and generalization of deep neural networks. Although mixup has shown success in diverse domains, most applications have centered around closed-set classification tasks. In this work, we propose contrastive-mixup, a novel augmentation strategy that learns distinguishing representations based on a distance metric. During training, mixup operations generate convex interpolations of both inputs and virtual labels. Moreover, we have reformulated the prototypical loss function such that mixup is enabled on metric learning objectives. To demonstrate its generalization given limited training data, we conduct experiments by varying the number of available utterances from each speaker in the VoxCeleb database. Experimental results show that applying contrastive-mixup outperforms the existing baseline, reducing error rate by 16% relatively, especially when the number of training utterances per speaker is limited.
    Semi-Implicit Hybrid Gradient Methods with Application to Adversarial Robustness. (arXiv:2202.10523v1 [cs.LG])
    Adversarial examples, crafted by adding imperceptible perturbations to natural inputs, can easily fool deep neural networks (DNNs). One of the most successful methods for training adversarially robust DNNs is solving a nonconvex-nonconcave minimax problem with an adversarial training (AT) algorithm. However, among the many AT algorithms, only Dynamic AT (DAT) and You Only Propagate Once (YOPO) guarantee convergence to a stationary point. In this work, we generalize the stochastic primal-dual hybrid gradient algorithm to develop semi-implicit hybrid gradient methods (SI-HGs) for finding stationary points of nonconvex-nonconcave minimax problems. SI-HGs have the convergence rate $O(1/K)$, which improves upon the rate $O(1/K^{1/2})$ of DAT and YOPO. We devise a practical variant of SI-HGs, and show that it outperforms other AT algorithms in terms of convergence speed and robustness.
    It Takes Four to Tango: Multiagent Selfplay for Automatic Curriculum Generation. (arXiv:2202.10608v1 [cs.LG])
    We are interested in training general-purpose reinforcement learning agents that can solve a wide variety of goals. Training such agents efficiently requires automatic generation of a goal curriculum. This is challenging as it requires (a) exploring goals of increasing difficulty, while ensuring that the agent (b) is exposed to a diverse set of goals in a sample efficient manner and (c) does not catastrophically forget previously solved goals. We propose Curriculum Self Play (CuSP), an automated goal generation framework that seeks to satisfy these desiderata by virtue of a multi-player game with four agents. We extend the asymmetric curricula learning in PAIRED (Dennis et al., 2020) to a symmetrized game that carefully balances cooperation and competition between two off-policy student learners and two regret-maximizing teachers. CuSP additionally introduces entropic goal coverage and accounts for the non-stationary nature of the students, allowing us to automatically induce a curriculum that balances progressive exploration with anti-catastrophic exploitation. We demonstrate that our method succeeds at generating an effective curricula of goals for a range of control tasks, outperforming other methods at zero-shot test-time generalization to novel out-of-distribution goals.
    Feasibility Study of Multi-Site Split Learning for Privacy-Preserving Medical Systems under Data Imbalance Constraints in COVID-19, X-Ray, and Cholesterol Dataset. (arXiv:2202.10456v1 [cs.LG])
    It seems as though progressively more people are in the race to upload content, data, and information online; and hospitals haven't neglected this trend either. Hospitals are now at the forefront for multi-site medical data sharing to provide groundbreaking advancements in the way health records are shared and patients are diagnosed. Sharing of medical data is essential in modern medical research. Yet, as with all data sharing technology, the challenge is to balance improved treatment with protecting patient's personal information. This paper provides a novel split learning algorithm coined the term, "multi-site split learning", which enables a secure transfer of medical data between multiple hospitals without fear of exposing personal data contained in patient records. It also explores the effects of varying the number of end-systems and the ratio of data-imbalance on the deep learning performance. A guideline for the most optimal configuration of split learning that ensures privacy of patient data whilst achieving performance is empirically given. We argue the benefits of our multi-site split learning algorithm, especially regarding the privacy preserving factor, using CT scans of COVID-19 patients, X-ray bone scans, and cholesterol level medical data.
    Personalized PATE: Differential Privacy for Machine Learning with Individual Privacy Guarantees. (arXiv:2202.10517v1 [cs.LG])
    Applying machine learning (ML) to sensitive domains requires privacy protection of the underlying training data through formal privacy frameworks, such as differential privacy (DP). Yet, usually, the privacy of the training data comes at the costs of the resulting ML models' utility. One reason for this is that DP uses one homogeneous privacy budget epsilon for all training data points, which has to align with the strictest privacy requirement encountered among all data holders. In practice, different data holders might have different privacy requirements and data points of data holders with lower requirements could potentially contribute more information to the training process of the ML models. To account for this possibility, we propose three novel methods that extend the DP framework Private Aggregation of Teacher Ensembles (PATE) to support training an ML model with different personalized privacy guarantees within the training data. We formally describe the methods, provide theoretical analyses of their privacy bounds, and experimentally evaluate their effect on the final model's utility at the example of the MNIST and Adult income datasets. Our experiments show that our personalized privacy methods yield higher accuracy models than the non-personalized baseline. Thereby, our methods can improve the privacy-utility trade-off in scenarios in which different data holders consent to contribute their sensitive data at different privacy levels.
    Physics-Informed Graph Learning: A Survey. (arXiv:2202.10679v1 [cs.LG])
    An expeditious development of graph learning in recent years has found innumerable applications in several diversified fields. Of the main associated challenges are the volume and complexity of graph data. A lot of research has been evolving around the preservation of graph data in a low dimensional space. The graph learning models suffer from the inability to maintain original graph information. In order to compensate for this inability, physics-informed graph learning (PIGL) is emerging. PIGL incorporates physics rules while performing graph learning, which enables numerous potentials. This paper presents a systematic review of PIGL methods. We begin with introducing a unified framework of graph learning models, and then examine existing PIGL methods in relation to the unified framework. We also discuss several future challenges for PIGL. This survey paper is expected to stimulate innovative research and development activities pertaining to PIGL.
    Extragradient Method: $O(1/K)$ Last-Iterate Convergence for Monotone Variational Inequalities and Connections With Cocoercivity. (arXiv:2110.04261v2 [math.OC] UPDATED)
    Extragradient method (EG) (Korpelevich, 1976) is one of the most popular methods for solving saddle point and variational inequalities problems (VIP). Despite its long history and significant attention in the optimization community, there remain important open questions about convergence of EG. In this paper, we resolve one of such questions and derive the first last-iterate $O(1/K)$ convergence rate for EG for monotone and Lipschitz VIP without any additional assumptions on the operator unlike the only known result of this type (Golowich et al., 2020) that relies on the Lipschitzness of the Jacobian of the operator. The rate is given in terms of reducing the squared norm of the operator. Moreover, we establish several results on the (non-)cocoercivity of the update operators of EG, Optimistic Gradient Method, and Hamiltonian Gradient Method, when the original operator is monotone and Lipschitz.
    Differential Secrecy for Distributed Data and Applications to Robust Differentially Secure Vector Summation. (arXiv:2202.10618v1 [cs.CR])
    Computing the noisy sum of real-valued vectors is an important primitive in differentially private learning and statistics. In private federated learning applications, these vectors are held by client devices, leading to a distributed summation problem. Standard Secure Multiparty Computation (SMC) protocols for this problem are susceptible to poisoning attacks, where a client may have a large influence on the sum, without being detected. In this work, we propose a poisoning-robust private summation protocol in the multiple-server setting, recently studied in PRIO. We present a protocol for vector summation that verifies that the Euclidean norm of each contribution is approximately bounded. We show that by relaxing the security constraint in SMC to a differential privacy like guarantee, one can improve over PRIO in terms of communication requirements as well as the client-side computation. Unlike SMC algorithms that inevitably cast integers to elements of a large finite field, our algorithms work over integers/reals, which may allow for additional efficiencies.
    Guidelines and evaluation for clinical explainable AI on medical image analysis. (arXiv:2202.10553v1 [cs.LG])
    Explainable artificial intelligence (XAI) is essential for enabling clinical users to get informed decision support from AI and comply with evidence-based medical practice. Applying XAI in clinical settings requires proper evaluation criteria to ensure the explanation technique is both technically sound and clinically useful, but specific support is lacking to achieve this goal. To bridge the research gap, we propose the Clinical XAI Guidelines that consist of five criteria a clinical XAI needs to be optimized for. The guidelines recommend choosing an explanation form based on Guideline 1 (G1) Understandability and G2 Clinical relevance. For the chosen explanation form, its specific XAI technique should be optimized for G3 Truthfulness, G4 Informative plausibility, and G5 Computational efficiency. Following the guidelines, we conducted a systematic evaluation on a novel problem of multi-modal medical image explanation with two clinical tasks, and proposed new evaluation metrics accordingly. The evaluated 16 commonly-used heatmap XAI techniques were not suitable for clinical use due to their failure in \textbf{G3} and \textbf{G4}. Our evaluation demonstrated the use of Clinical XAI Guidelines to support the design and evaluation for clinically viable XAI.
    Order-Optimal Error Bounds for Noisy Kernel-Based Bayesian Quadrature. (arXiv:2202.10615v1 [stat.ML])
    In this paper, we study the sample complexity of {\em noisy Bayesian quadrature} (BQ), in which we seek to approximate an integral based on noisy black-box queries to the underlying function. We consider functions in a {\em Reproducing Kernel Hilbert Space} (RKHS) with the Mat\'ern-$\nu$ kernel, focusing on combinations of the parameter $\nu$ and dimension $d$ such that the RKHS is equivalent to a Sobolev class. In this setting, we provide near-matching upper and lower bounds on the best possible average error. Specifically, we find that when the black-box queries are subject to Gaussian noise having variance $\sigma^2$, any algorithm making at most $T$ queries (even with adaptive sampling) must incur a mean absolute error of $\Omega(T^{-\frac{\nu}{d}-1} + \sigma T^{-\frac{1}{2}})$, and there exists a non-adaptive algorithm attaining an error of at most $O(T^{-\frac{\nu}{d}-1} + \sigma T^{-\frac{1}{2}})$. Hence, the bounds are order-optimal, and establish that there is no adaptivity gap in terms of scaling laws.
    Deep Iterative Phase Retrieval for Ptychography. (arXiv:2202.10573v1 [eess.IV])
    One of the most prominent challenges in the field of diffractive imaging is the phase retrieval (PR) problem: In order to reconstruct an object from its diffraction pattern, the inverse Fourier transform must be computed. This is only possible given the full complex-valued diffraction data, i.e. magnitude and phase. However, in diffractive imaging, generally only magnitudes can be directly measured while the phase needs to be estimated. In this work we specifically consider ptychography, a sub-field of diffractive imaging, where objects are reconstructed from multiple overlapping diffraction images. We propose an augmentation of existing iterative phase retrieval algorithms with a neural network designed for refining the result of each iteration. For this purpose we adapt and extend a recently proposed architecture from the speech processing field. Evaluation results show the proposed approach delivers improved convergence rates in terms of both iteration count and algorithm runtime.
    Dynamic Relation Discovery and Utilization in Multi-Entity Time Series Forecasting. (arXiv:2202.10586v1 [cs.LG])
    Time series forecasting plays a key role in a variety of domains. In a lot of real-world scenarios, there exist multiple forecasting entities (e.g. power station in the solar system, stations in the traffic system). A straightforward forecasting solution is to mine the temporal dependency for each individual entity through 1d-CNN, RNN, transformer, etc. This approach overlooks the relations between these entities and, in consequence, loses the opportunity to improve performance using spatial-temporal relation. However, in many real-world scenarios, beside explicit relation, there could exist crucial yet implicit relation between entities. How to discover the useful implicit relation between entities and effectively utilize the relations for each entity under various circumstances is crucial. In order to mine the implicit relation between entities as much as possible and dynamically utilize the relation to improve the forecasting performance, we propose an attentional multi-graph neural network with automatic graph learning (A2GNN) in this work. Particularly, a Gumbel-softmax based auto graph learner is designed to automatically capture the implicit relation among forecasting entities. We further propose an attentional relation learner that enables every entity to dynamically pay attention to its preferred relations. Extensive experiments are conducted on five real-world datasets from three different domains. The results demonstrate the effectiveness of A2GNN beyond several state-of-the-art methods.
    Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses. (arXiv:2202.10453v1 [cs.CV])
    Although media content is increasingly produced, distributed, and consumed in multiple combinations of modalities, how individual modalities contribute to the perceived emotion of a media item remains poorly understood. In this paper we present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis to study how the auditory and visual modalities contribute to the perceived emotion of media. The data were collected by presenting music videos to participants in three conditions: music, visual, and audiovisual. Participants annotated the music videos for valence and arousal over time, as well as the overall emotion conveyed. We present detailed descriptive statistics for key measures in the dataset and the results of feature importance analyses for each condition. Finally, we propose a novel transfer learning architecture to train Predictive models Augmented with Isolated modality Ratings (PAIR) and demonstrate the potential of isolated modality ratings for enhancing multimodal emotion recognition. Our results suggest that perceptions of arousal are influenced primarily by auditory information, while perceptions of valence are more subjective and can be influenced by both visual and auditory information. The dataset is made publicly available.
    Debiasing Backdoor Attack: A Benign Application of Backdoor Attack in Eliminating Data Bias. (arXiv:2202.10582v1 [cs.CR])
    Backdoor attack is a new AI security risk that has emerged in recent years. Drawing on the previous research of adversarial attack, we argue that the backdoor attack has the potential to tap into the model learning process and improve model performance. Based on Clean Accuracy Drop (CAD) in backdoor attack, we found that CAD came out of the effect of pseudo-deletion of data. We provided a preliminary explanation of this phenomenon from the perspective of model classification boundaries and observed that this pseudo-deletion had advantages over direct deletion in the data debiasing problem. Based on the above findings, we proposed Debiasing Backdoor Attack (DBA). It achieves SOTA in the debiasing task and has a broader application scenario than undersampling.
    Towards technological adaptation of advanced farming through AI, IoT, and Robotics: A Comprehensive overview. (arXiv:2202.10459v1 [cs.AI])
    The population explosion of the 21st century has adversely affected the natural resources with restricted availability of cultivable land, increased average temperatures due to global warming, and carbon footprint resulting in a drastic increase in floods as well as droughts thus making food security significant anxiety for most countries. The traditional methods were no longer sufficient which paved the way for technological ascents such as a substantial rise in Artificial Intelligence (AI), Internet of Things (IoT), as well as Robotics that provides high productivity, functional efficiency, flexibility, cost-effectiveness in the domain of agriculture. AI, IoT, and Robotics-based devices and methods have produced new paradigms and opportunities in agriculture. AI's existing approaches are soil management, crop diseases identification, weed identification, and management in collaboration with IoT devices. IoT has utilized automatic agricultural operations and real-time monitoring with few personnel employed in real-time. The major existing applications of agricultural robotics are for the function of soil preparation, planting, monitoring, harvesting, and storage. In this paper, researchers have explored a comprehensive overview of recent implementation, scopes, opportunities, challenges, limitations, and future research instructions of AI, IoT, and Robotics based methodology in the agriculture sector.
    A Novel Architecture Slimming Method for Network Pruning and Knowledge Distillation. (arXiv:2202.10461v1 [cs.CV])
    Network pruning and knowledge distillation are two widely-known model compression methods that efficiently reduce computation cost and model size. A common problem in both pruning and distillation is to determine compressed architecture, i.e., the exact number of filters per layer and layer configuration, in order to preserve most of the original model capacity. In spite of the great advances in existing works, the determination of an excellent architecture still requires human interference or tremendous experimentations. In this paper, we propose an architecture slimming method that automates the layer configuration process. We start from the perspective that the capacity of the over-parameterized model can be largely preserved by finding the minimum number of filters preserving the maximum parameter variance per layer, resulting in a thin architecture. We formulate the determination of compressed architecture as a one-step orthogonal linear transformation, and integrate principle component analysis (PCA), where the variances of filters in the first several projections are maximized. We demonstrate the rationality of our analysis and the effectiveness of the proposed method through extensive experiments. In particular, we show that under the same overall compression rate, the compressed architecture determined by our method shows significant performance gain over baselines after pruning and distillation. Surprisingly, we find that the resulting layer-wise compression rates correspond to the layer sensitivities found by existing works through tremendous experimentations.
    Open-Set Support Vector Machines. (arXiv:1606.03802v11 [cs.LG] UPDATED)
    Often, when dealing with real-world recognition problems, we do not need, and often cannot have, knowledge of the entire set of possible classes that might appear during operational testing. In such cases, we need to think of robust classification methods able to deal with the "unknown" and properly reject samples belonging to classes never seen during training. Notwithstanding, existing classifiers to date were mostly developed for the closed-set scenario, i.e., the classification setup in which it is assumed that all test samples belong to one of the classes with which the classifier was trained. In the open-set scenario, however, a test sample can belong to none of the known classes and the classifier must properly reject it by classifying it as unknown. In this work, we extend upon the well-known Support Vector Machines (SVM) classifier and introduce the Open-Set Support Vector Machines (OSSVM), which is suitable for recognition in open-set setups. OSSVM balances the empirical risk and the risk of the unknown and ensures that the region of the feature space in which a test sample would be classified as known (one of the known classes) is always bounded, ensuring a finite risk of the unknown. In this work, we also highlight the properties of the SVM classifier related to the open-set scenario, and provide necessary and sufficient conditions for an RBF SVM to have bounded open-space risk.
    Moment Matching Deep Contrastive Latent Variable Models. (arXiv:2202.10560v1 [cs.LG])
    In the contrastive analysis (CA) setting, machine learning practitioners are specifically interested in discovering patterns that are enriched in a target dataset as compared to a background dataset generated from sources of variation irrelevant to the task at hand. For example, a biomedical data analyst may seek to understand variations in genomic data only present among patients with a given disease as opposed to those also present in healthy control subjects. Such scenarios have motivated the development of contrastive latent variable models to isolate variations unique to these target datasets from those shared across the target and background datasets, with current state of the art models based on the variational autoencoder (VAE) framework. However, previously proposed models do not explicitly enforce the constraints on latent variables underlying CA, potentially leading to the undesirable leakage of information between the two sets of latent variables. Here we propose the moment matching contrastive VAE (MM-cVAE), a reformulation of the VAE for CA that uses the maximum mean discrepancy to explicitly enforce two crucial latent variable constraints underlying CA. On three challenging CA tasks we find that our method outperforms the previous state-of-the-art both qualitatively and on a set of quantitative metrics.
    A Classical-Quantum Convolutional Neural Network for Detecting Pneumonia from Chest Radiographs. (arXiv:2202.10452v1 [cs.CV])
    While many quantum computing techniques for machine learning have been proposed, their performance on real-world datasets remains to be studied. In this paper, we explore how a variational quantum circuit could be integrated into a classical neural network for the problem of detecting pneumonia from chest radiographs. We substitute one layer of a classical convolutional neural network with a variational quantum circuit to create a hybrid neural network. We train both networks on an image dataset containing chest radiographs and benchmark their performance. To mitigate the influence of different sources of randomness in network training, we sample the results over multiple rounds. We show that the hybrid network outperforms the classical network on different performance measures, and that these improvements are statistically significant. Our work serves as an experimental demonstration of the potential of quantum computing to significantly improve neural network performance for real-world, non-trivial problems relevant to society and industry.
    Robust and Provable Guarantees for Sparse Random Embeddings. (arXiv:2202.10815v1 [cs.LG])
    In this work, we improve upon the guarantees for sparse random embeddings, as they were recently provided and analyzed by Freksen at al. (NIPS'18) and Jagadeesan (NIPS'19). Specifically, we show that (a) our bounds are explicit as opposed to the asymptotic guarantees provided previously, and (b) our bounds are guaranteed to be sharper by practically significant constants across a wide range of parameters, including the dimensionality, sparsity and dispersion of the data. Moreover, we empirically demonstrate that our bounds significantly outperform prior works on a wide range of real-world datasets, such as collections of images, text documents represented as bags-of-words, and text sequences vectorized by neural embeddings. Behind our numerical improvements are techniques of broader interest, which improve upon key steps of previous analyses in terms of (c) tighter estimates for certain types of quadratic chaos, (d) establishing extreme properties of sparse linear forms, and (e) improvements on bounds for the estimation of sums of independent random variables.
    Single-Leg Revenue Management with Advice. (arXiv:2202.10939v1 [cs.GT])
    Single-leg revenue management is a foundational problem of revenue management that has been particularly impactful in the airline and hotel industry: Given $n$ units of a resource, e.g. flight seats, and a stream of sequentially-arriving customers segmented by fares, what is the optimal online policy for allocating the resource. Previous work focused on designing algorithms when forecasts are available, which are not robust to inaccuracies in the forecast, or online algorithms with worst-case performance guarantees, which can be too conservative in practice. In this work, we look at the single-leg revenue management problem through the lens of the algorithms-with-advice framework, which attempts to optimally incorporate advice/predictions about the future into online algorithms. In particular, we characterize the Pareto frontier that captures the tradeoff between consistency (performance when advice is accurate) and competitiveness (performance when advice is inaccurate) for every advice. Moreover, we provide an online algorithm that always achieves performance on this Pareto frontier. We also study the class of protection level policies, which is the most widely-deployed technique for single-leg revenue management: we provide an algorithm to incorporate advice into protection levels that optimally trades off consistency and competitiveness. Moreover, we empirically evaluate the performance of these algorithms on synthetic data. We find that our algorithm for protection level policies performs remarkably well on most instances, even if it is not guaranteed to be on the Pareto frontier in theory.
    An adaptive neuro-fuzzy model for attitude estimation and 2 control a 3 DOF system. (arXiv:2004.09756v3 [eess.SY] UPDATED)
    In recent decades, one of the scientists' main concerns has been to improve the accuracy of satellite attitude, regardless of the expense. The obvious result is that a large number of control strategies have been used to address this problem. In this study, an adaptive neuro-fuzzy integrated (ANFIS) satellite attitude estimation and control system was developed. The controller is trained with the data provided by an optimal controller. A pulse modulator is used to generate the right ON/OFF commands of the thruster actuator. To evaluate the performance of the AN-FIS controller in closed-loop simulation, an ANFIS observer is used to estimate the attitude and angular velocities of the satellite using magnetometer, sun sensor and data gyro data. In addition, a new ANFIS system will be proposed and evaluated that can jointly control and estimate the system. The performance of the ANFIS controller is compared to the optimal PID controller in a Monte Carlo simulation with different initial conditions, disturbance and noise. The results show that the ANFIS controller can surpass the optimal PID controller in several aspects, including time and smoothness. In addition, the ANFIS estimator is examined and the results demonstrate the high ability of this designated observers. Both the control and estimation phases are simulated by a single ANFIS subsystem, taking into account the high capacity of ANFIS, and the results of using the ANFIS model are demonstrated.
    Black-box Certification and Learning under Adversarial Perturbations. (arXiv:2006.16520v2 [stat.ML] UPDATED)
    We formally study the problem of classification under adversarial perturbations from a learner's perspective as well as a third-party who aims at certifying the robustness of a given black-box classifier. We analyze a PAC-type framework of semi-supervised learning and identify possibility and impossibility results for proper learning of VC-classes in this setting. We further introduce a new setting of black-box certification under limited query budget, and analyze this for various classes of predictors and perturbation. We also consider the viewpoint of a black-box adversary that aims at finding adversarial examples, showing that the existence of an adversary with polynomial query complexity can imply the existence of a sample efficient robust learner.
    A Graph-based U-Net Model for Predicting Traffic in unseen Cities. (arXiv:2202.06725v2 [cs.LG] UPDATED)
    Accurate traffic prediction is a key ingredient to enable traffic management like rerouting cars to reduce road congestion or regulating traffic via dynamic speed limits to maintain a steady flow. A way to represent traffic data is in the form of temporally changing heatmaps visualizing attributes of traffic, such as speed and volume. In recent works, U-Net models have shown SOTA performance on traffic forecasting from heatmaps. We propose to combine the U-Net architecture with graph layers which improves spatial generalization to unseen road networks compared to a Vanilla U-Net. In particular, we specialize existing graph operations to be sensitive to geographical topology and generalize pooling and upsampling operations to be applicable to graphs.
    Increasing Depth of Neural Networks for Life-long Learning. (arXiv:2202.10821v1 [cs.LG])
    Increasing neural network depth is a well-known method for improving neural network performance. Modern deep architectures contain multiple mechanisms that allow hundreds or even thousands of layers to train. This work is trying to answer if extending neural network depth may be beneficial in a life-long learning setting. In particular, we propose a novel method based on adding new layers on top of existing ones to enable the forward transfer of knowledge and adapting previously learned representations for new tasks. We utilize a method of determining the most similar tasks for selecting the best location in our network to add new nodes with trainable parameters. This approach allows for creating a tree-like model, where each node is a set of neural network parameters dedicated to a specific task. The proposed method is inspired by Progressive Neural Network (PNN) concept, therefore it is rehearsal-free and benefits from dynamic change of network structure. However, it requires fewer parameters per task than PNN. Experiments on Permuted MNIST and SplitCIFAR show that the proposed algorithm is on par with other continual learning methods. We also perform ablation studies to clarify the contributions of each system part.
    Statistical and Spatio-temporal Hand Gesture Features for Sign Language Recognition using the Leap Motion Sensor. (arXiv:2202.11005v1 [cs.CV])
    In modern society, people should not be identified based on their disability, rather, it is environments that can disable people with impairments. Improvements to automatic Sign Language Recognition (SLR) will lead to more enabling environments via digital technology. Many state-of-the-art approaches to SLR focus on the classification of static hand gestures, but communication is a temporal activity, which is reflected by many of the dynamic gestures present. Given this, temporal information during the delivery of a gesture is not often considered within SLR. The experiments in this work consider the problem of SL gesture recognition regarding how dynamic gestures change during their delivery, and this study aims to explore how single types of features as well as mixed features affect the classification ability of a machine learning model. 18 common gestures recorded via a Leap Motion Controller sensor provide a complex classification problem. Two sets of features are extracted from a 0.6 second time window, statistical descriptors and spatio-temporal attributes. Features from each set are compared by their ANOVA F-Scores and p-values, arranged into bins grown by 10 features per step to a limit of the 250 highest-ranked features. Results show that the best statistical model selected 240 features and scored 85.96% accuracy, the best spatio-temporal model selected 230 features and scored 80.98%, and the best mixed-feature model selected 240 features from each set leading to a classification accuracy of 86.75%. When all three sets of results are compared (146 individual machine learning models), the overall distribution shows that the minimum results are increased when inputs are any number of mixed features compared to any number of either of the two single sets of features.
    Incentive Mechanism Design for Joint Resource Allocation in Blockchain-based Federated Learning. (arXiv:2202.10938v1 [cs.CR])
    Blockchain-based federated learning (BCFL) has recently gained tremendous attention because of its advantages such as decentralization and privacy protection of raw data. However, there has been few research focusing on the allocation of resources for clients in BCFL. In the BCFL framework where the FL clients and the blockchain miners are the same devices, clients broadcast the trained model updates to the blockchain network and then perform mining to generate new blocks. Since each client has a limited amount of computing resources, the problem of allocating computing resources into training and mining needs to be carefully addressed. In this paper, we design an incentive mechanism to assign each client appropriate rewards for training and mining, and then the client will determine the amount of computing power to allocate for each subtask based on these rewards using the two-stage Stackelberg game. After analyzing the utilities of the model owner (MO) (i.e., the BCFL task publisher) and clients, we transform the game model into two optimization problems, which are sequentially solved to derive the optimal strategies for both the MO and clients. Further, considering the fact that local training related information of each client may not be known by others, we extend the game model with analytical solutions to the incomplete information scenario. Extensive experimental results demonstrate the validity of our proposed schemes.
    Differentially Private Estimation of Heterogeneous Causal Effects. (arXiv:2202.11043v1 [stat.ML])
    Estimating heterogeneous treatment effects in domains such as healthcare or social science often involves sensitive data where protecting privacy is important. We introduce a general meta-algorithm for estimating conditional average treatment effects (CATE) with differential privacy (DP) guarantees. Our meta-algorithm can work with simple, single-stage CATE estimators such as S-learner and more complex multi-stage estimators such as DR and R-learner. We perform a tight privacy analysis by taking advantage of sample splitting in our meta-algorithm and the parallel composition property of differential privacy. In this paper, we implement our approach using DP-EBMs as the base learner. DP-EBMs are interpretable, high-accuracy models with privacy guarantees, which allow us to directly observe the impact of DP noise on the learned causal model. Our experiments show that multi-stage CATE estimators incur larger accuracy loss than single-stage CATE or ATE estimators and that most of the accuracy loss from differential privacy is due to an increase in variance, not biased estimates of treatment effects.
    A Clustering Preserving Transformation for k-Means Algorithm Output. (arXiv:2202.10455v1 [cs.LG])
    This note introduces a novel clustering preserving transformation of cluster sets obtained from $k$-means algorithm. This transformation may be used to generate new labeled data{}sets from existent ones. It is more flexible that Kleinberg axiom based consistency transformation because data points in a cluster can be moved away and datapoints between clusters may come closer together.
    A Review of Affective Generation Models. (arXiv:2202.10763v1 [cs.LG])
    Affective computing is an emerging interdisciplinary field where computational systems are developed to analyze, recognize, and influence the affective states of a human. It can generally be divided into two subproblems: affective recognition and affective generation. Affective recognition has been extensively reviewed multiple times in the past decade. Affective generation, however, lacks a critical review. Therefore, we propose to provide a comprehensive review of affective generation models, as models are most commonly leveraged to affect others' emotional states. Affective computing has gained momentum in various fields and applications, thanks to the leap of machine learning, especially deep learning since 2015. With critical models introduced, this work is believed to benefit future research on affective generation. We conclude this work with a brief discussion on existing challenges.
    Adaptive Cholesky Gaussian Processes. (arXiv:2202.10769v1 [cs.LG])
    We present a method to fit exact Gaussian process models to large datasets by considering only a subset of the data. Our approach is novel in that the size of the subset is selected on the fly during exact inference with little computational overhead. From an empirical observation that the log-marginal likelihood often exhibits a linear trend once a sufficient subset of a dataset has been observed, we conclude that many large datasets contain redundant information that only slightly affects the posterior. Based on this, we provide probabilistic bounds on the full model evidence that can identify such subsets. Remarkably, these bounds are largely composed of terms that appear in intermediate steps of the standard Cholesky decomposition, allowing us to modify the algorithm to adaptively stop the decomposition once enough data have been observed. Empirically, we show that our method can be directly plugged into well-known inference schemes to fit exact Gaussian process models to large datasets.
    A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets. (arXiv:2202.10574v1 [stat.ML])
    The two-sided markets such as ride-sharing companies often involve a group of subjects who are making sequential decisions across time and/or location. With the rapid development of smart phones and internet of things, they have substantially transformed the transportation landscape of human beings. In this paper we consider large-scale fleet management in ride-sharing companies that involve multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high-dimensionality of state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of the proposed method is available at https://github.com/RunzheStat/CausalMARL.
    Variational Neural Temporal Point Process. (arXiv:2202.10585v1 [cs.LG])
    A temporal point process is a stochastic process that predicts which type of events is likely to happen and when the event will occur given a history of a sequence of events. There are various examples of occurrence dynamics in the daily life, and it is important to train the temporal dynamics and solve two different prediction problems, time and type predictions. Especially, deep neural network based models have outperformed the statistical models, such as Hawkes processes and Poisson processes. However, many existing approaches overfit to specific events, instead of learning and predicting various event types. Therefore, such approaches could not cope with the modified relationships between events and fail to predict the intensity functions of temporal point processes very well. In this paper, to solve these problems, we propose a variational neural temporal point process (VNTPP). We introduce the inference and the generative networks, and train a distribution of latent variable to deal with stochastic property on deep neural network. The intensity functions are computed using the distribution of latent variable so that we can predict event types and the arrival times of the events more accurately. We empirically demonstrate that our model can generalize the representations of various event types. Moreover, we show quantitatively and qualitatively that our model outperforms other deep neural network based models and statistical processes on synthetic and real-world datasets.
    Equivariant Graph Hierarchy-Based Neural Networks. (arXiv:2202.10643v1 [cs.LG])
    Equivariant Graph neural Networks (EGNs) are powerful in characterizing the dynamics of multi-body physical systems. Existing EGNs conduct flat message passing, which, yet, is unable to capture the spatial/dynamical hierarchy for complex systems particularly, limiting substructure discovery and global information fusion. In this paper, we propose Equivariant Hierarchy-based Graph Networks (EGHNs) which consist of the three key components: generalized Equivariant Matrix Message Passing (EMMP) , E-Pool and E-UpPool. In particular, EMMP is able to improve the expressivity of conventional equivariant message passing, E-Pool assigns the quantities of the low-level nodes into high-level clusters, while E-UpPool leverages the high-level information to update the dynamics of the low-level nodes. As their names imply, both E-Pool and E-UpPool are guaranteed to be equivariant to meet physic symmetry. Considerable experimental evaluations verify the effectiveness of our EGHN on several applications including multi-object dynamics simulation, motion capture, and protein dynamics modeling.
    Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning. (arXiv:2202.10678v1 [cs.AI])
    In today's economy, it becomes important for Internet platforms to consider the sequential information design problem to align its long term interest with incentives of the gig service providers. This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs), where a sender, with informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximizes the sender's cumulative utilities in a finite horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge in finding a signaling policy that is simultaneously persuasive to the myopic receivers and inducing the optimal long-term cumulative utilities of the sender. Nevertheless, in the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation. Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with with the underlying MPP, without the knowledge of the sender's utility functions, prior distributions, and the Markov transition kernels. We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles. Our algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large space of outcomes and states via function approximation, and we showcase such a success under the linear setting.
    Reward-Free Policy Space Compression for Reinforcement Learning. (arXiv:2202.11079v1 [cs.LG])
    In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies, the policy space, typically represented by a family of parametric functions. Dealing with such a policy space is a hefty challenge, which often causes sample and computation inefficiencies. However, we argue that a limited number of policies are actually relevant when we also account for the structure of the environment and of the policy parameterization, as many of them would induce very similar interactions, i.e., state-action distributions. In this paper, we seek for a reward-free compression of the policy space into a finite set of representative policies, such that, given any policy $\pi$, the minimum R\'enyi divergence between the state-action distributions of the representative policies and the state-action distribution of $\pi$ is bounded. We show that this compression of the policy space can be formulated as a set cover problem, and it is inherently NP-hard. Nonetheless, we propose a game-theoretic reformulation for which a locally optimal solution can be efficiently found by iteratively stretching the compressed space to cover an adversarial policy. Finally, we provide an empirical evaluation to illustrate the compression procedure in simple domains, and its ripple effects in reinforcement learning.
    Accelerating Primal-dual Methods for Regularized Markov Decision Processes. (arXiv:2202.10506v1 [math.OC])
    Entropy regularized Markov decision processes have been widely used in reinforcement learning. This paper is concerned with the primal-dual formulation of the entropy regularized problems. Standard first-order methods suffer from slow convergence due to the lack of strict convexity and concavity. To address this issue, we first introduce a new quadratically convexified primal-dual formulation. The natural gradient ascent descent of the new formulation enjoys global convergence guarantee and exponential convergence rate. We also propose a new interpolating metric that further accelerates the convergence significantly. Numerical results are provided to demonstrate the performance of the proposed methods under multiple settings.
    On Uncertainty Estimation by Tree-based Surrogate Models in Sequential Model-based Optimization. (arXiv:2202.10669v1 [stat.ML])
    Sequential model-based optimization sequentially selects a candidate point by constructing a surrogate model with the history of evaluations, to solve a black-box optimization problem. Gaussian process (GP) regression is a popular choice as a surrogate model, because of its capability of calculating prediction uncertainty analytically. On the other hand, an ensemble of randomized trees is another option and has practical merits over GPs due to its scalability and easiness of handling continuous/discrete mixed variables. In this paper we revisit various ensembles of randomized trees to investigate their behavior in the perspective of prediction uncertainty estimation. Then, we propose a new way of constructing an ensemble of randomized trees, referred to as BwO forest, where bagging with oversampling is employed to construct bootstrapped samples that are used to build randomized trees with random splitting. Experimental results demonstrate the validity and good performance of BwO forest over existing tree-based models in various circumstances.
    Unleashing the Power of Transformer for Graphs. (arXiv:2202.10581v1 [cs.LG])
    Despite recent successes in natural language processing and computer vision, Transformer suffers from the scalability problem when dealing with graphs. The computational complexity is unacceptable for large-scale graphs, e.g., knowledge graphs. One solution is to consider only the near neighbors, which, however, will lose the key merit of Transformer to attend to the elements at any distance. In this paper, we propose a new Transformer architecture, named dual-encoding Transformer (DET). DET has a structural encoder to aggregate information from connected neighbors and a semantic encoder to focus on semantically useful distant nodes. In comparison with resorting to multi-hop neighbors, DET seeks the desired distant neighbors via self-supervised training. We further find these two encoders can be incorporated to boost each others' performance. Our experiments demonstrate DET has achieved superior performance compared to the respective state-of-the-art methods in dealing with molecules, networks and knowledge graphs with various sizes.
    ABAW: Valence-Arousal Estimation, Expression Recognition, Action Unit Detection & Multi-Task Learning Challenges. (arXiv:2202.10659v1 [cs.CV])
    This paper describes the third Affective Behavior Analysis in-the-wild (ABAW) Competition, held in conjunction with IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2022. The 3rd ABAW Competition is a continuation of the Competitions held at ICCV 2021, IEEE FG 2020 and IEEE CVPR 2017 Conferences, and aims at automatically analyzing affect. This year the Competition encompasses four Challenges: i) uni-task Valence-Arousal Estimation, ii) uni-task Expression Classification, iii) uni-task Action Unit Detection, and iv) Multi-Task-Learning. All the Challenges are based on a common benchmark database, Aff-Wild2, which is a large scale in-the-wild database and the first one to be annotated in terms of valence-arousal, expressions and action units. In this paper, we present the four Challenges, with the utilized Competition corpora, we outline the evaluation metrics and present the baseline systems along with their obtained results.  ( 2 min )
    Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning. (arXiv:2202.10629v1 [cs.LG])
    In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models and can even learn general task-agnostic representations for efficient finetuning to downstream tasks. However, deep learning in resource-limited domains still faces the following challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning. This paper introduces a new technique called model reprogramming to bridge this gap. Model reprogramming enables resource-efficient cross-domain machine learning by repurposing and reusing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning, where the source and target domains can be vastly different. In many applications, model reprogramming outperforms transfer learning and training from scratch. This paper elucidates the methodology of model reprogramming, summarizes existing use cases, provides a theoretical explanation on the success of model reprogramming, and concludes with a discussion on open-ended research questions and opportunities. A list of model reprogramming studies is actively maintained and updated at https://github.com/IBM/model-reprogramming.  ( 2 min )
    CD-ROM: Complementary Deep-Reduced Order Model. (arXiv:2202.10746v1 [physics.flu-dyn])
    Model order reduction through the POD-Galerkin method can lead to dramatic gains in terms of computational efficiency in solving physical problems. However, the applicability of the method to non linear high-dimensional dynamical systems such as the Navier-Stokes equations has been shown to be limited, producing inaccurate and sometimes unstable models. This paper proposes a closure modeling approach for classical POD-Galerkin reduced order models (ROM). We use multi layer perceptrons (MLP) to learn a continuous in time closure model through the recently proposed Neural ODE method. Inspired by Taken's theorem as well as the Mori-Zwanzig formalism, we augment ROMs with a delay differential equation architecture to model non-Markovian effects in reduced models. The proposed model, called CD-ROM (Complementary Deep-Reduced Order Model) is able to retain information from past states of the system and use it to correct the imperfect reduced dynamics. The model can be integrated in time as a system of ordinary differential equations using any classical time marching scheme. We demonstrate the ability of our CD-ROM approach to improve the accuracy of POD-Galerkin models on two CFD examples, even in configurations unseen during training.  ( 2 min )
    Privacy Leakage of Adversarial Training Models in Federated Learning Systems. (arXiv:2202.10546v1 [cs.LG])
    Adversarial Training (AT) is crucial for obtaining deep neural networks that are robust to adversarial attacks, yet recent works found that it could also make models more vulnerable to privacy attacks. In this work, we further reveal this unsettling property of AT by designing a novel privacy attack that is practically applicable to the privacy-sensitive Federated Learning (FL) systems. Using our method, the attacker can exploit AT models in the FL system to accurately reconstruct users' private training images even when the training batch size is large. Code is available at https://github.com/zjysteven/PrivayAttack_AT_FL.
    Imbalanced Classification via Explicit Gradient Learning From Augmented Data. (arXiv:2202.10550v1 [cs.LG])
    Learning from imbalanced data is one of the most significant challenges in real-world classification tasks. In such cases, neural networks performance is substantially impaired due to preference towards the majority class. Existing approaches attempt to eliminate the bias through data re-sampling or re-weighting the loss in the learning process. Still, these methods tend to overfit the minority samples and perform poorly when the structure of the minority class is highly irregular. Here, we propose a novel deep meta-learning technique to augment a given imbalanced dataset with new minority instances. These additional data are incorporated in the classifier's deep-learning process, and their contributions are learned explicitly. The advantage of the proposed method is demonstrated on synthetic and real-world datasets with various imbalance ratios.
    Ligandformer: A Graph Neural Network for Predicting Ligand Property with Robust Interpretation. (arXiv:2202.10873v1 [q-bio.BM])
    Robust and efficient interpretation of QSAR methods is quite useful to validate AI prediction rationales with subjective opinion (chemist or biologist expertise), understand sophisticated chemical or biological process mechanisms, and provide heuristic ideas for structure optimization in pharmaceutical industry. For this purpose, we construct a multi-layer self-attention based Graph Neural Network framework, namely Ligandformer, for predicting ligand property with interpretation. Ligandformer integrates attention maps on ligand structure from different network blocks. The integrated attention map reflects the machine's local interest on compound structure, and indicates the relationship between predicted compound property and its structure. This work mainly contributes to three aspects: 1. Ligandformer directly opens the black-box of deep learning methods, providing local prediction rationales on chemical structures. 2. Ligandformer gives robust prediction in different experimental rounds, overcoming the ubiquitous prediction instability of deep learning methods. 3. Ligandformer can be generalized to predict different chemical or biological properties with high performance. Furthermore, Ligandformer can simultaneously output specific property score and visible attention map on structure, which can support researchers to investigate chemical or biological property and optimize structure efficiently. Our framework outperforms over counterparts in terms of accuracy, robustness and generalization, and can be applied in complex system study.  ( 2 min )
    Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations. (arXiv:2202.10638v1 [stat.ML])
    Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the correct data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation on image datasets.  ( 2 min )
  • Open

    Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits. (arXiv:2106.14866v2 [stat.ML] UPDATED)
    We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy, and thereby suffer from an identifiability issue. In contrast, we propose to leverage the demonstrator's behavior en route to optimality, and in particular, the exploration phase, for reward estimation. We begin by establishing a general information-theoretic lower bound under this paradigm that applies to any demonstrator algorithm, which characterizes a fundamental tradeoff between reward estimation and the amount of exploration of the demonstrator. Then, we develop simple and efficient reward estimators for upper-confidence-based demonstrator algorithms that attain the optimal tradeoff, showing in particular that consistent reward estimation -- free of identifiability issues -- is possible under our paradigm. Extensive simulations on both synthetic and semi-synthetic data corroborate our theoretical results.
    Permutation-equivariant and Proximity-aware Graph Neural Networks with Stochastic Message Passing. (arXiv:2009.02562v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are emerging machine learning models on graphs. Permutation-equivariance and proximity-awareness are two important properties highly desirable for GNNs. Both properties are needed to tackle some challenging graph problems, such as finding communities and leaders. In this paper, we first analytically show that the existing GNNs, mostly based on the message-passing mechanism, cannot simultaneously preserve the two properties. Then, we propose Stochastic Message Passing (SMP) model, a general and simple GNN to maintain both proximity-awareness and permutation-equivariance. In order to preserve node proximities, we augment the existing GNNs with stochastic node representations. We theoretically prove that the mechanism can enable GNNs to preserve node proximities, and at the same time, maintain permutation-equivariance with certain parametrization. We report extensive experimental results on ten datasets and demonstrate the effectiveness and efficiency of SMP for various typical graph mining tasks, including graph reconstruction, node classification, and link prediction.
    Causal inference for observational longitudinal studies using deep survival models. (arXiv:2101.10643v11 [stat.ML] UPDATED)
    Causal inference for observational longitudinal studies often requires the accurate estimation of treatment effects on time-to-event outcomes in the presence of time-dependent patient history. To tackle this longitudinal treatment effect estimation problem, we have developed a causal dynamic survival (CDS) model that uses the potential outcomes framework with an ensemble of recurrent subnetworks to estimate the difference in survival probabilities and its confidence interval over time as a function of past, time-dependent covariates and treatments. Using simulated survival datasets, the CDS model showed good causal effect estimation performance across scenarios of varying sample dimensions, event rates, confounding and overlapping. However, increasing the sample size was not effective in alleviating the adverse impact of a high level of confounding. In two large clinical cohort studies, CDS identified the expected conditional average treatment effect and detected individual treatment effect heterogeneity over time. CDS provides an efficient way to estimate and update individualized treatment effects over time, in order to improve clinical decisions.
    Gaussian Imagination in Bandit Learning. (arXiv:2201.01902v3 [cs.LG] UPDATED)
    Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We study the performance of an agent that attains a bounded information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function when applied instead to a Bernoulli bandit. Relative to an information-theoretic bound on the Bayesian regret the agent would incur when interacting with the Gaussian bandit, we bound the increase in regret when the agent interacts with the Bernoulli bandit. If the Gaussian prior distribution and likelihood function are sufficiently diffuse, this increase grows at a rate which is at most linear in the square-root of the time horizon, and thus the per-timestep increase vanishes. Our results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
    Harmless interpolation in regression and classification with structured features. (arXiv:2111.05198v2 [stat.ML] UPDATED)
    Overparametrized neural networks tend to perfectly fit noisy training data yet generalize well on test data. Inspired by this empirical observation, recent work has sought to understand this phenomenon of benign overfitting or harmless interpolation in the much simpler linear model. Previous theoretical work critically assumes that either the data features are statistically independent or the input data is high-dimensional; this precludes general nonparametric settings with structured feature maps. In this paper, we present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space. A key contribution is that our framework describes precise sufficient conditions on the data Gram matrix under which harmless interpolation occurs. Our results recover prior independent-features results (with a much simpler analysis), but they furthermore show that harmless interpolation can occur in more general settings such as features that are a bounded orthonormal system. Furthermore, our results show an asymptotic separation between classification and regression performance in a manner that was previously only shown for Gaussian features.
    Continuous Treatment Recommendation with Deep Survival Dose Response Function. (arXiv:2108.10453v2 [stat.ML] UPDATED)
    We propose a general formulation for continuous treatment recommendation problems in settings with clinical survival data, which we call the Deep Survival Dose Response Function (DeepSDRF). That is, we consider the problem of learning the conditional average dose response (CADR) function solely from historical data in which unobserved factors (confounders) affect both observed treatment and time-to-event outcomes. The estimated treatment effect from DeepSDRF enables us to develop recommender algorithms with explanatory insights. We compared two recommender approaches based on random search and reinforcement learning and found similar performance in terms of patient outcome. We tested the DeepSDRF and the corresponding recommender on extensive simulation studies and two empirical databases: 1) the Clinical Practice Research Datalink (CPRD) and 2) the eICU Research Institute (eRI) database. To the best of our knowledge, this is the first time that confounders are taken into consideration for addressing the continuous treatment effect with observational data in a medical context.
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v2 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.
    Neural Ensemble Search for Uncertainty Estimation and Dataset Shift. (arXiv:2006.08573v3 [cs.LG] UPDATED)
    Ensembles of neural networks achieve superior performance compared to stand-alone networks in terms of accuracy, uncertainty calibration and robustness to dataset shift. \emph{Deep ensembles}, a state-of-the-art method for uncertainty estimation, only ensemble random initializations of a \emph{fixed} architecture. Instead, we propose two methods for automatically constructing ensembles with \emph{varying} architectures, which implicitly trade-off individual architectures' strengths against the ensemble's diversity and exploit architectural variation as a source of diversity. On a variety of classification tasks and modern architecture search spaces, we show that the resulting ensembles outperform deep ensembles not only in terms of accuracy but also uncertainty calibration and robustness to dataset shift. Our further analysis and ablation studies provide evidence of higher ensemble diversity due to architectural variation, resulting in ensembles that can outperform deep ensembles, even when having weaker average base learners. To foster reproducibility, our code is available: \url{https://github.com/automl/nes}
    How and When Random Feedback Works: A Case Study of Low-Rank Matrix Factorization. (arXiv:2111.08706v2 [cs.NE] UPDATED)
    The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure -- low-rank matrix factorization. In this problem, given a matrix $Y_{n\times m}$, the goal is to find a low rank factorization $Z_{n \times r}W_{r \times m}$ that minimizes the error $\|ZW-Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r\ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their error $\|ZW-Y\|_F$ is approximately equal. As a corollary, these results also hold for training two-layer linear neural networks when the training input is isotropic, and the output is a linear function of the input.
    Mixture-of-Variational-Experts for Continual Learning. (arXiv:2110.12667v3 [cs.LG] UPDATED)
    One weakness of machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning (CL) paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We discuss this principle from a Bayesian perspective and show its connections to previous approaches to CL. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method in continual supervised learning and in continual reinforcement learning.
    Black-box Certification and Learning under Adversarial Perturbations. (arXiv:2006.16520v2 [stat.ML] UPDATED)
    We formally study the problem of classification under adversarial perturbations from a learner's perspective as well as a third-party who aims at certifying the robustness of a given black-box classifier. We analyze a PAC-type framework of semi-supervised learning and identify possibility and impossibility results for proper learning of VC-classes in this setting. We further introduce a new setting of black-box certification under limited query budget, and analyze this for various classes of predictors and perturbation. We also consider the viewpoint of a black-box adversary that aims at finding adversarial examples, showing that the existence of an adversary with polynomial query complexity can imply the existence of a sample efficient robust learner.
    A Cram\'er Distance perspective on Quantile Regression based Distributional Reinforcement Learning. (arXiv:2110.00535v2 [stat.ML] UPDATED)
    Distributional reinforcement learning (DRL) extends the value-based approach by approximating the full distribution over future returns instead of the mean only, providing a richer signal that leads to improved performances. Quantile Regression (QR) based methods like QR-DQN project arbitrary distributions into a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance. However, due to biases in the gradients, the quantile regression loss is used instead for training, guaranteeing the same minimizer and enjoying unbiased gradients. Non-crossing constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work is in the setting of fixed quantile levels and is twofold. First, we prove that the Cram\'er distance yields a projection that coincides with the 1-Wasserstein one and that, under non-crossing constraints, the squared Cram\'er and the quantile regression losses yield collinear gradients, shedding light on the connection between these important elements of DRL. Second, we propose a low complexity algorithm to compute the Cram\'er distance.
    Open-Set Support Vector Machines. (arXiv:1606.03802v11 [cs.LG] UPDATED)
    Often, when dealing with real-world recognition problems, we do not need, and often cannot have, knowledge of the entire set of possible classes that might appear during operational testing. In such cases, we need to think of robust classification methods able to deal with the "unknown" and properly reject samples belonging to classes never seen during training. Notwithstanding, existing classifiers to date were mostly developed for the closed-set scenario, i.e., the classification setup in which it is assumed that all test samples belong to one of the classes with which the classifier was trained. In the open-set scenario, however, a test sample can belong to none of the known classes and the classifier must properly reject it by classifying it as unknown. In this work, we extend upon the well-known Support Vector Machines (SVM) classifier and introduce the Open-Set Support Vector Machines (OSSVM), which is suitable for recognition in open-set setups. OSSVM balances the empirical risk and the risk of the unknown and ensures that the region of the feature space in which a test sample would be classified as known (one of the known classes) is always bounded, ensuring a finite risk of the unknown. In this work, we also highlight the properties of the SVM classifier related to the open-set scenario, and provide necessary and sufficient conditions for an RBF SVM to have bounded open-space risk.  ( 3 min )
    Unadjusted Langevin algorithm for sampling a mixture of weakly smooth potentials. (arXiv:2112.09311v2 [stat.CO] UPDATED)
    Discretization of continuous-time diffusion processes is a widely recognized method for sampling. However, it seems to be a considerable restriction when the potentials are often required to be smooth (gradient Lipschitz). This paper studies the problem of sampling through Euler discretization, where the potential function is assumed to be a mixture of weakly smooth distributions and satisfies weakly dissipative. We establish the convergence in Kullback-Leibler (KL) divergence with the number of iterations to reach $\epsilon$-neighborhood of a target distribution in only polynomial dependence on the dimension. We relax the degenerated convex at infinity conditions of \citet{erdogdu2020convergence} and prove convergence guarantees under Poincar\'{e} inequality or non-strongly convex outside the ball. In addition, we also provide convergence in $L_{\beta}$-Wasserstein metric for the smoothing potential.  ( 2 min )
    An adaptive neuro-fuzzy model for attitude estimation and 2 control a 3 DOF system. (arXiv:2004.09756v3 [eess.SY] UPDATED)
    In recent decades, one of the scientists' main concerns has been to improve the accuracy of satellite attitude, regardless of the expense. The obvious result is that a large number of control strategies have been used to address this problem. In this study, an adaptive neuro-fuzzy integrated (ANFIS) satellite attitude estimation and control system was developed. The controller is trained with the data provided by an optimal controller. A pulse modulator is used to generate the right ON/OFF commands of the thruster actuator. To evaluate the performance of the AN-FIS controller in closed-loop simulation, an ANFIS observer is used to estimate the attitude and angular velocities of the satellite using magnetometer, sun sensor and data gyro data. In addition, a new ANFIS system will be proposed and evaluated that can jointly control and estimate the system. The performance of the ANFIS controller is compared to the optimal PID controller in a Monte Carlo simulation with different initial conditions, disturbance and noise. The results show that the ANFIS controller can surpass the optimal PID controller in several aspects, including time and smoothness. In addition, the ANFIS estimator is examined and the results demonstrate the high ability of this designated observers. Both the control and estimation phases are simulated by a single ANFIS subsystem, taking into account the high capacity of ANFIS, and the results of using the ANFIS model are demonstrated.  ( 2 min )
    FLEA: Provably Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v3 [cs.LG] UPDATED)
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that - given enough data - FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half.  ( 2 min )
    Efficient and Differentiable Conformal Prediction with General Function Classes. (arXiv:2202.11091v1 [cs.LG])
    Quantifying the data uncertainty in learning tasks is often done by learning a prediction interval or prediction set of the label given the input. Two commonly desired properties for learned prediction sets are \emph{valid coverage} and \emph{good efficiency} (such as low length or low cardinality). Conformal prediction is a powerful technique for learning prediction sets with valid coverage, yet by default its conformalization step only learns a single parameter, and does not optimize the efficiency over more expressive function classes. In this paper, we propose a generalization of conformal prediction to multiple learnable parameters, by considering the constrained empirical risk minimization (ERM) problem of finding the most efficient prediction set subject to valid empirical coverage. This meta-algorithm generalizes existing conformal prediction algorithms, and we show that it achieves approximate valid population coverage and near-optimal efficiency within class, whenever the function class in the conformalization step is low-capacity in a certain sense. Next, this ERM problem is challenging to optimize as it involves a non-differentiable coverage constraint. We develop a gradient-based algorithm for it by approximating the original constrained ERM using differentiable surrogate losses and Lagrangians. Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly over existing approaches in several applications such as prediction intervals with improved length, minimum-volume prediction sets for multi-output regression, and label prediction sets for image classification.  ( 2 min )
    Differentially Private Estimation of Heterogeneous Causal Effects. (arXiv:2202.11043v1 [stat.ML])
    Estimating heterogeneous treatment effects in domains such as healthcare or social science often involves sensitive data where protecting privacy is important. We introduce a general meta-algorithm for estimating conditional average treatment effects (CATE) with differential privacy (DP) guarantees. Our meta-algorithm can work with simple, single-stage CATE estimators such as S-learner and more complex multi-stage estimators such as DR and R-learner. We perform a tight privacy analysis by taking advantage of sample splitting in our meta-algorithm and the parallel composition property of differential privacy. In this paper, we implement our approach using DP-EBMs as the base learner. DP-EBMs are interpretable, high-accuracy models with privacy guarantees, which allow us to directly observe the impact of DP noise on the learned causal model. Our experiments show that multi-stage CATE estimators incur larger accuracy loss than single-stage CATE or ATE estimators and that most of the accuracy loss from differential privacy is due to an increase in variance, not biased estimates of treatment effects.  ( 2 min )
    Practical Schemes for Finding Near-Stationary Points of Convex Finite-Sums. (arXiv:2105.12062v2 [math.OC] UPDATED)
    In convex optimization, the problem of finding near-stationary points has not been adequately studied yet, unlike other optimality measures such as the function value. Even in the deterministic case, the optimal method (OGM-G, due to Kim and Fessler (2021)) has just been discovered recently. In this work, we conduct a systematic study of algorithmic techniques for finding near-stationary points of convex finite-sums. Our main contributions are several algorithmic discoveries: (1) we discover a memory-saving variant of OGM-G based on the performance estimation problem approach (Drori and Teboulle, 2014); (2) we design a new accelerated SVRG variant that can simultaneously achieve fast rates for minimizing both the gradient norm and function value; (3) we propose an adaptively regularized accelerated SVRG variant, which does not require the knowledge of some unknown initial constants and achieves near-optimal complexities. We put an emphasis on the simplicity and practicality of the new schemes, which could facilitate future work.  ( 2 min )
    Near-Optimal Performance Bounds for Orthogonal and Permutation Group Synchronization via Spectral Methods. (arXiv:2008.05341v3 [cs.IT] UPDATED)
    Group synchronization asks to recover group elements from their pairwise measurements. It has found numerous applications across various scientific disciplines. In this work, we focus on orthogonal and permutation group synchronization which are widely used in computer vision such as object matching and structure from motion. Among many available approaches, the spectral methods have enjoyed great popularity due to their efficiency and convenience. We will study the performance guarantees of the spectral methods in solving these two synchronization problems by investigating how well the computed eigenvectors approximate each group element individually. We establish our theory by applying the recent popular~\emph{leave-one-out} technique and derive a~\emph{block-wise} performance bound for the recovery of each group element via eigenvectors. In particular, for orthogonal group synchronization, we obtain a near-optimal performance bound for the group recovery in presence of additive Gaussian noise. For permutation group synchronization under random corruption, we show that the widely-used two-step procedure (spectral method plus rounding) can recover all the group elements exactly if the SNR (signal-to-noise ratio) is close to the information theoretical limit. Our numerical experiments confirm our theory and indicate a sharp phase transition for the exact group recovery.  ( 2 min )
    Robust and Provable Guarantees for Sparse Random Embeddings. (arXiv:2202.10815v1 [cs.LG])
    In this work, we improve upon the guarantees for sparse random embeddings, as they were recently provided and analyzed by Freksen at al. (NIPS'18) and Jagadeesan (NIPS'19). Specifically, we show that (a) our bounds are explicit as opposed to the asymptotic guarantees provided previously, and (b) our bounds are guaranteed to be sharper by practically significant constants across a wide range of parameters, including the dimensionality, sparsity and dispersion of the data. Moreover, we empirically demonstrate that our bounds significantly outperform prior works on a wide range of real-world datasets, such as collections of images, text documents represented as bags-of-words, and text sequences vectorized by neural embeddings. Behind our numerical improvements are techniques of broader interest, which improve upon key steps of previous analyses in terms of (c) tighter estimates for certain types of quadratic chaos, (d) establishing extreme properties of sparse linear forms, and (e) improvements on bounds for the estimation of sums of independent random variables.  ( 2 min )
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v2 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming algorithm that uses a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian, and performs stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the "liminf" of the KKT residuals converges to zero almost surely. Our algorithm and analysis further develop the work of Na et al. (2021) by allowing nonlinear inequality constraints without requiring the strict complementary condition. We demonstrate the performance of the algorithm on a subset of nonlinear problems collected in CUTEst test set.  ( 2 min )
    Pseudo Numerical Methods for Diffusion Models on Manifolds. (arXiv:2202.09778v1 [cs.CV] CROSS LISTED)
    Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at a high speedup rate, which limit their practicability. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules. Our implementation is available at https://github.com/luping-liu/PNDM.
    PyTorch Geometric Signed Directed: A Survey and Software on Graph Neural Networks for Signed and Directed Graphs. (arXiv:2202.10793v1 [cs.LG])
    Signed networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many signed networks are directed, there is a lack of survey papers and software packages on graph neural networks (GNNs) specially designed for directed networks. In this paper, we present PyTorch Geometric Signed Directed, a survey and software on GNNs for signed and directed networks. We review typical tasks, loss functions and evaluation metrics in the analysis of signed and directed networks, discuss data used in related experiments, and provide an overview of methods proposed. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. The software is presented in a modular fashion, so that signed and directed networks can also be treated separately. As an extension library for PyTorch Geometric, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. Our code is publicly available at \url{https://github.com/SherylHYX/pytorch_geometric_signed_directed}.  ( 2 min )
    MSTGD:A Memory Stochastic sTratified Gradient Descent Method with an Exponential Convergence Rate. (arXiv:2202.10923v1 [stat.ML])
    The fluctuation effect of gradient expectation and variance caused by parameter update between consecutive iterations is neglected or confusing by current mainstream gradient optimization algorithms.Using this fluctuation effect, combined with the stratified sampling strategy, this paper designs a novel \underline{M}emory \underline{S}tochastic s\underline{T}ratified Gradient Descend(\underline{MST}GD) algorithm with an exponential convergence rate. Specifically, MSTGD uses two strategies for variance reduction: the first strategy is to perform variance reduction according to the proportion p of used historical gradient, which is estimated from the mean and variance of sample gradients before and after iteration, and the other strategy is stratified sampling by category. The statistic \ $\bar{G}_{mst}$\ designed under these two strategies can be adaptively unbiased, and its variance decays at a geometric rate. This enables MSTGD based on $\bar{G}_{mst}$ to obtain an exponential convergence rate of the form $\lambda^{2(k-k_0)}$($\lambda\in (0,1)$,k is the number of iteration steps,$\lambda$ is a variable related to proportion p).Unlike most other algorithms that claim to achieve an exponential convergence rate, the convergence rate is independent of parameters such as dataset size N, batch size n, etc., and can be achieved at a constant step size.Theoretical and experimental results show the effectiveness of MSTGD  ( 2 min )
    On Uncertainty Estimation by Tree-based Surrogate Models in Sequential Model-based Optimization. (arXiv:2202.10669v1 [stat.ML])
    Sequential model-based optimization sequentially selects a candidate point by constructing a surrogate model with the history of evaluations, to solve a black-box optimization problem. Gaussian process (GP) regression is a popular choice as a surrogate model, because of its capability of calculating prediction uncertainty analytically. On the other hand, an ensemble of randomized trees is another option and has practical merits over GPs due to its scalability and easiness of handling continuous/discrete mixed variables. In this paper we revisit various ensembles of randomized trees to investigate their behavior in the perspective of prediction uncertainty estimation. Then, we propose a new way of constructing an ensemble of randomized trees, referred to as BwO forest, where bagging with oversampling is employed to construct bootstrapped samples that are used to build randomized trees with random splitting. Experimental results demonstrate the validity and good performance of BwO forest over existing tree-based models in various circumstances.  ( 2 min )
    Connecting Optimization and Generalization via Gradient Flow Path Length. (arXiv:2202.10670v1 [stat.ML])
    Optimization and generalization are two essential aspects of machine learning. In this paper, we propose a framework to connect optimization with generalization by analyzing the generalization error based on the length of optimization trajectory under the gradient flow algorithm after convergence. Through our approach, we show that, with a proper initialization, gradient flow converges following a short path with an explicit length estimate. Such an estimate induces a length-based generalization bound, showing that short optimization paths after convergence are associated with good generalization, which also matches our numerical results. Our framework can be applied to broad settings. For example, we use it to obtain generalization estimates on three distinct machine learning models: underdetermined $\ell_p$ linear regression, kernel regression, and overparameterized two-layer ReLU neural networks.  ( 2 min )
    Counterfactual Phenotyping with Censored Time-to-Events. (arXiv:2202.11089v1 [cs.LG])
    Estimation of treatment efficacy of real-world clinical interventions involves working with continuous outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Causal reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and helps determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response on multiple large randomized clinical trials originally conducted to assess appropriate treatments to reduce cardiovascular risk.  ( 2 min )
    Stochastic Causal Programming for Bounding Treatment Effects. (arXiv:2202.10806v1 [stat.ML])
    Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem, bounding treatment effects from multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. We consider a framework where observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.  ( 2 min )
    Random Graph Matching in Geometric Models: the Case of Complete Graphs. (arXiv:2202.10662v1 [math.ST])
    This paper studies the problem of matching two complete graphs with edge weights correlated through latent geometries, extending a recent line of research on random graph matching with independent edge weights to geometric models. Specifically, given a random permutation $\pi^*$ on $[n]$ and $n$ iid pairs of correlated Gaussian vectors $\{X_{\pi^*(i)}, Y_i\}$ in $\mathbb{R}^d$ with noise parameter $\sigma$, the edge weights are given by $A_{ij}=\kappa(X_i,X_j)$ and $B_{ij}=\kappa(Y_i,Y_j)$ for some link function $\kappa$. The goal is to recover the hidden vertex correspondence $\pi^*$ based on the observation of $A$ and $B$. We focus on the dot-product model with $\kappa(x,y)=\langle x, y \rangle$ and Euclidean distance model with $\kappa(x,y)=\|x-y\|^2$, in the low-dimensional regime of $d=o(\log n)$ wherein the underlying geometric structures are most evident. We derive an approximate maximum likelihood estimator, which provably achieves, with high probability, perfect recovery of $\pi^*$ when $\sigma=o(n^{-2/d})$ and almost perfect recovery with a vanishing fraction of errors when $\sigma=o(n^{-1/d})$. Furthermore, these conditions are shown to be information-theoretically optimal even when the latent coordinates $\{X_i\}$ and $\{Y_i\}$ are observed, complementing the recent results of [DCK19] and [KNW22] in geometric models of the planted bipartite matching problem. As a side discovery, we show that the celebrated spectral algorithm of [Ume88] emerges as a further approximation to the maximum likelihood in the geometric model.  ( 2 min )
    Convex Loss Functions for Contextual Pricing with Observational Posted-Price Data. (arXiv:2202.10944v1 [cs.LG])
    We study an off-policy contextual pricing problem where the seller has access to samples of prices which customers were previously offered, whether they purchased at that price, and auxiliary features describing the customer and/or item being sold. This is in contrast to the well-studied setting in which samples of the customer's valuation (willingness to pay) are observed. In our setting, the observed data is influenced by the historic pricing policy, and we do not know how customers would have responded to alternative prices. We introduce suitable loss functions for this pricing setting which can be directly optimized to find an effective pricing policy with expected revenue guarantees without the need for estimation of an intermediate demand function. We focus on convex loss functions. This is particularly relevant when linear pricing policies are desired for interpretability reasons, resulting in a tractable convex revenue optimization problem. We further propose generalized hinge and quantile pricing loss functions, which price at a multiplicative factor of the conditional expected value or a particular quantile of the valuation distribution when optimized, despite the valuation data not being observed. We prove expected revenue bounds for these pricing policies respectively when the valuation distribution is log-concave, and provide generalization bounds for the finite sample case. Finally, we conduct simulations on both synthetic and real-world data to demonstrate that this approach is competitive with, and in some settings outperforms, state-of-the-art methods in contextual pricing.  ( 2 min )
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments in Ride-sourcing Platforms. (arXiv:2202.10887v1 [stat.ME])
    Policy evaluation based on A/B testing has attracted considerable interest in digital marketing, but such evaluation in ride-sourcing platforms (e.g., Uber and Didi) is not well studied primarily due to the complex structure of their temporal and/or spatial dependent experiments. Motivated by policy evaluation in ride-sourcing platforms, the aim of this paper is to establish causal relationship between platform's policies and outcomes of interest under a switchback design. We propose a novel potential outcome framework based on a temporal varying coefficient decision process (VCDP) model to capture the dynamic treatment effects in temporal dependent experiments. We further characterize the average treatment effect by decomposing it as the sum of direct effect (DE) and indirect effect (IE). We develop estimation and inference procedures for both DE and IE. Furthermore, we propose a spatio-temporal VCDP to deal with spatiotemporal dependent experiments. For both VCDP models, we establish the statistical properties (e.g., weak convergence and asymptotic power) of our estimation and inference procedures. We conduct extensive simulations to investigate the finite-sample performance of the proposed estimation and inference procedures. We examine how our VCDP models can help improve policy evaluation for various dispatching and dispositioning policies in Didi.  ( 2 min )
    Sobolev Transport: A Scalable Metric for Probability Measures with Graph Metrics. (arXiv:2202.10723v1 [cs.LG])
    Optimal transport (OT) is a popular measure to compare probability distributions. However, OT suffers a few drawbacks such as (i) a high complexity for computation, (ii) indefiniteness which limits its applicability to kernel machines. In this work, we consider probability measures supported on a graph metric space and propose a novel Sobolev transport metric. We show that the Sobolev transport metric yields a closed-form formula for fast computation and it is negative definite. We show that the space of probability measures endowed with this transport distance is isometric to a bounded convex set in a Euclidean space with a weighted $\ell_p$ distance. We further exploit the negative definiteness of the Sobolev transport to design positive-definite kernels, and evaluate their performances against other baselines in document classification with word embeddings and in topological data analysis.  ( 2 min )
    Explicit Regularization via Regularizer Mirror Descent. (arXiv:2202.10788v1 [cs.LG])
    Despite perfectly interpolating the training data, deep neural networks (DNNs) can often generalize fairly well, in part due to the "implicit regularization" induced by the learning algorithm. Nonetheless, various forms of regularization, such as "explicit regularization" (via weight decay), are often used to avoid overfitting, especially when the data is corrupted. There are several challenges with explicit regularization, most notably unclear convergence properties. Inspired by convergence properties of stochastic mirror descent (SMD) algorithms, we propose a new method for training DNNs with regularization, called regularizer mirror descent (RMD). In highly overparameterized DNNs, SMD simultaneously interpolates the training data and minimizes a certain potential function of the weights. RMD starts with a standard cost which is the sum of the training loss and a convex regularizer of the weights. Reinterpreting this cost as the potential of an "augmented" overparameterized network and applying SMD yields RMD. As a result, RMD inherits the properties of SMD and provably converges to a point "close" to the minimizer of this cost. RMD is computationally comparable to stochastic gradient descent (SGD) and weight decay, and is parallelizable in the same manner. Our experimental results on training sets with various levels of corruption suggest that the generalization performance of RMD is remarkably robust and significantly better than both SGD and weight decay, which implicitly and explicitly regularize the $\ell_2$ norm of the weights. RMD can also be used to regularize the weights to a desired weight vector, which is particularly relevant for continual learning.  ( 2 min )
    Learning Infomax and Domain-Independent Representations for Causal Effect Inference with Real-World Data. (arXiv:2202.10885v1 [stat.ML])
    The foremost challenge to causal inference with real-world data is to handle the imbalance in the covariates with respect to different treatment options, caused by treatment selection bias. To address this issue, recent literature has explored domain-invariant representation learning based on different domain divergence metrics (e.g., Wasserstein distance, maximum mean discrepancy, position-dependent metric, and domain overlap). In this paper, we reveal the weaknesses of these strategies, i.e., they lead to the loss of predictive information when enforcing the domain invariance; and the treatment effect estimation performance is unstable, which heavily relies on the characteristics of the domain distributions and the choice of domain divergence metrics. Motivated by information theory, we propose to learn the Infomax and Domain-Independent Representations to solve the above puzzles. Our method utilizes the mutual information between the global feature representations and individual feature representations, and the mutual information between feature representations and treatment assignment predictions, in order to maximally capture the common predictive information for both treatment and control groups. Moreover, our method filters out the influence of instrumental and irrelevant variables, and thus it effectively increases the predictive ability of potential outcomes. Experimental results on both the synthetic and real-world datasets show that our method achieves state-of-the-art performance on causal effect inference. Moreover, our method exhibits reliable prediction performances when facing data with different characteristics of data distributions, complicated variable types, and severe covariate imbalance.  ( 2 min )
    Distributed Sparse Multicategory Discriminant Analysis. (arXiv:2202.10913v1 [math.ST])
    This paper proposes a convex formulation for sparse multicategory linear discriminant analysis and then extend it to the distributed setting when data are stored across multiple sites. The key observation is that for the purpose of classification it suffices to recover the discriminant subspace which is invariant to orthogonal transformations. Theoretically, we establish statistical properties ensuring that the distributed sparse multicategory linear discriminant analysis performs as good as the centralized version after {a few rounds} of communications. Numerical studies lend strong support to our methodology and theory.  ( 2 min )
    CD-ROM: Complementary Deep-Reduced Order Model. (arXiv:2202.10746v1 [physics.flu-dyn])
    Model order reduction through the POD-Galerkin method can lead to dramatic gains in terms of computational efficiency in solving physical problems. However, the applicability of the method to non linear high-dimensional dynamical systems such as the Navier-Stokes equations has been shown to be limited, producing inaccurate and sometimes unstable models. This paper proposes a closure modeling approach for classical POD-Galerkin reduced order models (ROM). We use multi layer perceptrons (MLP) to learn a continuous in time closure model through the recently proposed Neural ODE method. Inspired by Taken's theorem as well as the Mori-Zwanzig formalism, we augment ROMs with a delay differential equation architecture to model non-Markovian effects in reduced models. The proposed model, called CD-ROM (Complementary Deep-Reduced Order Model) is able to retain information from past states of the system and use it to correct the imperfect reduced dynamics. The model can be integrated in time as a system of ordinary differential equations using any classical time marching scheme. We demonstrate the ability of our CD-ROM approach to improve the accuracy of POD-Galerkin models on two CFD examples, even in configurations unseen during training.  ( 2 min )
    Batched Dueling Bandits. (arXiv:2202.10660v1 [cs.LG])
    The $K$-armed dueling bandit problem, where the feedback is in the form of noisy pairwise comparisons, has been widely studied. Previous works have only focused on the sequential setting where the policy adapts after every comparison. However, in many applications such as search ranking and recommendation systems, it is preferable to perform comparisons in a limited number of parallel batches. We study the batched $K$-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and stochastic triangle inequality. For both settings, we obtain algorithms with a smooth trade-off between the number of batches and regret. Our regret bounds match the best known sequential regret bounds (up to poly-logarithmic factors), using only a logarithmic number of batches. We complement our regret analysis with a nearly-matching lower bound. Finally, we also validate our theoretical results via experiments on synthetic and real data.  ( 2 min )
    Confident Neural Network Regression with Bootstrapped Deep Ensembles. (arXiv:2202.10903v1 [stat.ML])
    With the rise of the popularity and usage of neural networks, trustworthy uncertainty estimation is becoming increasingly essential. In this paper we present a computationally cheap extension of Deep Ensembles for a regression setting called Bootstrapped Deep Ensembles that explicitly takes the effect of finite data into account using a modified version of the parametric bootstrap. We demonstrate through a simulation study that our method has comparable or better prediction intervals and superior confidence intervals compared to Deep Ensembles and other state-of-the-art methods. As an added bonus, our method is better capable of detecting overfitting than standard Deep Ensembles.  ( 2 min )
    Message passing all the way up. (arXiv:2202.11097v1 [cs.LG])
    The message passing framework is the foundation of the immense success enjoyed by graph neural networks (GNNs) in recent years. In spite of its elegance, there exist many problems it provably cannot solve over given input graphs. This has led to a surge of research on going "beyond message passing", building GNNs which do not suffer from those limitations -- a term which has become ubiquitous in regular discourse. However, have those methods truly moved beyond message passing? In this position paper, I argue about the dangers of using this term -- especially when teaching graph representation learning to newcomers. I show that any function of interest we want to compute over graphs can, in all likelihood, be expressed using pairwise message passing -- just over a potentially modified graph, and argue how most practical implementations subtly do this kind of trick anyway. Hoping to initiate a productive discussion, I propose replacing "beyond message passing" with a more tame term, "augmented message passing".  ( 2 min )
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v1 [stat.ML])
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on a pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, simulated and real datasets obtained from ridesharing companies.  ( 2 min )
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v1 [cs.LG])
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.  ( 2 min )
    Gaussian Processes and Statistical Decision-making in Non-Euclidean Spaces. (arXiv:2202.10613v1 [stat.ML])
    Bayesian learning using Gaussian processes provides a foundational framework for making decisions in a manner that balances what is known with what could be learned by gathering data. In this dissertation, we develop techniques for broadening the applicability of Gaussian processes. This is done in two ways. Firstly, we develop pathwise conditioning techniques for Gaussian processes, which allow one to express posterior random functions as prior random functions plus a dependent update term. We introduce a wide class of efficient approximations built from this viewpoint, which can be randomly sampled once in advance, and evaluated at arbitrary locations without any subsequent stochasticity. This key property improves efficiency and makes it simpler to deploy Gaussian process models in decision-making settings. Secondly, we develop a collection of Gaussian process models over non-Euclidean spaces, including Riemannian manifolds and graphs. We derive fully constructive expressions for the covariance kernels of scalar-valued Gaussian processes on Riemannian manifolds and graphs. Building on these ideas, we describe a formalism for defining vector-valued Gaussian processes on Riemannian manifolds. The introduced techniques allow all of these models to be trained using standard computational methods. In total, these contributions make Gaussian processes easier to work with and allow them to be used within a wider class of domains in an effective and principled manner. This, in turn, makes it possible to potentially apply Gaussian processes to novel decision-making settings.  ( 2 min )
    Accelerating Primal-dual Methods for Regularized Markov Decision Processes. (arXiv:2202.10506v1 [math.OC])
    Entropy regularized Markov decision processes have been widely used in reinforcement learning. This paper is concerned with the primal-dual formulation of the entropy regularized problems. Standard first-order methods suffer from slow convergence due to the lack of strict convexity and concavity. To address this issue, we first introduce a new quadratically convexified primal-dual formulation. The natural gradient ascent descent of the new formulation enjoys global convergence guarantee and exponential convergence rate. We also propose a new interpolating metric that further accelerates the convergence significantly. Numerical results are provided to demonstrate the performance of the proposed methods under multiple settings.  ( 2 min )
    A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets. (arXiv:2202.10574v1 [stat.ML])
    The two-sided markets such as ride-sharing companies often involve a group of subjects who are making sequential decisions across time and/or location. With the rapid development of smart phones and internet of things, they have substantially transformed the transportation landscape of human beings. In this paper we consider large-scale fleet management in ride-sharing companies that involve multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high-dimensionality of state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of the proposed method is available at https://github.com/RunzheStat/CausalMARL.  ( 2 min )
    A Globally Convergent Evolutionary Strategy for Stochastic Constrained Optimization with Applications to Reinforcement Learning. (arXiv:2202.10464v1 [cs.NE])
    Evolutionary strategies have recently been shown to achieve competing levels of performance for complex optimization problems in reinforcement learning. In such problems, one often needs to optimize an objective function subject to a set of constraints, including for instance constraints on the entropy of a policy or to restrict the possible set of actions or states accessible to an agent. Convergence guarantees for evolutionary strategies to optimize stochastic constrained problems are however lacking in the literature. In this work, we address this problem by designing a novel optimization algorithm with a sufficient decrease mechanism that ensures convergence and that is based only on estimates of the functions. We demonstrate the applicability of this algorithm on two types of experiments: i) a control task for maximizing rewards and ii) maximizing rewards subject to a non-relaxable set of constraints.  ( 2 min )
    Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations. (arXiv:2202.10638v1 [stat.ML])
    Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the correct data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation on image datasets.  ( 2 min )
    Order-Optimal Error Bounds for Noisy Kernel-Based Bayesian Quadrature. (arXiv:2202.10615v1 [stat.ML])
    In this paper, we study the sample complexity of {\em noisy Bayesian quadrature} (BQ), in which we seek to approximate an integral based on noisy black-box queries to the underlying function. We consider functions in a {\em Reproducing Kernel Hilbert Space} (RKHS) with the Mat\'ern-$\nu$ kernel, focusing on combinations of the parameter $\nu$ and dimension $d$ such that the RKHS is equivalent to a Sobolev class. In this setting, we provide near-matching upper and lower bounds on the best possible average error. Specifically, we find that when the black-box queries are subject to Gaussian noise having variance $\sigma^2$, any algorithm making at most $T$ queries (even with adaptive sampling) must incur a mean absolute error of $\Omega(T^{-\frac{\nu}{d}-1} + \sigma T^{-\frac{1}{2}})$, and there exists a non-adaptive algorithm attaining an error of at most $O(T^{-\frac{\nu}{d}-1} + \sigma T^{-\frac{1}{2}})$. Hence, the bounds are order-optimal, and establish that there is no adaptivity gap in terms of scaling laws.  ( 2 min )

  • Open

    How Kustomer utilizes custom Docker images & Amazon SageMaker to build a text classification pipeline
    This is a guest post by Kustomer’s Senior Software & Machine Learning Engineer, Ian Lantzy, and AWS team Umesh Kalaspurkar, Prasad Shetty, and Jonathan Greifenberger. In Kustomer’s own words, “Kustomer is the omnichannel SaaS CRM platform reimagining enterprise customer service to deliver standout experiences. Built with intelligent automation, we scale to meet the needs of […]  ( 8 min )
    Build, train, and deploy Amazon Lookout for Equipment models using the Python Toolbox
    Predictive maintenance can be an effective way to prevent industrial machinery failures and expensive downtime by proactively monitoring the condition of your equipment, so you can be alerted to any anomalies before equipment failures occur. Installing sensors and the necessary infrastructure for data connectivity, storage, analytics, and alerting are the foundational elements for enabling predictive […]  ( 9 min )
    Choose the best data source for your Amazon SageMaker training job
    Amazon SageMaker is a managed service that makes it easy to build, train, and deploy machine learning (ML) models. Data scientists use SageMaker training jobs to easily train ML models; you don’t have to worry about managing compute resources, and you pay only for the actual training time. Data ingestion is an integral part of […]  ( 17 min )
    How InpharmD uses Amazon Kendra and Amazon Lex to drive evidence-based patient care
    The intersection of DI and AI: Drug information refers to the discovery, use, and management of healthcare and medical information. Healthcare providers have many challenges associated with drug information discovery, such as intensive time involvement, lack of accessibility, and accuracy of reliable data. The average clinical query requires a literature search that takes an average of 18.5 hours. In addition, drug information often lies in disparate information silos, behind pay walls and design walls, and quickly becomes stale.  ( 5 min )
    Control formality in machine translated text using Amazon Translate
    Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. Amazon Translate now supports formality customization. This feature allows you to customize the level of formality in your translation output. At the time of writing, the formality customization feature is available for six target languages: French, German, Hindi, Italian, […]  ( 5 min )
  • Open

    [P] Lambda's Machine Learning Infrastructure Playbook and Best Practices
    I originally put this together for a conference last year and finally got a chance to upload it to youtube. I figured /r/machinelearning might like it --- here's the full document & video presentation associated with it: video presentation: https://www.youtube.com/watch?v=3EnIW0EZkr4 slide pdf: https://files.lambdalabs.com/lambda-machine-learning-infrastructure-playbook.pdf It covers both the business side and technical side of building infrastructure for your team. In the presentation, we summarize a customer survey where 80% of respondents reporting using on-prem infrastructure. That seems pretty high to me and I was wondering if you think that's representative of the community or if our sampling is biased? submitted by /u/sabalaba [link] [comments]  ( 1 min )
    [Discussion] What would be your examples of the different agent architectures?
    Hi everyone, new to the world of AI. I was doing some research and was wondering about all the different examples of agent architectures and which types of examples we could group into them. For example, I'd say that self-driving cars like the Tesla is an example of a goal based agent, although it could be a reinforcement learning agent, but I don't think so since it can't exactly be given a reward, but I could be wrong. Another example could be smart robots - which I think is a reinforcement learning agent. There's 5 different agent types: reflex, model, goal, utility and learning agents. I guess what I'm curious about is what would be your examples for each of the 5? I guess also since we humans are also intelligent agents, we're both goal based and reinforcement learning agents. What do you all think? 🤔🙂 submitted by /u/gonewiththestorm [link] [comments]  ( 1 min )
    [D] EU Artificial Intelligence Act: Risk Levels
    The EU proposes the first regulatory framework for AI that addresses the risks of AI and enables Europe to play a leading role globally. Four risk levels are defined Unacceptable Risk (Prohibited) High-Risk (Conformity Assessment) Limited Risk (Transparency obligation) Minimal Risk (No obligation) More details in the following article (Medium): EU Artificial Intelligence Act: Risk Levels submitted by /u/Philo167 [link] [comments]  ( 1 min )
    [D] Exploiting Prior Knowledge of Class Distribution in Semi-supervised Learning?
    Suppose we are training a classifier and have two datasets to work with: a labeled dataset that is not necessarily a representative sample of the population on which the classifier will be deployed an unlabeled dataset that is representative of the population the classifier will be used on We also have prior knowledge of the class distribution in the general population, and therefore in the unlabeled dataset as well. Does anyone know of any research into using this sort of prior knowledge to improve classification performance over some baseline semi-supervised learning method? For example, methods like consistency regularization favor decision boundaries that pass through regions with low density of unlabeled examples. If we know that the unlabeled set is e.g. 60% positive and 40% negative, is there any way to also encourage decision boundaries that split the unlabeled examples according to the known ratio? Or in the case where one class strongly dominates in the unlabeled data, e.g. 5% positive and 95% negative, can we somehow make use of the fact that unlabeled examples are probably a certain class? ​ Edit: for context, the reason I'm interested in the semi-supervised (vs pure supervised) case is because when all your data is labeled, the class distribution is not really "additional" information-- it's something that a model could directly infer from the training labels. With unlabeled data, the class distribution is additional information that cannot be directly inferred from the raw data. Intuitively, I feel like it should be possible to use this additional information to train a better model, but my literature search isn't turning up any results. submitted by /u/Haycart [link] [comments]  ( 2 min )
    [D] Finding dates for upcoming conferences
    Hey, So how do you guys go about finding the dates of upcoming conferences? Is there a website where you can find a list of upcoming conferences and their tentative deadlines? I did look into WikiCFP, but I think conferences there are added only after the call for papers is announced. Im trying to figure out tentative dates, so I can have a good timeline/target for myself. Thanks submitted by /u/Bibbidi_Babbidi_Boo [link] [comments]  ( 1 min )
    [D] Neural Radiance Fields in Real-World Applications
    Hey everybody, Last year there have been a couple of papers on how to make Neural radiance fields (NeRFs) more efficient so that they can be used for real-time applications. So I was wondering whether there are some public projects or papers that actually use NeRF-like approaches in real-world applications, for example in game-engines or VR/AR (not technical demos). Or perhaps somebody here in the ML community has experience with this? I could not find much regarding this online, and I am generally interested to what extent implicit rendering models are used already. submitted by /u/BreadRollsWithButter [link] [comments]  ( 1 min )
    [N] Upcoming talks for Laplace’s Causal Demon: A Webinar Series about Bayesian Inference and Causal Inference (week of 28 Feb.)
    We have three talks next week (week of 28 Feb.) for our upcoming webinar series about the intersection of Bayesian inference and causal inference. Our speakers will help us understand how we can use these two frameworks in order to solve applied problems, and will consider if these different frameworks are in conflict or are complimentary. The intended audience is machine learning practitioners and statisticians from academia and industry. ​ Upcoming talks, with Zoom registration links: 28 Feb. Christopher Sims - Large Parameter Spaces and Weighted Data: A Bayesian Perspective Bayesian analysis can suggest ignoring sampling weights, even in contexts where popular estimation methods like Horvitz-Thompson or “doubly robust” estimates do use the weights. Weights are sometimes treated in …  ( 2 min )
    [R] Open-Ended Reinforcement Learning with Neural Reward Functions
    Paper Abstract Inspired by the great success of unsupervised learning in Computer Vision and Natural Language Processing, the Reinforcement Learning community has recently started to focus more on unsupervised discovery of skills. Most current approaches, like DIAYN or DADS, optimize some form of mutual information objective. We propose a different approach that uses reward functions encoded by neural networks. These are trained iteratively to reward more complex behavior. In high-dimensional robotic environments our approach learns a wide range of interesting skills including front-flips for Half-Cheetah and one-legged running for Humanoid. In the pixel-based Montezuma’s Revenge environment our method also works with minimal changes and it learns complex skills that involve interacting with items and visiting diverse locations. Unsupervised Skills ​ https://i.redd.it/mbhnagfp2mj81.gif Web-link: https://as.inf.ethz.ch/research/open_ended_RL/main.html arXiv link: https://arxiv.org/abs/2202.08266 submitted by /u/asierm [link] [comments]  ( 1 min )
    [D] Sentiment Analysis, Part 2 — How to choose pre-annotated datasets for Sentiment Analysis?
    Hello here 👋 I just posted the second blog post of a series about Sentiment Analysis: https://medium.com/besedo-engineering/sentiment-analysis-part-2-how-to-choose-pre-annotated-datasets-for-sentiment-analysis-d79736e8c147 This blog post is about how to choose the perfect dataset for your Sentiment Analysis project, which is one of the most important part of a project. I hope you'll like it! 🙌 Any feedback is appreciated 🙏 Bonus: There are still two blog posts on this series! submitted by /u/jade_mlc [link] [comments]  ( 1 min )
    [N] MIT talk by Max Tegmark on the Future of AI:
    https://www.youtube.com/watch?v=IGOuV6UyQ1Q&list=PLhCQIYxdniNuu422uTU91PRHoj3gxF0uD&index=3 submitted by /u/Corddot [link] [comments]
    [D] Comparing latent spaces learned on similar/identical data
    I have a very general question about latent spaces. It seems like there are many different neural network architectures that project input data into some sort of latent space and then make classifications, predictions, or generate new data based on that latent space. My question is, are there any accepted practices or standard methods for comparing latent spaces learned on similar or identical datasets? A trivial example would be for an autoencoder. If you had a single dataset, you could train multiple autoencoder architectures and compare how well the input and output match for each architecture. However, latent spaces exist in lots of different applications outside of autoencoders, and it seems like there might be useful ways to compare them beyond "did this reconstruct the input perfectly". For example, two different text-classification neural networks (NN1 and NN2) with latent spaces of the same dimension might project text samples very differently. It might enable NN1 to classify some samples well and others poorly, while NN2 might perform better on the opposite samples. It seems like it might be useful to understand the similarities and differences between each latent space, or maybe to figure out how one latent space maps onto another. Please let me know if my question isn't clear, or if it's trivial. I did some google-ing but I'm not always sure what terms to use. Thanks! submitted by /u/kinnunenenenen [link] [comments]  ( 3 min )
    [D] - What is the best cloud provider with a budget of £250?
    I am looking for a Cloud provider to run Machine Learning jobs for my final year project. The department gives us a budget of £250, which I was thinking of using all for running cloud jobs. What’s the best cloud provider? Are there any special offers for academics? submitted by /u/ats678 [link] [comments]  ( 1 min )
    [Discussion] How do you Automate your Jupyter Notebooks?
    Hi All! I was wondering what are some of the ways that people use to automate their notebook works? The main challenges I had in mind are version control, monitoring, better collaboration within the team, standardization across teams, and fast cloud deployments. ​ Disclosure, I'm part of the Ploomber team (an open source framework) and it solves most if not all of those problems. submitted by /u/idomic [link] [comments]  ( 1 min )
    [R] Deepmind: A data-driven approach for learning to control computers
    submitted by /u/ReasonablyBadass [link] [comments]  ( 1 min )
    [R] A list of Open source tools and research papers on Data Centric AI
    The Data Centric approach to building AI systems focuses on data instead of code. We thought it would be cool if there was a central repository of all things Data Centric AI, so we set out to build one. We have put together a list of research papers and open-source tools on Data Centric AI, that we think you will find useful. https://mindkosh.com/data-centric-ai/research-papers.html https://mindkosh.com/data-centric-ai/open-source-tools.html submitted by /u/ifcarscouldspeak [link] [comments]  ( 1 min )
    [D] Handling variable number of outputs in an NN
    Hello, I was wondering whether it is possible to construct an architecture that handles a variable number of outputs (from a single input). The scenario is as follows: ​ In the dataset, all samples have 3 output labels. However, some samples might have additional labels besides the first 3 (e.g. 6, 9 or 12) and they are in groups of 3. That is, output labels 4, 5 and 6 are either all present or none are present for a sample. ​ I could not find any literature on these types of problems, so I started thinking of an architecture that could learn this end-to-end. This is my (naive) attempt at that. Naive implementation of variable output network ​ Essentially, the input batch is first passed through a feature extractor (think BERT). Then, the labels which are known to be present in all labels are predicted via individual classification layers. ​ For the groups of labels which are not present in all samples, we first pass the features through individual binary classifiers (the `group x present?` boxes). If the result is positive, i.e. the group of labels is present, we pass on the features to another set of classification layers. As a loss, I was thinking of using binary cross entropy for the `group x present?` boxes, followed by a binary threshold activation, and regular cross entropy for the group classification layers. ​ Getting to my question, do you think this kind of construction makes sense? Did I create a monster? Are you aware of any literature that tackles these kinds of problems? ​ Thank you in advance! submitted by /u/MurlocXYZ [link] [comments]  ( 2 min )
    [D] Is it better to train with fine-grained labels, or more broad, super-labels which encompass the fine-grained ones?
    Lets say you're interested in training a Fruit / Vegetable classifier You could train with general labels of Fruit and Vegetable - OR - You could train with fine-grained labels of Apple, Pear, Broccoli, and Eggplant, which you will convert into Fruit/Vegetable during inference. My intuition is telling me that fine-grained labels will provide more detailed losses, forcing the model to learn a wider range of features. My other intuition is telling me that learning to separate Apple and Pear doesn't assist in separating Fruit and Vegetable. Is there any research, or even your own experience on whether you get better results training a neural network with general classes vs fine-grained classes? submitted by /u/lukeiy [link] [comments]  ( 2 min )
    [D] Universal Approximation of Functions on Sets
    Abs: " Modelling functions of sets, or equivalently, permutation-invariant functions, is a long-standing challenge in machine learning. Deep Sets is a popular method which is known to be a universal approximator for continuous set functions. We provide a theoretical analysis of Deep Sets which shows that this universal approximation property is only guaranteed if the model's latent space is sufficiently high-dimensional. If the latent space is even one dimension lower than necessary, there exist piecewise-affine functions for which Deep Sets performs no better than a naïve constant baseline, as judged by worst-case error. Deep Sets may be viewed as the most efficient incarnation of the Janossy pooling paradigm. We identify this paradigm as encompassing most currently popular set-learning methods. Based on this connection, we discuss the implications of our results for set learning more broadly, and identify some open questions on the universality of Janossy pooling in general. " ​ Arxiv: https://arxiv.org/abs/2107.01959 A YT vid over paper: https://www.youtube.com/watch?v=V9A6S8uGuPs&ab_channel=ZachData submitted by /u/Consistent_Data_7233 [link] [comments]  ( 1 min )
    [P] anomaly detection in time series data
    Hi All! What could be some approaches while dealing with anomaly detection in time series data? Im looking at the data that doesn’t show seasonality as much, it’s the sudden changes compared to previous data points that need to be identified as anomalies. Please suggest methods that i could use. Simple stats, sophisticated models any help is greatly appreciated! submitted by /u/Tasty_Hat1965 [link] [comments]  ( 1 min )
    [N] Who Is Behind QAnon? Linguistic Detectives Find Fingerprints using statistics and machine learning
    According to the New-York Times, using machine learning, stylometry, and statistics on Q texts, two separate teams of NLP researchers from France and Swiss have identified the same two men as likely authors of messages that fueled the QAnon movement. First the initiator, Paul Furber, a South African software developer and then Ron Watkins took over, who operated 8chan website where the Q messages began appearing in 2018 and is now running election for Republican in Arizona. submitted by /u/ClaudeCoulombe [link] [comments]  ( 3 min )
  • Open

    Chatbot
    I don't know too much about neural networks, as I'm only a high school student, however I learned how they work and made a few simple programs. I'm curious, if it's possible to make a chatbot, that would be able to communicate in many topics, using the following logic: there would be a neural network, which would get the user's text as an input and try to categorize it (eg. Movies, Weather, Astronomy, News, etc.) and then the output would be the category. All the categories would have their own neural networks, and for example if the output is "Movies", it would send the user data as an input into the movies neural network, and that neural network would find an appropriate answer. For example the user says "What should I watch?", then the first NN would categorize it as "Movies" and then the movies NN would choose the accurate answer from answers like "I heard Interstellar is an awesome sci-fi" and "I don't like horror movies" etc. Is it possible, and if so, could anyone recommend me tutorials/blogs/sources of building something similar? submitted by /u/Quirky-Temperature18 [link] [comments]  ( 1 min )
    Machine Learning in Scraping with Rails
    submitted by /u/Kagermanov [link] [comments]
    Which AI can you recommend to me, which can generate normal texts that make as much sense as possible, where I can say at certain points that a word belongs in the text.
    submitted by /u/xXLisa28Xx [link] [comments]  ( 1 min )
    Introducing Artificial Intelligence Survey Newsletter
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    EU Artificial Intelligence Act: Risk Levels
    submitted by /u/Philo167 [link] [comments]  ( 1 min )
    Question about true positive and false positive detections in object detection
    Hi all, I am computing true positives, false positives, and false negatives in order to calculate my model's precision and recall. I am using YOLOv5. According to this source and this one, an IoU overlap between the predicted and ground truth bounding box greater than 50% is a true positive detection (we are assuming we set the IoU threshold at 0.5), and an overlap less than 50% is considered to be a false positive. Let's say you have multiple objects in your image that you are detecting. You label one object as "class A", but the model predicts it as "class B." Am I correct that this is also a false positive? As I understand it, there are two ways of getting a false positive: Achieving an IoU overlap of less than 50% of the correctly-predicted class, and identifying any overlap of an incorrect class? submitted by /u/EnvironmentalLemon36 [link] [comments]  ( 1 min )
    Blog on 'Support Vector Machine'
    I published a new blog post on Support Vector Machine which covers intuitional understanding to handling non-linear datapoints (Kernel SVM). ​ Check it out. https://medium.com/@wasimmadha/decision-boundaries-using-support-vector-machines-svms-9554e0e1a5a9 submitted by /u/wasimakil [link] [comments]
    Hey Folks, Here is a really cool research by IBM Researchers where they built a Machine Learning model that can help harness the power of enzymes for Greener Chemistry
    The development of ecologically acceptable biochemical substitutes for industrial processes might be accelerated thanks to nature’s molecular machinery. Enzymes are the master accelerators of nearly every activity in the human body, helping with everything from digestion to the breakdown of hazardous chemicals and even DNA replication. Enzymes’ relevance extends beyond biology; they’re also utilized to make industrial chemical processes more environmentally friendly by reducing energy consumption and the number of harmful solvents needed in their production. The enzyme Xylanase treatment in paper manufacture, for example, has been demonstrated to reduce chlorine usage by 15% and toxic adsorbable organic halides (a chlorine byproduct) by 25% when producing white paper for printing or use in notebooks. You can continue reading this article and tell us your feedback. Github: https://github.com/rxn4chemistry/biocatalysis-model Project: https://rxn.res.ibm.com/ Paper: https://www.nature.com/articles/s41467-022-28536-w.pdf submitted by /u/techsucker [link] [comments]  ( 1 min )
    Lifetime Access to 170+ GPT3 Resources
    Hi Makers, Good day. Here I am with my next product. https://shotfox.gumroad.com/l/gpt-3resources For the past few months, I am working on collecting all the GPT-3 related resources, that inlcludes, tweets, github repos, articles, and much more for my next GPT-3 product idea. By now, the resource count have reached almost 170+ and thought of putting this valuable database to public and here I am. If you are also someone who is admirer of GPT-3 and wanted to know from its basics till where it is used in the current world, this resource database would help you a lot. Have categorized the resources into multiple as below: Articles Code Generator Content Creation Design Fun Ideas Github Repos GPT3 Community Ideas Notable Takes Products Reasoning Social Media Marketing Text processing Tutorial Utilities Website Builder submitted by /u/bhaskar2191 [link] [comments]  ( 1 min )
    A new study finds that in a remote environment, an artificial intelligence (AI) tutoring system can outperform expert human instructors.
    submitted by /u/qptbook [link] [comments]
    coursera stanford course
    Hello guys. I was looking into learning AI. I googled online courses and I came across this stanford course https://www.coursera.org/learn/machine-learning . Is this a good way to introduce myself into the field? Also at the end, i can pay money to get certified, is that certificate worth the money i will pay? thank you for helping submitted by /u/playboicairoo [link] [comments]  ( 1 min )
    Deepfakes Can Effectively Fool Many Major Facial ‘Liveness’ APIs
    submitted by /u/DaveBowman1975 [link] [comments]
    Explore MLIR Development Process
    submitted by /u/Just0by [link] [comments]
    DeepMind Trains Agents to Control Computers as Humans Do to Solve Everyday Tasks
    submitted by /u/nick7566 [link] [comments]  ( 1 min )
    Animate Your Pictures Realistically With AI!
    submitted by /u/OnlyProggingForFun [link] [comments]
    H2O.ai Democratizes State-of-the-Art Deep Learning for Data Scientists and Developers with H2O Hydrogen Torch
    Application development that combines sophisticated technologies such as AI and ML is progressing, this time by combining multiple deployment outcomes into a single general-purpose, no-code platform. Line-of-business users may easily transition from studying data records to natural language processing, image, and video outputs in this manner. Because so many software developers can only focus on one of those outputs at a time, this type of adaptability for non-IT personnel hasn’t been offered on the market. Market heavyweights like DataRobot, Amazon Web Services, Microsoft, DataBricks, and SAS don’t provide this particular feature. H2O.ai, on the other hand, has set out to overcome this problem. The company is situated in Mountain View, California. H2O.ai has announced H2O Hydrogen Torch, a significant new addition to its open-source platform. This feature is a deep-learning training engine that, according to the company, makes it easy for every size and area of the company to generate a cutting-edge image, video, and natural language processing (NLP) models without scripting. These models may be utilized in the field to uncover fresh business insights about consumers, rivals, the market, and other topics. CONTINUE READING https://h2o.ai/platform/h2o-hydrogen-torch/ submitted by /u/techsucker [link] [comments]  ( 1 min )
  • Open

    Is reinforcement learning a scam?
    It promises us to create agents, basically to replace people forever. But it only works for games LMAO, well maybe some game developer will sit down and code a plausible simulation of REALITY. Thoughts? submitted by /u/JewishRevenge1939 [link] [comments]  ( 1 min )
  • Open

    "Yann LeCun on a vision to make AI systems learn and reason like animals and humans" (sketching an AGI arch using self-supervised learning)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Do you know of any companies that are applying reinforcement learning in production in agriculture or healthcare?
    I have come across many academic papers that propose solutions in both industries but have not found many companies that use RL as part of their core product - any specific companies would be much appreciated! submitted by /u/notabot789 [link] [comments]  ( 1 min )
    Viable algorithms for multiple discrete actions
    I am currently considering applying reinforcement learning to dynamically adjust prices for a tourist attraction. There are eight different timeslots which all have different amounts of tickets available. What I would like to do is to utilize an agent that will assign daily every individual time slot one of 5 different price categories. The agent would also look at external factors such as amount of tickets left per slot, current prices, demand forecast, days until arrival, etc. Meanwhile the reward would be the accumulated revenue. Formulating the action space into a single action at a time would result in 510 amount of possible different actions. I have considered two alternative approaches for this Have the agent have the following actions: move up slot, move down slot, increase price, decrease price, done adjusting prices for the day. Have an agent output multiple actions where each action is allocating a given slot one of five different pricing categories. My guess is that option two is better suited for this but I am not too familiar which which algorithms might provide the functionality that I need. If you can point in the right directions or if there is something that I might not be considering, any help would be appreciated. submitted by /u/TheSurvivingHalf [link] [comments]  ( 2 min )
    What do bias and variance terms mean in RL Algorithms w.r.t training /testing errors . Does high variance means simply high testing error .
    What are the main disadvantage due to high variance/bias ? submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
  • Open

    4D-Net: Learning Multi-Modal Alignment for 3D and Image Inputs in Time
    Posted by AJ Piergiovanni and Anelia Angelova, Research Scientists, Google Research While not immediately obvious, all of us experience the world in four dimensions (4D). For example, when walking or driving down the street we observe a stream of visual inputs, snapshots of the 3D world, which, when taken together in time, creates a 4D visual input. Today’s autonomous vehicles and robots are able to capture much of this information through various onboard sensing mechanisms, such as LiDAR and cameras. LiDAR is a ubiquitous sensor that uses light pulses to reliably measure the 3D coordinates of objects in a scene, however, it is also sparse and has a limited range — the farther one is from a sensor, the fewer points will be returned. This means that far-away objects might only get a hand…  ( 8 min )
  • Open

    COMPASS: COntrastive Multimodal Pretraining for AutonomouS Systems
    Humans have the fundamental cognitive ability to perceive the environment through multimodal sensory signals and utilize this to accomplish a wide variety of tasks. It is crucial that an autonomous agent can similarly perceive the underlying state of an environment from different sensors and appropriately consider how to accomplish a task. For example, localization (or […] The post COMPASS: COntrastive Multimodal Pretraining for AutonomouS Systems appeared first on Microsoft Research.  ( 11 min )
  • Open

    Meet the Omnivore: 3D Creator Makes Fine Art for Digital Era Inspired by Silk Road Masterpieces
    Within the Mogao Caves, a cultural crossroads along what was the Silk Road in northwestern China, lies a natural reserve of tens of thousands of historical documents, paintings and statues of the Buddha.  The post Meet the Omnivore: 3D Creator Makes Fine Art for Digital Era Inspired by Silk Road Masterpieces appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Unsupervised Skill Discovery with Contrastive Intrinsic Control
    Unsupervised Reinforcement Learning (RL), where RL agents pre-train with self-supervised rewards, is an emerging paradigm for developing RL agents that are capable of generalization. Recently, we released the Unsupervised RL Benchmark (URLB) which we covered in a previous post. URLB benchmarked many unsupervised RL algorithms across three categories — competence-based, knowledge-based, and data-based algorithms. A surprising finding was that competence-based algorithms significantly underperformed other categories. In this post we will demystify what has been holding back competence-based methods and introduce Contrastive Intrinsic Control (CIC), a new competence-based algorithm that is the first to achieve leading results on URLB. Results from benchmarking unsupervised RL algorithms To re…  ( 3 min )
  • Open

    How Good are Low-Rank Approximations in Gaussian Process Regression?. (arXiv:2112.06410v3 [stat.ML] UPDATED)
    We provide guarantees for approximate Gaussian Process (GP) regression resulting from two common low-rank kernel approximations: based on random Fourier features, and based on truncating the kernel's Mercer expansion. In particular, we bound the Kullback-Leibler divergence between an exact GP and one resulting from one of the afore-described low-rank approximations to its kernel, as well as between their corresponding predictive densities, and we also bound the error between predictive mean vectors and between predictive covariance matrices computed using the exact versus using the approximate GP. We provide experiments on both simulated data and standard benchmarks to evaluate the effectiveness of our theoretical bounds.  ( 2 min )
    Embedding Decomposition for Artifacts Removal in EEG Signals. (arXiv:2112.00989v2 [cs.LG] UPDATED)
    Electroencephalogram (EEG) recordings are often contaminated with artifacts. Various methods have been developed to eliminate or weaken the influence of artifacts. However, most of them rely on prior experience for analysis. Here, we propose an deep learning framework to separate neural signal and artifacts in the embedding space and reconstruct the denoised signal, which is called DeepSeparator. DeepSeparator employs an encoder to extract and amplify the features in the raw EEG, a module called decomposer to extract the trend, detect and suppress artifact and a decoder to reconstruct the denoised signal. Besides, DeepSeparator can extract the artifact, which largely increases the model interpretability. The proposed method is tested with a semi-synthetic EEG dataset and a real task-related EEG dataset, suggesting that DeepSeparator outperforms the conventional models in both EOG and EMG artifact removal. DeepSeparator can be extended to multi-channel EEG and data of any length. It may motivate future developments and application of deep learning-based EEG denoising. The code for DeepSeparator is available at https://github.com/ncclabsustech/DeepSeparator.  ( 2 min )
    On out-of-distribution detection with Bayesian neural networks. (arXiv:2110.06020v2 [cs.LG] UPDATED)
    The question whether inputs are valid for the problem a neural network is trying to solve has sparked interest in out-of-distribution (OOD) detection. It is widely assumed that Bayesian neural networks (BNNs) are well suited for this task, as the endowed epistemic uncertainty should lead to disagreement in predictions on outliers. In this paper, we question this assumption and show that proper Bayesian inference with function space priors induced by neural networks does not necessarily lead to good OOD detection. To circumvent the use of approximate inference, we start by studying the infinite-width case, where Bayesian inference can be exact due to the correspondence with Gaussian processes. Strikingly, the kernels derived from common architectural choices lead to function space priors which induce predictive uncertainties that do not reflect the underlying input data distribution and are therefore unsuited for OOD detection. Importantly, we find the OOD behavior in this limiting case to be consistent with the corresponding finite-width case. To overcome this limitation, useful function space properties can also be encoded in the prior in weight space, however, this can currently only be applied to a specified subset of the domain and thus does not inherently extend to OOD data. Finally, we argue that a trade-off between generalization and OOD capabilities might render the application of BNNs for OOD detection undesirable in practice. Overall, our study discloses fundamental problems when naively using BNNs for OOD detection and opens interesting avenues for future research.  ( 2 min )
    EDITS: Modeling and Mitigating Data Bias for Graph Neural Networks. (arXiv:2108.05233v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have shown superior performance in analyzing attributed networks in various web-based applications such as social recommendation and web search. Nevertheless, in high-stake decision-making scenarios such as online fraud detection, there is an increasing societal concern that GNNs could make discriminatory decisions towards certain demographic groups. Despite recent explorations on fair GNNs, these works are tailored for a specific GNN model. However, myriads of GNN variants have been proposed for different applications, and it is costly to fine-tune existing debiasing algorithms for each specific GNN architecture. Different from existing works that debias GNN models, we aim to debias the input attributed network to achieve fairer GNNs through feeding GNNs with less biased data. Specifically, we propose novel definitions and metrics to measure the bias in an attributed network, which leads to the optimization objective to mitigate bias. We then develop a framework EDITS to mitigate the bias in attributed networks while maintaining the performance of GNNs in downstream tasks. EDITS works in a model-agnostic manner, i.e., it is independent of any specific GNN. Experiments demonstrate the validity of the proposed bias metrics and the superiority of EDITS on both bias mitigation and utility maintenance. Open-source implementation: https://github.com/yushundong/EDITS.  ( 2 min )
    Learning Filterbanks for End-to-End Acoustic Beamforming. (arXiv:2111.04614v2 [eess.AS] UPDATED)
    Recent work on monaural source separation has shown that performance can be increased by using fully learned filterbanks with short windows. On the other hand it is widely known that, for conventional beamforming techniques, performance increases with long analysis windows. This applies also to most hybrid neural beamforming methods which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the Short-Time-Fourier Transform, also the analysis and synthesis filterbanks are learnt jointly with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using the recent Clarity Challenge data and show that by using learnt filterbanks it is possible to surpass oracle-mask based beamforming for short windows.  ( 2 min )
    DeepHAM: A Global Solution Method for Heterogeneous Agent Models with Aggregate Shocks. (arXiv:2112.14377v2 [econ.GN] UPDATED)
    An efficient, reliable, and interpretable global solution method, the Deep learning-based algorithm for Heterogeneous Agent Models (DeepHAM), is proposed for solving high dimensional heterogeneous agent models with aggregate shocks. The state distribution is approximately represented by a set of optimal generalized moments. Deep neural networks are used to approximate the value and policy functions, and the objective is optimized over directly simulated paths. In addition to being an accurate global solver, this method has three additional features. First, it is computationally efficient in solving complex heterogeneous agent models, and it does not suffer from the curse of dimensionality. Second, it provides a general and interpretable representation of the distribution over individual states, which is crucial in addressing the classical question of whether and how heterogeneity matters in macroeconomics. Third, it solves the constrained efficiency problem as easily as it solves the competitive equilibrium, which opens up new possibilities for studying optimal monetary and fiscal policies in heterogeneous agent models with aggregate shocks.  ( 2 min )
    Urban Fire Station Location Planning using Predicted Demand and Service Quality Index. (arXiv:2109.02160v2 [cs.LG] UPDATED)
    In this article, we propose a systematic approach for fire station location planning. We develop machine learning models, based on Random Forest and Extreme Gradient Boosting, for demand prediction and utilize the models further to define a generalized index to measure quality of fire service in urban settings. Our model is built upon spatial data collected from multiple different sources. Efficacy of proper facility planning depends on choice of candidates where fire stations can be located along with existing stations, if any. Also, the travel time from these candidates to demand locations need to be taken care of to maintain fire safety standard. Here, we propose a travel time based clustering technique to identify suitable candidates. Finally, we develop an optimization problem to select best locations to install new fire stations. Our optimization problem is built upon maximum coverage problem, based on integer programming. We further develop a two-stage stochastic optimization model to characterize the confidence in our decision outcome. We present a detailed experimental study of our proposed approach in collaboration with city of Victoria Fire Department, MN, USA. Our demand prediction model achieves true positive rate of 80% and false positive rate of 20% approximately. We aid Victoria Fire Department to select a location for a new fire station using our approach. We present detailed results on improvement statistics by locating a new facility, as suggested by our methodology, in the city of Victoria.  ( 2 min )
    Trap of Feature Diversity in the Learning of MLPs. (arXiv:2112.00980v2 [cs.LG] UPDATED)
    In this paper, we discover a two-phase phenomenon in the learning of multi-layer perceptrons (MLPs). I.e., in the first phase, the training loss does not decrease significantly, but the similarity of features between different samples keeps increasing, which hurts the feature diversity. We explain such a two-phase phenomenon in terms of the learning dynamics of the MLP. Furthermore, we propose two normalization operations to eliminate the two-phase phenomenon, which avoids the decrease of the feature diversity and speeds up the training process.  ( 2 min )
    ImportantAug: a data augmentation agent for speech. (arXiv:2112.07156v2 [eess.AS] UPDATED)
    We introduce ImportantAug, a technique to augment training data for speech classification and recognition models by adding noise to unimportant regions of the speech and not to important regions. Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. The effectiveness of our method is illustrated on version two of the Google Speech Commands (GSC) dataset. On the standard GSC test set, it achieves a 23.3% relative error rate reduction compared to conventional noise augmentation which applies noise to speech without regard to where it might be most effective. It also provides a 25.4% error rate reduction compared to a baseline without data augmentation. Additionally, the proposed ImportantAug outperforms the conventional noise augmentation and the baseline on two test sets with additional noise added.  ( 2 min )
    Generalized Bayesian Upper Confidence Bound with Approximate Inference for Bandit Problems. (arXiv:2201.12955v2 [cs.LG] UPDATED)
    Bayesian bandit algorithms with approximate inference have been widely used in practice with superior performance. Yet, few studies regarding the fundamental understanding of their performances are available. In this paper, we propose a Bayesian bandit algorithm, which we call Generalized Bayesian Upper Confidence Bound (GBUCB), for bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that in Bernoulli multi-armed bandit, GBUCB can achieve $O(\sqrt{T}(\log T)^c)$ frequentist regret if the inference error measured by symmetrized Kullback-Leibler divergence is controllable. This analysis relies on a novel sensitivity analysis for quantile shifts with respect to inference errors. To our best knowledge, our work provides the first theoretical regret bound that is better than $o(T)$ in the setting of approximate inference. Our experimental evaluations on multiple approximate inference settings corroborate our theory, showing that our GBUCB is consistently superior to BUCB and Thompson sampling.  ( 2 min )
    A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits. (arXiv:2110.11819v2 [cs.LG] UPDATED)
    Motivated by the fact that humans like some level of unpredictability or novelty, and might therefore get quickly bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function. This enables the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy (which is NP-hard to compute). Similarly to previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.  ( 2 min )
    CAFE: Catastrophic Data Leakage in Vertical Federated Learning. (arXiv:2110.15122v4 [cs.LG] UPDATED)
    Recent studies show that private training data can be leaked through the gradients sharing mechanism deployed in distributed machine learning systems, such as federated learning (FL). Increasing batch size to complicate data recovery is often viewed as a promising defense strategy against data leakage. In this paper, we revisit this defense premise and propose an advanced data leakage attack with theoretical justification to efficiently recover batch data from the shared aggregated gradients. We name our proposed method as catastrophic data leakage in vertical federated learning (CAFE). Comparing to existing data leakage attacks, our extensive experimental results on vertical FL settings demonstrate the effectiveness of CAFE to perform large-batch data leakage attack with improved data recovery quality. We also propose a practical countermeasure to mitigate CAFE. Our results suggest that private data participated in standard FL, especially the vertical case, have a high risk of being leaked from the training gradients. Our analysis implies unprecedented and practical data leakage risks in those learning settings. The code of our work is available at https://github.com/DeRafael/CAFE.  ( 2 min )
    Bit-wise Training of Neural Network Weights. (arXiv:2202.09571v1 [cs.LG])
    We introduce an algorithm where the individual bits representing the weights of a neural network are learned. This method allows training weights with integer values on arbitrary bit-depths and naturally uncovers sparse networks, without additional constraints or regularization techniques. We show better results than the standard training technique with fully connected networks and similar performance as compared to standard training for convolutional and residual networks. By training bits in a selective manner we found that the biggest contribution to achieving high accuracy is given by the first three most significant bits, while the rest provide an intrinsic regularization. As a consequence more than 90\% of a network can be used to store arbitrary codes without affecting its accuracy. These codes may be random noise, binary files or even the weights of previously trained networks.  ( 2 min )
    Adaptive Gaussian Processes on Graphs via Spectral Graph Wavelets. (arXiv:2110.12752v2 [cs.LG] UPDATED)
    Graph-based models require aggregating information in the graph from neighbourhoods of different sizes. In particular, when the data exhibit varying levels of smoothness on the graph, a multi-scale approach is required to capture the relevant information. In this work, we propose a Gaussian process model using spectral graph wavelets, which can naturally aggregate neighbourhood information at different scales. Through maximum likelihood optimisation of the model hyperparameters, the wavelets automatically adapt to the different frequencies in the data, and as a result our model goes beyond capturing low frequency information. We achieve scalability to larger graphs by using a spectrum-adaptive polynomial approximation of the filter function, which is designed to yield a low approximation error in dense areas of the graph spectrum. Synthetic and real-world experiments demonstrate the ability of our model to infer scales accurately and produce competitive performances against state-of-the-art models in graph-based learning tasks.  ( 2 min )
    Towards Fine-Grained Reasoning for Fake News Detection. (arXiv:2110.15064v3 [cs.CL] UPDATED)
    The detection of fake news often requires sophisticated reasoning skills, such as logically combining information by considering word-level subtle clues. In this paper, we move towards fine-grained reasoning for fake news detection by better reflecting the logical processes of human thinking and enabling the modeling of subtle clues. In particular, we propose a fine-grained reasoning framework by following the human information-processing model, introduce a mutual-reinforcement-based method for incorporating human knowledge about which evidence is more important, and design a prior-aware bi-channel kernel graph network to model subtle differences between pieces of evidence. Extensive experiments show that our model outperforms the state-of-the-art methods and demonstrate the explainability of our approach.  ( 2 min )
    RELAX: Representation Learning Explainability. (arXiv:2112.10161v2 [stat.ML] UPDATED)
    Despite the significant improvements that representation learning via self-supervision has led to when learning from unlabeled data, no methods exist that explain what influences the learned representation. We address this need through our proposed approach, RELAX, which is the first approach for attribution-based explanations of representations. Our approach can also model the uncertainty in its explanations, which is essential to produce trustworthy explanations. RELAX explains representations by measuring similarities in the representation space between an input and masked out versions of itself, providing intuitive explanations and significantly outperforming the gradient-based baseline. We provide theoretical interpretations of RELAX and conduct a novel analysis of feature extractors trained using supervised and unsupervised learning, providing insights into different learning strategies. Finally, we illustrate the usability of RELAX in multi-view clustering and highlight that incorporating uncertainty can be essential for providing low-complexity explanations, taking a crucial step towards explaining representations.  ( 2 min )
    Fair Representation: Guaranteeing Approximate Multiple Group Fairness for Unknown Tasks. (arXiv:2109.00545v2 [cs.LG] UPDATED)
    Motivated by scenarios where data is used for diverse prediction tasks, we study whether fair representation can be used to guarantee fairness for unknown tasks and for multiple fairness notions simultaneously. We consider seven group fairness notions that cover the concepts of independence, separation, and calibration. Against the backdrop of the fairness impossibility results, we explore approximate fairness. We prove that, although fair representation might not guarantee fairness for all prediction tasks, it does guarantee fairness for an important subset of tasks -- the tasks for which the representation is discriminative. Specifically, all seven group fairness notions are linearly controlled by fairness and discriminativeness of the representation. When an incompatibility exists between different fairness notions, fair and discriminative representation hits the sweet spot that approximately satisfies all notions. Motivated by our theoretical findings, we propose to learn both fair and discriminative representations using pretext loss which self-supervises learning, and Maximum Mean Discrepancy as a fair regularizer. Experiments on tabular, image, and face datasets show that using the learned representation, downstream predictions that we are unaware of when learning the representation indeed become fairer for seven group fairness notions, and the fairness guarantees computed from our theoretical results are all valid.  ( 2 min )
    Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems. (arXiv:2201.01680v2 [cs.LG] UPDATED)
    This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.  ( 2 min )
    PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior. (arXiv:2106.06406v2 [stat.ML] UPDATED)
    Denoising diffusion probabilistic models have been recently proposed to generate high-quality samples by estimating the gradient of the data density. The framework defines the prior noise as a standard Gaussian distribution, whereas the corresponding data distribution may be more complicated than the standard Gaussian distribution, which potentially introduces inefficiency in denoising the prior noise into the data sample because of the discrepancy between the data and the prior. In this paper, we propose PriorGrad to improve the efficiency of the conditional diffusion model for speech synthesis (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive prior derived from the data statistics based on the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the speech synthesis domain, we consider the recently proposed diffusion-based speech generative models based on both the spectral and time domains and show that PriorGrad achieves faster convergence and inference with superior performance, leading to an improved perceptual quality and robustness to a smaller network capacity, and thereby demonstrating the efficiency of a data-dependent adaptive prior.  ( 2 min )
    Generative Adversarial Networks for Labelled Vibration Data Generation. (arXiv:2112.08195v1 [cs.LG] CROSS LISTED)
    As Structural Health Monitoring (SHM) being implemented more over the years, the use of operational modal analysis of civil structures has become more significant for the assessment and evaluation of engineering structures. Machine Learning (ML) and Deep Learning (DL) algorithms have been in use for structural damage diagnostics of civil structures in the last couple of decades. While collecting vibration data from civil structures is a challenging and expensive task for both undamaged and damaged cases, in this paper, the authors are introducing Generative Adversarial Networks (GAN) that is built on the Deep Convolutional Neural Network (DCNN) and using Wasserstein Distance for generating artificial labelled data to be used for structural damage diagnostic purposes. The authors named the developed model 1D W-DCGAN and successfully generated vibration data which is very similar to the input. The methodology presented in this paper will pave the way for vibration data generation for numerous future applications in the SHM domain.  ( 2 min )
    A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer. (arXiv:2110.08598v2 [eess.AS] UPDATED)
    We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation in conventional maximum a posteriori estimation with a risk of having a curse of dimensionality in estimating a huge number of model parameters, we focus our attention on estimating a manageable number of latent variables of DNNs via a VB inference framework. To accomplish model transfer, knowledge learnt from a source domain is encoded in prior distributions of latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Experimental results on device adaptation in acoustic scene classification show that our proposed VB approach can obtain good improvements on target devices, and consistently outperforms 13 state-of-the-art knowledge transfer algorithms.  ( 2 min )
    Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition. (arXiv:2110.05354v3 [cs.CL] UPDATED)
    Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing the computation cost. To overcome this, we propose an internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the token sequence probability which is approximated by the E2E model output after zeroing out the encoder contribution. During ILMA, we fine-tune the internal LM, i.e., the E2E components excluding the encoder, to minimize a cross-entropy loss. To make ILMA effective, it is essential to train the E2E model with an internal LM loss besides the standard E2E loss. Furthermore, we propose to regularize ILMA by minimizing the Kullback-Leibler divergence between the output distributions of the adapted and unadapted internal LMs. ILMA is the most effective when we update only the last linear layer of the joint network. ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost. Experimented with 30K-hour trained transformer transducer models, ILMA achieves up to 34.9% relative word error rate reduction from the unadapted baseline.  ( 2 min )
    Generative Adversarial Networks for Labeled Data Creation for Structural Monitoring and Damage Detection. (arXiv:2112.03478v2 [cs.LG] CROSS LISTED)
    There has been a drastic progression in the field of Data Science in the last few decades and other disciplines have been continuously benefitting from it. Structural Health Monitoring (SHM) is one of those fields that use Artificial Intelligence (AI) such as Machine Learning (ML) and Deep Learning (DL) algorithms for condition assessment of civil structures based on the collected data. The ML and DL methods require plenty of data for training procedures; however, in SHM, data collection from civil structures is very exhaustive; particularly getting useful data (damage associated data) can be very challenging. This paper uses 1-D Wasserstein Deep Convolutional Generative Adversarial Networks using Gradient Penalty (1-D WDCGAN-GP) for synthetic labeled vibration data generation. Then, implements structural damage detection on different levels of synthetically enhanced vibration datasets by using 1-D Deep Convolutional Neural Network (1-D DCNN). The damage detection results show that the 1-D WDCGAN-GP can be successfully utilized to tackle data scarcity in vibration-based damage diagnostics of civil structures. Keywords: Structural Health Monitoring (SHM), Structural Damage Diagnostics, Structural Damage Detection, 1-D Deep Convolutional Neural Networks (1-D DCNN), 1-D Generative Adversarial Networks (1-D GAN), Deep Convolutional Generative Adversarial Networks (DCGAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP)  ( 2 min )
    DKM: Differentiable K-Means Clustering Layer for Neural Network Compression. (arXiv:2108.12659v4 [cs.LG] UPDATED)
    Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering-based DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regularizers and parameters, DKM-based compression keeps the original loss function and model architecture fixed. We evaluated DKM-based compression on various DNN models for computer vision and natural language processing (NLP) tasks. Our results demonstrate that DKM delivers superior compression and accuracy trade-off on ImageNet1k and GLUE benchmarks. For example, DKM-based compression can offer 74.5% top-1 ImageNet1k accuracy on ResNet50 DNN model with 3.3MB model size (29.4x model compression factor). For MobileNet-v1, which is a challenging DNN to compress, DKM delivers 63.9% top-1 ImageNet1k accuracy with 0.72 MB model size (22.4x model compression factor). This result is 6.8% higher top-1accuracy and 33% relatively smaller model size than the current state-of-the-art DNN compression algorithms. Additionally, DKM enables compression of DistilBERT model by 11.8x with minimal (1.1%) accuracy loss on GLUE NLP benchmarks.  ( 2 min )
    Echofilter: A Deep Learning Segmentation Model Improves the Automation, Standardization, and Timeliness for Post-Processing Echosounder Data in Tidal Energy Streams. (arXiv:2202.09648v1 [cs.LG])
    Understanding the abundance and distribution of fish in tidal energy streams is important for assessing the risk presented by the introduction of tidal energy devices into the habitat. However, the impressive tidal currents that make sites favorable for tidal energy development are often highly turbulent and entrain air into the water, complicating the interpretation of echosounder data. The portion of the water column contaminated by returns from entrained air must be excluded from data used for biological analyses. Application of a single algorithm to identify the depth-of-penetration of entrained-air is insufficient for a boundary that is discontinuous, depth-dynamic, porous, and widely variable across the tidal flow speeds which can range from 0 to 5m/s. Using a case study at a tidal energy demonstration site in the Bay of Fundy, we describe the development and application of deep learning models that produce a pronounced, consistent, substantial, and measurable improvement of the automated detection of the extent to which entrained-air has penetrated the water column. Our model, Echofilter, was highly responsive to the dynamic range of turbulence conditions and sensitive to the fine-scale nuances in the boundary position, producing an entrained-air boundary line with an average error of 0.32m on mobile downfacing and 0.5-1.0m on stationary upfacing data. The model's annotations had a high level of agreement with the human segmentation (mobile downfacing Jaccard index: 98.8%; stationary upfacing: 93-95%). This resulted in a 50% reduction in the time required for manual edits compared to the time required to manually edit the line placed by currently available algorithms. Because of the improved initial automated placement, the implementation of the models generated a marked increase in the standardization and repeatability of line placement.  ( 2 min )
    IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. (arXiv:2110.13214v3 [cs.CV] UPDATED)
    Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images. However, aside from natural images, abstract diagrams with semantic richness are still understudied in visual understanding and reasoning research. In this work, we introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context. We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by real-world diagram word problems that highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning. Thus, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning. To facilitate potential IconQA models to learn semantic representations for icon images, we further release an icon dataset Icon645 which contains 645,687 colored icons on 377 classes. We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task. Also, we develop a strong IconQA baseline Patch-TRM that applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset. IconQA and Icon645 are available at https://iconqa.github.io.  ( 2 min )
    SHARP: Shielding-Aware Robust Planning for Safe and Efficient Human-Robot Interaction. (arXiv:2110.00843v2 [cs.RO] UPDATED)
    Jointly achieving safety and efficiency in human-robot interaction (HRI) settings is a challenging problem, as the robot's planning objectives may be at odds with the human's own intent and expectations. Recent approaches ensure safe robot operation in uncertain environments through a supervisory control scheme, sometimes called "shielding", which overrides the robot's nominal plan with a safety fallback strategy when a safety-critical event is imminent. These reactive "last-resort" strategies (typically in the form of aggressive emergency maneuvers) focus on preserving safety without efficiency considerations; when the nominal planner is unaware of possible safety overrides, shielding can be activated more frequently than necessary, leading to degraded performance. In this work, we propose a new shielding-based planning approach that allows the robot to plan efficiently by explicitly accounting for possible future shielding events. Leveraging recent work on Bayesian human motion prediction, the resulting robot policy proactively balances nominal performance with the risk of high-cost emergency maneuvers triggered by low-probability human behaviors. We formalize Shielding-Aware Robust Planning (SHARP) as a stochastic optimal control problem and propose a computationally efficient framework for finding tractable approximate solutions at runtime. Our method outperforms the shielding-agnostic motion planning baseline (equipped with the same human intent inference scheme) on simulated driving examples with human trajectories taken from the recently released Waymo Open Motion Dataset.  ( 2 min )
    A Survey of Large-Scale Deep Learning Serving System Optimization: Challenges and Opportunities. (arXiv:2111.14247v2 [cs.LG] UPDATED)
    Deep Learning (DL) models have achieved superior performance in many application domains, including vision, language, medical, commercial ads, entertainment, etc. With the fast development, both DL applications and the underlying serving hardware have demonstrated strong scaling trends, i.e., Model Scaling and Compute Scaling, for example, the recent pre-trained model with hundreds of billions of parameters with ~TB level memory consumption, as well as the newest GPU accelerators providing hundreds of TFLOPS. With both scaling trends, new problems and challenges emerge in DL inference serving systems, which gradually trends towards Large-scale Deep learning Serving systems (LDS). This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems. By providing a novel taxonomy, summarizing the computing paradigms, and elaborating the recent technique advances, we hope that this survey could shed light on new optimization perspectives and motivate novel works in large-scale deep learning system optimization.  ( 2 min )
    Internal Wasserstein Distance for Adversarial Attack and Defense. (arXiv:2103.07598v2 [cs.LG] UPDATED)
    Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks that would trigger misclassification of DNNs but may be imperceptible to human perception. Adversarial defense has been important ways to improve the robustness of DNNs. Existing attack methods often construct adversarial examples relying on some metrics like the $\ell_p$ distance to perturb samples. However, these metrics can be insufficient to conduct adversarial attacks due to their limited perturbations. In this paper, we propose a new internal Wasserstein distance (IWD) to capture the semantic similarity of two samples, and thus it helps to obtain larger perturbations than currently used metrics such as the $\ell_p$ distance We then apply the internal Wasserstein distance to perform adversarial attack and defense. In particular, we develop a novel attack method relying on IWD to calculate the similarities between an image and its adversarial examples. In this way, we can generate diverse and semantically similar adversarial examples that are more difficult to defend by existing defense methods. Moreover, we devise a new defense method relying on IWD to learn robust models against unseen adversarial examples. We provide both thorough theoretical and empirical evidence to support our methods.  ( 2 min )
    Fair Conformal Predictors for Applications in Medical Imaging. (arXiv:2109.04392v2 [eess.IV] UPDATED)
    Deep learning has the potential to augment several clinically useful aspects of the radiologist's workflow such as medical imaging interpretation. However, the translation of deep learning algorithms into clinical practice has been hindered by relative lack of transparency in these algorithms compared to more traditional statistical methods. Specifically, common deep learning models lack intuitive and rigorous methods of conveying prediction confidence in a calibrated manner, which ultimately restricts widespread use of these "black box" systems for critical decision-making. Furthermore, numerous demonstrations of algorithmic bias in clinical machine learning have caused considerable hesitancy towards the deployment of these models for clinical application. To this end, we explore how conformal predictions can complement existing deep learning approaches by providing an intuitive way of expressing model uncertainty to facilitate greater transparency to clinical users. In this paper, we conduct field interviews with radiologists to assess potential use-cases of conformal predictors. Using insights collected from these interviews, we devise two use-cases and empirically evaluate several conformal methods on a dermatology photography dataset for skin lesion classification. Additionally, we show how group conformal predictors are more adaptive to differences between patient skin tones for malignant skin lesions. We find our conformal predictors to be a promising and generally applicable approach to increasing clinical usability and trustworthiness -- hopefully facilitating better modes of collaboration between medical AI tools and their clinical users.  ( 2 min )
    Information Flow in Deep Neural Networks. (arXiv:2202.06749v2 [cs.LG] UPDATED)
    Although deep neural networks have been immensely successful, there is no comprehensive theoretical understanding of how they work or are structured. As a result, deep networks are often seen as black boxes with unclear interpretations and reliability. Understanding the performance of deep neural networks is one of the greatest scientific challenges. This work aims to apply principles and techniques from information theory to deep learning models to increase our theoretical understanding and design better algorithms. We first describe our information-theoretic approach to deep learning. Then, we propose using the Information Bottleneck (IB) theory to explain deep learning systems. The novel paradigm for analyzing networks sheds light on their layered structure, generalization abilities, and learning dynamics. We later discuss one of the most challenging problems of applying the IB to deep neural networks - estimating mutual information. Recent theoretical developments, such as the neural tangent kernel (NTK) framework, are used to investigate generalization signals. In our study, we obtained tractable computations of many information-theoretic quantities and their bounds for infinite ensembles of infinitely wide neural networks. With these derivations, we can determine how compression, generalization, and sample size pertain to the network and how they are related. At the end, we present the dual Information Bottleneck (dualIB). This new information-theoretic framework resolves some of the IB's shortcomings by merely switching terms in the distortion function. The dualIB can account for known data features and use them to make better predictions over unseen examples. An analytical framework reveals the underlying structure and optimal representations, and a variational framework using deep neural network optimization validates the results.  ( 2 min )
    Provably efficient machine learning for quantum many-body problems. (arXiv:2106.12627v3 [quant-ph] UPDATED)
    Classical machine learning (ML) provides a potentially powerful approach to solving challenging quantum many-body problems in physics and chemistry. However, the advantages of ML over more traditional methods have not been firmly established. In this work, we prove that classical ML algorithms can efficiently predict ground state properties of gapped Hamiltonians in finite spatial dimensions, after learning from data obtained by measuring other Hamiltonians in the same quantum phase of matter. In contrast, under widely accepted complexity theory assumptions, classical algorithms that do not learn from data cannot achieve the same guarantee. We also prove that classical ML algorithms can efficiently classify a wide range of quantum phases of matter. Our arguments are based on the concept of a classical shadow, a succinct classical description of a many-body quantum state that can be constructed in feasible quantum experiments and be used to predict many properties of the state. Extensive numerical experiments corroborate our theoretical results in a variety of scenarios, including Rydberg atom systems, 2D random Heisenberg models, symmetry-protected topological phases, and topologically ordered phases.
    Structured second-order methods via natural gradient descent. (arXiv:2107.10884v3 [stat.ML] UPDATED)
    In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v4 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed actions. Our IRL algorithm identifies optimality and then constructs set valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.
    Deep Learning for Technical Document Classification. (arXiv:2106.14269v5 [cs.LG] UPDATED)
    In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and automated document classification. Prior studies have only focused on processing text for classification, whereas technical documents often contain multimodal information. To leverage multimodal information for document classification to improve the model performance, this paper presents a novel multimodal deep learning architecture, TechDoc, which utilizes three types of information, including natural language texts and descriptive images within documents and the associations among the documents. The architecture synthesizes the convolutional neural network, recurrent neural network, and graph neural network through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model for classifying documents based on the hierarchical International Patent Classification system. Our results show that TechDoc presents a greater classification accuracy than the unimodal methods and other state-of-the-art benchmarks. The trained model can potentially be scaled to millions of real-world multimodal technical documents, which is useful for data and knowledge management in large technology companies and organizations.
    Subspace Regularizers for Few-Shot Class Incremental Learning. (arXiv:2110.07059v2 [cs.CV] UPDATED)
    Few-shot class incremental learning -- the problem of updating a trained classifier to discriminate among an expanded set of classes with limited labeled data -- is a key challenge for machine learning systems deployed in non-stationary environments. Existing approaches to the problem rely on complex model architectures and training procedures that are difficult to tune and re-use. In this paper, we present an extremely simple approach that enables the use of ordinary logistic regression classifiers for few-shot incremental learning. The key to this approach is a new family of subspace regularization schemes that encourage weight vectors for new classes to lie close to the subspace spanned by the weights of existing classes. When combined with pretrained convolutional feature extractors, logistic regression models trained with subspace regularization outperform specialized, state-of-the-art approaches to few-shot incremental image classification by up to 22% on the miniImageNet dataset. Because of its simplicity, subspace regularization can be straightforwardly extended to incorporate additional background information about the new classes (including class names and descriptions specified in natural language); these further improve accuracy by up to 2%. Our results show that simple geometric regularization of class representations offers an effective tool for continual learning.
    Node Feature Kernels Increase Graph Convolutional Network Robustness. (arXiv:2109.01785v3 [cs.LG] UPDATED)
    The robustness of the much-used Graph Convolutional Networks (GCNs) to perturbations of their input is becoming a topic of increasing importance. In this paper, the random GCN is introduced for which a random matrix theory analysis is possible. This analysis suggests that if the graph is sufficiently perturbed, or in the extreme case random, then the GCN fails to benefit from the node features. It is furthermore observed that enhancing the message passing step in GCNs by adding the node feature kernel to the adjacency matrix of the graph structure solves this problem. An empirical study of a GCN utilised for node classification on six real datasets further confirms the theoretical findings and demonstrates that perturbations of the graph structure can result in GCNs performing significantly worse than Multi-Layer Perceptrons run on the node features alone. In practice, adding a node feature kernel to the message passing of perturbed graphs results in a significant improvement of the GCN's performance, thereby rendering it more robust to graph perturbations. Our code is publicly available at:https://github.com/ChangminWu/RobustGCN.
    PURSUhInT: In Search of Informative Hint Points Based on Layer Clustering for Knowledge Distillation. (arXiv:2103.00053v2 [cs.LG] UPDATED)
    We propose a novel knowledge distillation methodology for compressing deep neural networks. One of the most efficient methods for knowledge distillation is hint distillation, where the student model is injected with information (hints) from several different layers of the teacher model. Although the selection of hint points can drastically alter the compression performance, conventional distillation approaches overlook this fact. Therefore, we propose a clustering based hint selection methodology, where the layers of teacher model are clustered with respect to several metrics and the cluster centers are used as the hint points. Our method is applicable for any student network, once it is applied on a chosen teacher network. The proposed approach is validated in CIFAR-100 and ImageNet datasets, using various teacher-student pairs and numerous hint distillation methods. Our results show that hint points selected by our algorithm results in superior compression performance with respect to state-of-the-art knowledge distillation algorithms on the same student models and datasets.
    Why KDAC? A general activation function for knowledge discovery. (arXiv:2111.13858v2 [cs.LG] UPDATED)
    Named entity recognition based on deep learning (DNER) can effectively mine expected knowledge from large-scale unstructured and semi-structured text, and has gradually become the paradigm of knowledge discovery. Currently, Tanh, ReLU and Sigmoid dominate DNER, however, these activation functions failed to treat gradient vanishing, no negative output or non-differentiable existence, which may impede DNER's exploration of knowledge caused by the omission and the incomplete representation of latent semantics. To surmount the non-negligible obstacle, we present a novel and general activation function termed KDAC. Detailly, KDAC is a thought that can aggregate and inherit the merits of Tanh and ReLU since they are widely leveraged in various knowledge domains. The positive region corresponds to an adaptive linear design encouraged by ReLU. The negative region considers the interaction between exponent and linearity to surmount the obstacle of gradient vanishing and no negative value. Crucially, the non-differentiable points are alerted and eliminated by a smooth approximation. We perform experiments based on BERT-BiLSTM-CNN-CRF model on six benchmark datasets containing different domain knowledge, such as Weibo, Clinical, E-commerce, Resume, HAZOP and People's daily. The experimental results show that KDAC is advanced and effective, and can provide more generalized activation to stimulate the performance of DNER. We hope that KDAC can be exploited as a promising alternative activation function in DNER to devote itself to the construction of knowledge.
    One-Step Abductive Multi-Target Learning with Diverse Noisy Samples: An Application to Tumour Segmentation for Breast Cancer. (arXiv:2110.10325v4 [cs.LG] UPDATED)
    One-step abductive multi-target learning (OSAMTL) is an approach proposed to handle complex noisy labels. However, OSAMTL is not suitable for the situation where diverse noisy samples (DNS) are provided for a learning task. In this paper, giving definition of DNS, we propose one-step abductive multi-target learning with DNS (OSAMTL-DNS) to expand the original OSAMTL to a wider range of tasks that handle complex noisy labels. Applying OSAMTL-DNS to tumour segmentation for breast cancer in medical histopathology whole slide image analysis, we show that OSAMTL-DNS is able to enable various state-of-the-art approaches for learning from noisy labels to achieve significantly more rational predictions.
    DiffCo: Auto-Differentiable Proxy Collision Detection with Multi-class Labels for Safety-Aware Trajectory Optimization. (arXiv:2102.07413v2 [cs.RO] UPDATED)
    The objective of trajectory optimization algorithms is to achieve an optimal collision-free path between a start and goal state. In real-world scenarios where environments can be complex and non-homogeneous, a robot needs to be able to gauge whether a state will be in collision with various objects in order to meet some safety metrics. The collision detector should be computationally efficient and, ideally, analytically differentiable to facilitate stable and rapid gradient descent during optimization. However, methods today lack an elegant approach to detect collision differentiably, relying rather on numerical gradients that can be unstable. We present DiffCo, the first, fully auto-differentiable, non-parametric model for collision detection. Its non-parametric behavior allows one to compute collision boundaries on-the-fly and update them, requiring no pre-training and allowing it to update continuously in dynamic environments. It provides robust gradients for trajectory optimization via backpropagation and is often 10-100x faster to compute than its geometric counterparts. DiffCo also extends trivially to modeling different object collision classes for semantically informed trajectory optimization.
    DeePN$^2$: A deep learning-based non-Newtonian hydrodynamic model. (arXiv:2112.14798v2 [physics.comp-ph] UPDATED)
    A long-standing problem in the modeling of non-Newtonian hydrodynamics of polymeric flows is the availability of reliable and interpretable hydrodynamic models that faithfully encode the underlying micro-scale polymer dynamics. The main complication arises from the long polymer relaxation time, the complex molecular structure and heterogeneous interaction. DeePN$^2$, a deep learning-based non-Newtonian hydrodynamic model, has been proposed and has shown some success in systematically passing the micro-scale structural mechanics information to the macro-scale hydrodynamics for suspensions with simple polymer conformation and bond potential. The model retains a multi-scaled nature by mapping the polymer configurations into a set of symmetry-preserving macro-scale features. The extended constitutive laws for these macro-scale features can be directly learned from the kinetics of their micro-scale counterparts. In this paper, we develop DeePN$^2$ using more complex micro-structural models. We show that DeePN$^2$ can faithfully capture the broadly overlooked viscoelastic differences arising from the specific molecular structural mechanics without human intervention.
    Z2P: Instant Visualization of Point Clouds. (arXiv:2105.14548v2 [cs.GR] UPDATED)
    We present a technique for visualizing point clouds using a neural network. Our technique allows for an instant preview of any point cloud, and bypasses the notoriously difficult surface reconstruction problem or the need to estimate oriented normals for splat-based rendering. We cast the preview problem as a conditional image-to-image translation task, and design a neural network that translates point depth-map directly into an image, where the point cloud is visualized as though a surface was reconstructed from it. Furthermore, the resulting appearance of the visualized point cloud can be, optionally, conditioned on simple control variables (e.g., color and light). We demonstrate that our technique instantly produces plausible images, and can, on-the-fly effectively handle noise, non-uniform sampling, and thin surfaces sheets.
    Quantization Backdoors to Deep Learning Commercial Frameworks. (arXiv:2108.09187v2 [cs.CR] UPDATED)
    Currently, there is a burgeoning demand for deploying deep learning (DL) models on ubiquitous edge Internet of Things (IoT) devices attributed to their low latency and high privacy preservation. However, DL models are often large in size and require large-scale computation, which prevents them from being placed directly onto IoT devices, where resources are constrained and 32-bit floating-point (float-32) operations are unavailable. Commercial framework (i.e., a set of toolkits) empowered model quantization is a pragmatic solution that enables DL deployment on mobile devices and embedded systems by effortlessly post-quantizing a large high-precision model (e.g., float-32) into a small low-precision model (e.g., int-8) while retaining the model inference accuracy. However, their usability might be threatened by security vulnerabilities. This work reveals that the standard quantization toolkits can be abused to activate a backdoor. We demonstrate that a full-precision backdoored model which does not have any backdoor effect in the presence of a trigger -- as the backdoor is dormant -- can be activated by the default i) TensorFlow-Lite (TFLite) quantization, the only product-ready quantization framework to date, and ii) the beta released PyTorch Mobile framework. When each of the float-32 models is converted into an int-8 format model through the standard TFLite or Pytorch Mobile framework's post-training quantization, the backdoor is activated in the quantized model, which shows a stable attack success rate close to 100% upon inputs with the trigger, while it behaves normally upon non-trigger inputs. This work highlights that a stealthy security threat occurs when an end user utilizes the on-device post-training model quantization frameworks, informing security researchers of cross-platform overhaul of DL models post quantization even if these models pass front-end backdoor inspections.
    Deep kernel machines: exact inference with representation learning in infinite Bayesian neural networks. (arXiv:2108.13097v3 [stat.ML] UPDATED)
    Deep neural networks (DNNs) with the flexibility to learn good top-layer representations have eclipsed shallow kernel methods without that flexibility. Here, we take inspiration from DNNs to develop the deep kernel machine. Optimizing the deep kernel machine objective is equivalent to exact Bayesian inference (or noisy gradient descent) in an infinitely wide Bayesian neural network or deep Gaussian process, which has been scaled carefully to retain representation learning. Our work thus has important implications for theoretical understanding of neural networks. In addition, we show that the deep kernel machine objective has more desirable properties and better performance than other choices of objective. Finally, we conjecture that the deep kernel machine objective is unimodal. We give a proof of unimodality for linear kernels, and a number of experiments in the nonlinear case in which all deep kernel machines initializations we tried converged to the same solution.
    Security Analysis of Camera-LiDAR Fusion Against Black-Box Attacks on Autonomous Vehicles. (arXiv:2106.07098v4 [cs.CR] UPDATED)
    To enable safe and reliable decision-making, autonomous vehicles (AVs) feed sensor data to perception algorithms to understand the environment. Sensor fusion with multi-frame tracking is becoming increasingly popular for detecting 3D objects. Thus, in this work, we perform an analysis of camera-LiDAR fusion, in the AV context, under LiDAR spoofing attacks. Recently, LiDAR-only perception was shown vulnerable to LiDAR spoofing attacks; however, we demonstrate these attacks are not capable of disrupting camera-LiDAR fusion. We then define a novel, context-aware attack: frustum attack, and show that out of 8 widely used perception algorithms - across 3 architectures of LiDAR-only and 3 architectures of camera-LiDAR fusion - all are significantly vulnerable to the frustum attack. In addition, we demonstrate that the frustum attack is stealthy to existing defenses against LiDAR spoofing as it preserves consistencies between camera and LiDAR semantics. Finally, we show that the frustum attack can be exercised consistently over time to form stealthy longitudinal attack sequences, compromising the tracking module and creating adverse outcomes on end-to-end AV control.
    Strategic Instrumental Variable Regression: Recovering Causal Relationships From Strategic Responses. (arXiv:2107.05762v2 [cs.LG] UPDATED)
    In settings where Machine Learning (ML) algorithms automate or inform consequential decisions about people, individual decision subjects are often incentivized to strategically modify their observable attributes to receive more favorable predictions. As a result, the distribution the assessment rule is trained on may differ from the one it operates on in deployment. While such distribution shifts, in general, can hinder accurate predictions, our work identifies a unique opportunity associated with shifts due to strategic responses: We show that we can use strategic responses effectively to recover causal relationships between the observable features and outcomes we wish to predict, even under the presence of unobserved confounding variables. Specifically, our work establishes a novel connection between strategic responses to ML models and instrumental variable (IV) regression by observing that the sequence of deployed models can be viewed as an instrument that affects agents' observable features but does not directly influence their outcomes. We show that our causal recovery method can be utilized to improve decision-making across several important criteria: individual fairness, agent outcomes, and predictive risk. In particular, we show that if decision subjects differ in their ability to modify non-causal attributes, any decision rule deviating from the causal coefficients can lead to(potentially unbounded) individual-level unfairness.
    On the Necessity of Auditable Algorithmic Definitions for Machine Unlearning. (arXiv:2110.11891v2 [cs.LG] UPDATED)
    Machine unlearning, i.e. having a model forget about some of its training data, has become increasingly more important as privacy legislation promotes variants of the right-to-be-forgotten. In the context of deep learning, approaches for machine unlearning are broadly categorized into two classes: exact unlearning methods, where an entity has formally removed the data point's impact on the model by retraining the model from scratch, and approximate unlearning, where an entity approximates the model parameters one would obtain by exact unlearning to save on compute costs. In this paper, we first show that the definition that underlies approximate unlearning, which seeks to prove the approximately unlearned model is close to an exactly retrained model, is incorrect because one can obtain the same model using different datasets. Thus one could unlearn without modifying the model at all. We then turn to exact unlearning approaches and ask how to verify their claims of unlearning. Our results show that even for a given training trajectory one cannot formally prove the absence of certain data points used during training. We thus conclude that unlearning is only well-defined at the algorithmic level, where an entity's only possible auditable claim to unlearning is that they used a particular algorithm designed to allow for external scrutiny during an audit.
    Transfer Learning in Quantum Parametric Classifiers: An Information-Theoretic Generalization Analysis. (arXiv:2201.06297v2 [quant-ph] UPDATED)
    A key step in quantum machine learning with classical inputs is the design of an embedding circuit mapping inputs to a quantum state. This paper studies a transfer learning setting in which classical-to-quantum embedding is carried out by an arbitrary parametric quantum circuit that is pre-trained based on data from a source task. At run time, the binary classifier is then optimized based on data from the target task of interest. Using an information-theoretic approach, we demonstrate that the average excess risk, or optimality gap, can be bounded in terms of two R\'enyi mutual information terms between classical input and quantum embedding under source and target tasks, as well as in terms of a measure of similarity between the source and target tasks related to the trace distance. The main theoretical results are validated on a simple binary classification example.
    Open-world Machine Learning: Applications, Challenges, and Opportunities. (arXiv:2105.13448v2 [cs.LG] UPDATED)
    Traditional machine learning mainly supervised learning, follows the assumptions of closed-world learning, i.e., for each testing class, a training class is available. However, such machine learning models fail to identify the classes which were not available during training time. These classes can be referred to as unseen classes. Whereas open-world machine learning (OWML) deals with unseen classes. In this paper, first, we present an overview of OWML with importance to the real-world context. Next, different dimensions of open-world machine learning are explored and discussed. The area of OWML gained the attention of the research community in the last decade only. We have searched through different online digital libraries and scrutinized the work done in the last decade. This paper presents a systematic review of various techniques for OWML. It also presents the research gaps, challenges, and future directions in open-world machine learning. This paper will help researchers understand the comprehensive developments of OWML and the likelihood of extending the research in suitable areas. It will also help to select applicable methodologies and datasets to explore this further.
    Bayesian Persuasion for Algorithmic Recourse. (arXiv:2112.06283v2 [cs.GT] UPDATED)
    When subjected to automated decision-making, decision subjects may strategically modify their observable features in ways they believe will maximize their chances of receiving a favorable decision. In many practical situations, the underlying assessment rule is deliberately kept secret to avoid gaming and maintain competitive advantage. The resulting opacity forces the decision subjects to rely on incomplete information when making strategic feature modifications. We capture such settings as a game of Bayesian persuasion, in which the decision maker offers a form of recourse to the decision subject by providing them with an action recommendation (or signal) to incentivize them to modify their features in desirable ways. We show that when using persuasion, both the decision maker and decision subject are never worse off in expectation, while the decision maker can be significantly better off. While the decision maker's problem of finding the optimal Bayesian incentive-compatible (BIC) signaling policy takes the form of optimization over infinitely-many variables, we show that this optimization can be cast as a linear program over finitely-many regions of the space of possible assessment rules. While this reformulation simplifies the problem dramatically, solving the linear program requires reasoning about exponentially-many variables, even under relatively simple settings. Motivated by this observation, we provide a polynomial-time approximation scheme that recovers a near-optimal signaling policy. Finally, our numerical simulations on semi-synthetic data empirically illustrate the benefits of using persuasion in the algorithmic recourse setting.
    Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization. (arXiv:2107.12580v2 [cs.LG] UPDATED)
    Central to the success of artificial neural networks is their ability to generalize. But does neural network generalization primarily rely on seeing highly similar training examples (memorization)? Or are neural networks capable of human-intelligence styled reasoning, and if so, to what extent? These remain fundamental open questions on artificial neural networks. In this paper, as steps towards answering these questions, we introduce a new benchmark, Pointer Value Retrieval (PVR) to study the limits of neural network reasoning. The PVR suite of tasks is based on reasoning about indirection, a hallmark of human intelligence, where a first stage (task) contains instructions for solving a second stage (task). In PVR, this is done by having one part of the task input act as a pointer, giving instructions on a different input location, which forms the output. We show this simple rule can be applied to create a diverse set of tasks across different input modalities and configurations. Importantly, this use of indirection enables systematically varying task difficulty through distribution shifts and increasing functional complexity. We conduct a detailed empirical study of different PVR tasks, discovering large variations in performance across dataset sizes, neural network architectures and task complexity. Further, by incorporating distribution shift and increased functional complexity, we develop nuanced tests for reasoning, revealing subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.
    Equivariant Graph Attention Networks for Molecular Property Prediction. (arXiv:2202.09891v1 [cs.LG])
    Learning and reasoning about 3D molecular structures with varying size is an emerging and important challenge in machine learning and especially in drug discovery. Equivariant Graph Neural Networks (GNNs) can simultaneously leverage the geometric and relational detail of the problem domain and are known to learn expressive representations through the propagation of information between nodes leveraging higher-order representations to faithfully express the geometry of the data, such as directionality in their intermediate layers. In this work, we propose an equivariant GNN that operates with Cartesian coordinates to incorporate directionality and we implement a novel attention mechanism, acting as a content and spatial dependent filter when propagating information between nodes. We demonstrate the efficacy of our architecture on predicting quantum mechanical properties of small molecules and its benefit on problems that concern macromolecular structures such as protein complexes.  ( 2 min )
    Meta-learning an Intermediate Representation for Few-shot Block-wise Prediction of Landslide Susceptibility. (arXiv:2110.04922v2 [cs.LG] UPDATED)
    Predicting a landslide susceptibility map (LSM) is essential for risk recognition and disaster prevention. Despite the successful application of data-driven approaches for LSM prediction, most methods generally apply a single global model to predict the LSM for an entire target region. However, in large-scale areas with significant environmental change, various parts of the region hold different landslide-inducing environments, and therefore, should be predicted with respective models. This study first segmented target scenarios into blocks for individual analysis. Then, the critical problem is that in each block with limited samples, conducting training and testing a model is impossible for a satisfactory LSM prediction, especially in dangerous mountainous areas where landslide surveying is expensive. To solve the problem, we trained an intermediate representation by the meta-learning paradigm, which is superior for capturing information valuable for few-shot adaption from LSM tasks. We hypothesized that there are more general and vital concepts concerning landslide causes and are sensitive to variations in input features. Thus, we can quickly few-shot adapt the models from the intermediate representation for different blocks or even unseen tasks using very few exemplar samples. Experimental results on the two study areas demonstrated the validity of our block-wise analysis in large scenarios and revealed the top few-shot adaption performances of the proposed methods.
    Nonseparable Symplectic Neural Networks. (arXiv:2010.12636v3 [cs.LG] UPDATED)
    Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic priors to describe the inherently coupled evolution of position and momentum, which typically exhibits intricate dynamics. To solve the problem, we propose a novel neural network architecture, Nonseparable Symplectic Neural Networks (NSSNNs), to uncover and embed the symplectic structure of a nonseparable Hamiltonian system from limited observation data. The enabling mechanics of our approach is an augmented symplectic time integrator to decouple the position and momentum energy terms and facilitate their evolution. We demonstrated the efficacy and versatility of our method by predicting a wide range of Hamiltonian systems, both separable and nonseparable, including chaotic vortical flows. We showed the unique computational merits of our approach to yield long-term, accurate, and robust predictions for large-scale Hamiltonian systems by rigorously enforcing symplectomorphism.  ( 2 min )
    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. (arXiv:2111.06784v2 [cs.LG] UPDATED)
    We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.  ( 2 min )
    Towards Secure and Practical Machine Learning via Secret Sharing and Random Permutation. (arXiv:2108.07463v3 [cs.LG] UPDATED)
    With the increasing demands for privacy protection, privacy-preserving machine learning has been drawing much attention in both academia and industry. However, most existing methods have their limitations in practical applications. On the one hand, although most cryptographic methods are provable secure, they bring heavy computation and communication. On the other hand, the security of many relatively efficient private methods (e.g., federated learning and split learning) is being questioned, since they are non-provable secure. Inspired by previous work on privacy-preserving machine learning, we build a privacy-preserving machine learning framework by combining random permutation and arithmetic secret sharing via our compute-after-permutation technique. Since our method reduces the cost for element-wise function computation, it is more efficient than existing cryptographic methods. Moreover, by adopting distance correlation as a metric for privacy leakage, we demonstrate that our method is more secure than previous non-provable secure methods. Overall, our proposal achieves a good balance between security and efficiency. Experimental results show that our method not only is up to 6x faster and reduces up to 85% network traffic compared with state-of-the-art cryptographic methods, but also leaks less privacy during the training process compared with non-provable secure methods.  ( 2 min )
    Learning Representations Robust to Group Shifts and Adversarial Examples. (arXiv:2202.09446v1 [cs.LG])
    Despite the high performance achieved by deep neural networks on various tasks, extensive studies have demonstrated that small tweaks in the input could fail the model predictions. This issue of deep neural networks has led to a number of methods to improve model robustness, including adversarial training and distributionally robust optimization. Though both of these two methods are geared towards learning robust models, they have essentially different motivations: adversarial training attempts to train deep neural networks against perturbations, while distributional robust optimization aims at improving model performance on the most difficult "uncertain distributions". In this work, we propose an algorithm that combines adversarial training and group distribution robust optimization to improve robust representation learning. Experiments on three image benchmark datasets illustrate that the proposed method achieves superior results on robust metrics without sacrificing much of the standard measures.  ( 2 min )
    Signal Processing on Cell Complexes. (arXiv:2110.05614v2 [cs.LG] UPDATED)
    The processing of signals supported on non-Euclidean domains has attracted large interest recently. Thus far, such non-Euclidean domains have been abstracted primarily as graphs with signals supported on the nodes, though the processing of signals on more general structures such as simplicial complexes has also been considered. In this paper, we give an introduction to signal processing on (abstract) regular cell complexes, which provide a unifying framework encompassing graphs, simplicial complexes, cubical complexes and various meshes as special cases. We discuss how appropriate Hodge Laplacians for these cell complexes can be derived. These Hodge Laplacians enable the construction of convolutional filters, which can be employed in linear filtering and non-linear filtering via neural networks defined on cell complexes.
    Limitations of machine learning for building energy prediction: ASHRAE Great Energy Predictor III Kaggle competition error analysis. (arXiv:2106.13475v3 [cs.LG] UPDATED)
    Research is needed to explore the limitations and potential for improvement of machine learning for building energy prediction. With this aim, the ASHRAE Great Energy Predictor III (GEPIII) Kaggle competition was launched in 2019. This effort was the largest building energy meter machine learning competition of its kind, with 4,370 participants who submitted 39,403 predictions. The test data set included two years of hourly whole building readings from 2,380 meters in 1,448 buildings at 16 locations. This paper analyzes the various sources and types of residual model error from an aggregation of the competition's top 50 solutions. This analysis reveals the limitations for machine learning using the standard model inputs of historical meter, weather, and basic building metadata. The errors are classified according to timeframe, behavior, magnitude, and incidence in single buildings or across a campus. The results show machine learning models have errors within a range of acceptability (RMSLE_scaled = 0.3) occur in 4.8% of the test data and are unlikely to be accurately predicted.  ( 2 min )
    Using a one dimensional parabolic model of the full-batch loss to estimate learning rates during training. (arXiv:2108.13880v2 [cs.LG] UPDATED)
    A fundamental challenge in Deep Learning is to find optimal step sizes for stochastic gradient descent automatically. In traditional optimization, line searches are a commonly used method to determine step sizes. One problem in Deep Learning is that finding appropriate step sizes on the full-batch loss is unfeasibly expensive. Therefore, classical line search approaches, designed for losses without inherent noise, are usually not applicable. Recent empirical findings suggest, inter alia, that the full-batch loss behaves locally parabolically in the direction of noisy update step directions. Furthermore, the trend of the optimal update step size changes slowly. By exploiting these and more findings, this work introduces a line-search method that approximates the full-batch loss with a parabola estimated over several mini-batches. Learning rates are derived from such parabolas during training. In the experiments conducted, our approach is on par with SGD with Momentum tuned with a piece-wise constant learning rate schedule and often outperforms other line search approaches for Deep Learning across models, datasets, and batch sizes on validation and test accuracy. In addition, our approach is the first line search approach for Deep Learning that samples a larger batch size over multiple inferences to still work in low-batch scenarios.
    AdaGNN: Graph Neural Networks with Adaptive Frequency Response Filter. (arXiv:2104.12840v3 [cs.LG] UPDATED)
    Graph Neural Networks have recently become a prevailing paradigm for various high-impact graph analytical problems. Existing efforts can be mainly categorized as spectral-based and spatial-based methods. The major challenge for the former is to find an appropriate graph filter to distill discriminative information from input signals for learning. Recently, myriads of explorations are made to achieve better graph filters, e.g., Graph Convolutional Network (GCN), which leverages Chebyshev polynomial truncation to seek an approximation of graph filters and bridge these two families of methods. Nevertheless, it has been shown in recent studies that GCN and its variants are essentially employing fixed low-pass filters to perform information denoising. Thus their learning capability is rather limited and may over-smooth node representations at deeper layers. To tackle these problems, we develop a novel graph neural network framework AdaGNN with a well-designed adaptive frequency response filter. At its core, AdaGNN leverages a simple but elegant trainable filter that spans across multiple layers to capture the varying importance of different frequency components for node representation learning. The inherent differences among different feature channels are also well captured by the filter. As such, it empowers AdaGNN with stronger expressiveness and naturally alleviates the over-smoothing problem. We empirically validate the effectiveness of the proposed framework on various benchmark datasets. Theoretical analysis is also provided to show the superiority of the proposed AdaGNN. The open-source implementation of AdaGNN can be found here: https://github.com/yushundong/AdaGNN.
    Bad priors, their threats to Bayesian optimization and some remedies via prior learning. (arXiv:2109.08215v2 [cs.LG] UPDATED)
    The performance of deep neural networks can be highly sensitive to the choice of a variety of meta-parameters, such as optimizer parameters and model hyperparameters. Tuning these well, however, often requires extensive and costly experimentation. Bayesian optimization (BO) is a principled approach to solve such expensive hyperparameter tuning problems efficiently. Key to the performance of BO is specifying and refining a distribution over functions, which is used to reason about the optima of the underlying function being optimized. In this work, we consider the scenario where we have data from similar functions that allows us to specify a tighter distribution a priori. Specifically, we focus on the common but potentially costly task of tuning optimizer parameters for training neural networks. Building on the meta BO method from Wang et al. (2018), we develop practical improvements that (a) boost its performance by leveraging tuning results on multiple tasks without requiring observations for the same meta-parameter points across all tasks, and (b) retain its regret bound for a special case of our method. As a result, we provide a coherent BO solution for iterative optimization of continuous optimizer parameters. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.  ( 2 min )
    On the Efficiency of Subclass Knowledge Distillation in Classification Tasks. (arXiv:2109.05587v2 [cs.LG] UPDATED)
    This work introduces a novel knowledge distillation framework for classification tasks where information on existing subclasses is available and taken into consideration. In classification tasks with a small number of classes or binary detection (two classes) the amount of information transferred from the teacher to the student network is restricted, thus limiting the utility of knowledge distillation. Performance can be improved by leveraging information about possible subclasses within the available classes in the classification task. To that end, we propose the so-called Subclass Knowledge Distillation (SKD) framework, which is the process of transferring the subclasses' prediction knowledge from a large teacher model into a smaller student one. Through SKD, additional meaningful information which is not in the teacher's class logits but exists in subclasses (e.g., similarities inside classes) will be conveyed to the student and boost its performance. Mathematically, we measure how many extra information bits the teacher can provide for the student via SKD framework. The framework developed is evaluated in clinical application, namely colorectal polyp binary classification. In this application, clinician-provided annotations are used to define subclasses based on the annotation label's variability in a curriculum style of learning. A lightweight, low complexity student trained with the proposed framework achieves an F1-score of 85.05%, an improvement of 2.14% and 1.49% gain over the student that trains without and with conventional knowledge distillation, respectively. These results show that the extra subclasses' knowledge (i.e., 0.4656 label bits per training sample in our experiment) can provide more information about the teacher generalization, and therefore SKD can benefit from using more information to increase the student performance.  ( 2 min )
    Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions. (arXiv:2111.01235v2 [cs.LG] UPDATED)
    People affected by machine learning model decisions may benefit greatly from access to recourses, i.e. suggestions about what features they could change to receive a more favorable decision from the model. Current approaches try to optimize for the cost incurred by users when adopting a recourse, but they assume that all users share the same cost function. This is an unrealistic assumption because users might have diverse preferences about their willingness to change certain features. In this work, we introduce a new method for identifying recourse sets for users which does not assume that users' preferences are known in advance. We propose an objective function, Expected Minimum Cost (EMC), based on two key ideas: (1) when presenting a set of options to a user, there only needs to be one low-cost solution that the user could adopt; (2) when we do not know the user's true cost function, we can approximately optimize for user satisfaction by first sampling plausible cost functions from a distribution, then finding a recourse set that achieves a good cost for these samples. We optimize EMC with a novel discrete optimization algorithm, Cost Optimized Local Search (COLS), which is guaranteed to improve the recourse set quality over iterations. Experimental evaluation on popular real-world datasets with simulated users demonstrates that our method satisfies up to 25.89 percentage points more users compared to strong baseline methods, while, the human evaluation shows that our recourses are preferred more than twice as often as the strongest baseline recourses. Finally, using standard fairness metrics we show that our method can provide more fair solutions across demographic groups than baselines. We provide our source code at: https://github.com/prateeky2806/EMC-COLS-recourse  ( 2 min )
    Pyramid Convolutional RNN for MRI Image Reconstruction. (arXiv:1912.00543v6 [eess.IV] UPDATED)
    Fast and accurate MRI image reconstruction from undersampled data is crucial in clinical practice. Deep learning based reconstruction methods have shown promising advances in recent years. However, recovering fine details from undersampled data is still challenging. In this paper, we introduce a novel deep learning based method, Pyramid Convolutional RNN (PC-RNN), to reconstruct images from multiple scales. Based on the formulation of MRI reconstruction as an inverse problem, we design the PC-RNN model with three convolutional RNN (ConvRNN) modules to iteratively learn the features in multiple scales. Each ConvRNN module reconstructs images at different scales and the reconstructed images are combined by a final CNN module in a pyramid fashion. The multi-scale ConvRNN modules learn a coarse-to-fine image reconstruction. Unlike other common reconstruction methods for parallel imaging, PC-RNN does not employ coil sensitive maps for multi-coil data and directly model the multiple coils as multi-channel inputs. The coil compression technique is applied to standardize data with various coil numbers, leading to more efficient training. We evaluate our model on the fastMRI knee and brain datasets and the results show that the proposed model outperforms other methods and can recover more details. The proposed method is one of the winner solutions in the 2019 fastMRI competition.  ( 2 min )
    Non-asymptotic analysis and inference for an outlyingness induced winsorized mean. (arXiv:2105.02337v2 [stat.ML] UPDATED)
    Robust estimation of a mean vector, a topic regarded as obsolete in the traditional robust statistics community, has recently surged in machine learning literature in the last decade. The latest focus is on the sub-Gaussian performance and computability of the estimators in a non-asymptotic setting. Numerous traditional robust estimators are computationally intractable, which partly contributes to the renewal of the interest in the robust mean estimation. Robust centrality estimators, however, include the trimmed mean and the sample median. The latter has the best robustness but suffers a low-efficiency drawback. Trimmed mean and median of means, %as robust alternatives to the sample mean, and achieving sub-Gaussian performance have been proposed and studied in the literature. This article investigates the robustness of leading sub-Gaussian estimators of mean and reveals that none of them can resist greater than $25\%$ contamination in data and consequently introduces an outlyingness induced winsorized mean which has the best possible robustness (can resist up to $50\%$ contamination without breakdown) meanwhile achieving high efficiency. Furthermore, it has a sub-Gaussian performance for uncontaminated samples and a bounded estimation error for contaminated samples at a given confidence level in a finite sample setting. It can be computed in linear time.
    A Deep Generative Model for Reordering Adjacency Matrices. (arXiv:2110.04971v2 [cs.HC] UPDATED)
    Depending on the node ordering, an adjacency matrix can highlight distinct characteristics of a graph. Deriving a "proper" node ordering is thus a critical step in visualizing a graph as an adjacency matrix. Users often try multiple matrix reorderings using different methods until they find one that meets the analysis goal. However, this trial-and-error approach is laborious and disorganized, which is especially challenging for novices. This paper presents a technique that enables users to effortlessly find a matrix reordering they want. Specifically, we design a generative model that learns a latent space of diverse matrix reorderings of the given graph. We also construct an intuitive user interface from the learned latent space by creating a map of various matrix reorderings. We demonstrate our approach through quantitative and qualitative evaluations of the generated reorderings and learned latent spaces. The results show that our model is capable of learning a latent space of diverse matrix reorderings. Most existing research in this area generally focused on developing algorithms that can compute "better" matrix reorderings for particular circumstances. This paper introduces a fundamentally new approach to matrix visualization of a graph, where a machine learning model learns to generate diverse matrix reorderings of a graph.  ( 2 min )
    Deep Neural Networks and Tabular Data: A Survey. (arXiv:2110.01889v2 [cs.LG] UPDATED)
    Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their application to modeling tabular data (inference or generation) remains highly challenging. This work provides an overview of state of the art deep learning methods for tabular data. We start by categorizing them into three groups: data transformations, specialized architectures, and regularization models. We then provide a comprehensive overview of the main approaches in each group. A discussion of deep learning approaches for generating tabular data is complemented by strategies for explaining deep models on tabular data. Our primary contribution is to address the main research streams and existing methodologies in this area, while highlighting relevant challenges and open research questions. We also provide an empirical comparison of traditional machine learning methods with deep learning approaches on real tabular data sets of different sizes and with different learning objectives. Our results indicate that algorithms based on gradient-boosted tree ensembles still outperform the deep learning models. To the best of our knowledge, this is the first in-depth look at deep learning approaches for tabular data. This work can serve as a valuable starting point and guide for researchers and practitioners interested in deep learning with tabular data.  ( 2 min )
    Modeling Interactions of Autonomous Vehicles and Pedestrians with Deep Multi-Agent Reinforcement Learning for Collision Avoidance. (arXiv:2109.15266v2 [cs.RO] UPDATED)
    Reliable pedestrian crash avoidance mitigation (PCAM) systems are crucial components of safe autonomous vehicles (AVs). The nature of the vehicle-pedestrian interaction where decisions of one agent directly affect the other agent's optimal behavior, and vice versa, is a challenging yet often neglected aspect of such systems. We address this issue by defining a Markov decision process (MDP) for a simulated driving scenario in which an AV driving along an urban street faces a pedestrian trying to cross an unmarked crosswalk. The AV's PCAM decision policy is learned through deep reinforcement learning (DRL). Since modeling pedestrians realistically is challenging, we compare two levels of intelligent pedestrian behavior. While the baseline model follows a predefined strategy, our advanced pedestrian model is defined as a second DRL agent. This model captures continuous learning and the uncertainty inherent in human behavior, making the vehicle-pedestrian interaction a deep multi-agent reinforcement learning (DMARL) problem. We benchmark the developed PCAM systems according to the agents' collision rate and the resulting traffic flow efficiency with a focus on the influence of observation uncertainty or noise on the decision-making of the agents. The results show that the AV is able to completely mitigate collisions under the majority of the investigated conditions, and that the DRL pedestrian model indeed learns a more intelligent crossing behavior.  ( 2 min )
    Mean-Field Multi-Agent Reinforcement Learning: A Decentralized Network Approach. (arXiv:2108.02731v2 [cs.LG] UPDATED)
    One of the challenges for multi-agent reinforcement learning (MARL) is designing efficient learning algorithms for a large system in which each agent has only limited or partial information of the entire system. While exciting progress has been made to analyze decentralized MARL with the network of agents for social networks and team video games, little is known theoretically for decentralized MARL with the network of states for modeling self-driving vehicles, ride-sharing, and data and traffic routing. This paper proposes a framework of localized training and decentralized execution to study MARL with network of states. Localized training means that agents only need to collect local information in their neighboring states during the training phase; decentralized execution implies that agents can execute afterwards the learned decentralized policies, which depend only on agents' current states. The theoretical analysis consists of three key components: the first is the reformulation of the MARL system as a networked Markov decision process with teams of agents, enabling updating the associated team Q-function in a localized fashion; the second is the Bellman equation for the value function and the appropriate Q-function on the probability measure space; and the third is the exponential decay property of the team Q-function, facilitating its approximation with efficient sample efficiency and controllable error. The theoretical analysis paves the way for a new algorithm LTDE-Neural-AC, where the actor-critic approach with over-parameterized neural networks is proposed. The convergence and sample complexity is established and shown to be scalable with respect to the sizes of both agents and states. To the best of our knowledge, this is the first neural network based MARL algorithm with network structure and provably convergence guarantee.  ( 2 min )
    Maximum n-times Coverage for Vaccine Design. (arXiv:2101.10902v4 [q-bio.QM] UPDATED)
    We introduce the maximum $n$-times coverage problem that selects $k$ overlays to maximize the summed coverage of weighted elements, where each element must be covered at least $n$ times. We also define the min-cost $n$-times coverage problem where the objective is to select the minimum set of overlays such that the sum of the weights of elements that are covered at least $n$ times is at least $\tau$. Maximum $n$-times coverage is a generalization of the multi-set multi-cover problem, is NP-complete, and is not submodular. We introduce two new practical solutions for $n$-times coverage based on integer linear programming and sequential greedy optimization. We show that maximum $n$-times coverage is a natural way to frame peptide vaccine design, and find that it produces a pan-strain COVID-19 vaccine design that is superior to 29 other published designs in predicted population coverage and the expected number of peptides displayed by each individual's HLA molecules.
    Dissecting graph measure performance for node clustering in LFR parameter space. (arXiv:2202.09827v1 [cs.SI])
    Graph measures that express closeness or distance between nodes can be employed for graph nodes clustering using metric clustering algorithms. There are numerous measures applicable to this task, and which one performs better is an open question. We study the performance of 25 graph measures on generated graphs with different parameters. While usually measure comparisons are limited to general measure ranking on a particular dataset, we aim to explore the performance of various measures depending on graph features. Using an LFR graph generator, we create a dataset of 11780 graphs covering the whole LFR parameter space. For each graph, we assess the quality of clustering with k-means algorithm for each considered measure. Based on this, we determine the best measure for each area of the parameter space. We find that the parameter space consists of distinct zones where one particular measure is the best. We analyze the geometry of the resulting zones and describe it with simple criteria. Given particular graph parameters, this allows us to recommend a particular measure to use for clustering.  ( 2 min )
    Towards Enabling Dynamic Convolution Neural Network Inference for Edge Intelligence. (arXiv:2202.09461v1 [cs.LG])
    Deep learning applications have achieved great success in numerous real-world applications. Deep learning models, especially Convolution Neural Networks (CNN) are often prototyped using FPGA because it offers high power efficiency and reconfigurability. The deployment of CNNs on FPGAs follows a design cycle that requires saving of model parameters in the on-chip memory during High-level synthesis (HLS). Recent advances in edge intelligence require CNN inference on edge network to increase throughput and reduce latency. To provide flexibility, dynamic parameter allocation to different mobile devices is required to implement either a predefined or defined on-the-fly CNN architecture. In this study, we present novel methodologies for dynamically streaming the model parameters at run-time to implement a traditional CNN architecture. We further propose a library-based approach to design scalable and dynamic distributed CNN inference on the fly leveraging partial-reconfiguration techniques, which is particularly suitable for resource-constrained edge devices. The proposed techniques are implemented on the Xilinx PYNQ-Z2 board to prove the concept by utilizing the LeNet-5 CNN model. The results show that the proposed methodologies are effective, with classification accuracy rates of 92%, 86%, and 94% respectively
    Recent Advances on Neural Network Pruning at Initialization. (arXiv:2103.06460v2 [cs.LG] UPDATED)
    Neural network pruning typically removes connections or neurons from a pretrained converged model; while a new pruning paradigm, pruning at initialization (PaI), attempts to prune a randomly initialized network. This paper offers the first survey concentrated on this emerging pruning fashion. We first introduce a generic formulation of neural network pruning, followed by the major classic pruning topics. Then, as the main body of this paper, a thorough and structured literature review of PaI methods is presented, consisting of two major tracks (sparse training and sparse selection). Finally, we summarize the surge of PaI compared to traditional pruning and discuss the open problems. Apart from the dedicated paper review, this paper also offers a code base for easy sanity-checking and benchmarking of different PaI methods.
    Predictive Coding: Towards a Future of Deep Learning beyond Backpropagation?. (arXiv:2202.09467v1 [cs.NE])
    The backpropagation of error algorithm used to train deep neural networks has been fundamental to the successes of deep learning. However, it requires sequential backward updates and non-local computations, which make it challenging to parallelize at scale and is unlike how learning works in the brain. Neuroscience-inspired learning algorithms, however, such as \emph{predictive coding}, which utilize local learning, have the potential to overcome these limitations and advance beyond current deep learning technologies. While predictive coding originated in theoretical neuroscience as a model of information processing in the cortex, recent work has developed the idea into a general-purpose algorithm able to train neural networks using only local computations. In this survey, we review works that have contributed to this perspective and demonstrate the close theoretical connections between predictive coding and backpropagation, as well as works that highlight the multiple advantages of using predictive coding models over backpropagation-trained neural networks. Specifically, we show the substantially greater flexibility of predictive coding networks against equivalent deep neural networks, which can function as classifiers, generators, and associative memories simultaneously, and can be defined on arbitrary graph topologies. Finally, we review direct benchmarks of predictive coding networks on machine learning classification tasks, as well as its close connections to control theory and applications in robotics.
    Imbalanced Malware Images Classification: a CNN based Approach. (arXiv:1708.08042v2 [cs.CV] UPDATED)
    Deep convolutional neural networks (CNNs) can be applied to malware binary detection via image classification. The performance, however, is degraded due to the imbalance of malware families (classes). To mitigate this issue, we propose a simple yet effective weighted softmax loss which can be employed as the final layer of deep CNNs. The original softmax loss is weighted, and the weight value can be determined according to class size. A scaling parameter is also included in computing the weight. Proper selection of this parameter is studied and an empirical option is suggested. The weighted loss aims at alleviating the impact of data imbalance in an end-to-end learning fashion. To validate the efficacy, we deploy the proposed weighted loss in a pre-trained deep CNN model and fine-tune it to achieve promising results on malware images classification. Extensive experiments also demonstrate that the new loss function can well fit other typical CNNs, yielding an improved classification performance.
    PETCI: A Parallel English Translation Dataset of Chinese Idioms. (arXiv:2202.09509v1 [cs.CL])
    Idioms are an important language phenomenon in Chinese, but idiom translation is notoriously hard. Current machine translation models perform poorly on idiom translation, while idioms are sparse in many translation datasets. We present PETCI, a parallel English translation dataset of Chinese idioms, aiming to improve idiom translation by both human and machine. The dataset is built by leveraging human and machine effort. Baseline generation models show unsatisfactory abilities to improve translation, but structure-aware classification models show good performance on distinguishing good translations. Furthermore, the size of PETCI can be easily increased without expertise. Overall, PETCI can be helpful to language learners and machine translation systems.
    Automated Attack Synthesis by Extracting Finite State Machines from Protocol Specification Documents. (arXiv:2202.09470v1 [cs.CR])
    Automated attack discovery techniques, such as attacker synthesis or model-based fuzzing, provide powerful ways to ensure network protocols operate correctly and securely. Such techniques, in general, require a formal representation of the protocol, often in the form of a finite state machine (FSM). Unfortunately, many protocols are only described in English prose, and implementing even a simple network protocol as an FSM is time-consuming and prone to subtle logical errors. Automatically extracting protocol FSMs from documentation can significantly contribute to increased use of these techniques and result in more robust and secure protocol implementations. In this work we focus on attacker synthesis as a representative technique for protocol security, and on RFCs as a representative format for protocol prose description. Unlike other works that rely on rule-based approaches or use off-the-shelf NLP tools directly, we suggest a data-driven approach for extracting FSMs from RFC documents. Specifically, we use a hybrid approach consisting of three key steps: (1) large-scale word-representation learning for technical language, (2) focused zero-shot learning for mapping protocol text to a protocol-independent information language, and (3) rule-based mapping from protocol-independent information to a specific protocol FSM. We show the generalizability of our FSM extraction by using the RFCs for six different protocols: BGPv4, DCCP, LTP, PPTP, SCTP and TCP. We demonstrate how automated extraction of an FSM from an RFC can be applied to the synthesis of attacks, with TCP and DCCP as case-studies. Our approach shows that it is possible to automate attacker synthesis against protocols by using textual specifications such as RFCs.
    Improving Attribution Methods by Learning Submodular Functions. (arXiv:2104.09073v3 [cs.LG] UPDATED)
    This work explores the novel idea of learning a submodular scoring function to improve the specificity/selectivity of existing feature attribution methods. Submodular scores are natural for attribution as they are known to accurately model the principle of diminishing returns. A new formulation for learning a deep submodular set function that is consistent with the real-valued attribution maps obtained by existing attribution methods is proposed. The final attribution value of a feature is then defined as the marginal gain in the induced submodular score of the feature in the context of other highly attributed features, thus decreasing the attribution of redundant yet discriminatory features. Experiments on multiple datasets illustrate that the proposed attribution method achieves higher specificity along with good discriminative power. The implementation of our method is publicly available at https://github.com/Piyushi-0/SEA-NN.
    Explicit Group Sparse Projection with Applications to Deep Learning and NMF. (arXiv:1912.03896v3 [cs.LG] UPDATED)
    We design a new sparse projection method for a set of vectors that guarantees a desired average sparsity level measured leveraging the popular Hoyer measure (an affine function of the ratio of the $\ell_1$ and $\ell_2$ norms). Existing approaches either project each vector individually or require the use of a regularization parameter which implicitly maps to the average $\ell_0$-measure of sparsity. Instead, in our approach we set the sparsity level for the whole set explicitly and simultaneously project a group of vectors with the sparsity level of each vector tuned automatically. We show that the computational complexity of our projection operator is linear in the size of the problem. Additionally, we propose a generalization of this projection by replacing the $\ell_1$ norm by its weighted version. We showcase the efficacy of our approach in both supervised and unsupervised learning tasks on image datasets including CIFAR10 and ImageNet. In deep neural network pruning, the sparse models produced by our method on ResNet50 have significantly higher accuracies at corresponding sparsity values compared to existing competitors. In nonnegative matrix factorization, our approach yields competitive reconstruction errors against state-of-the-art algorithms.
    Differentially Private Federated Learning via Inexact ADMM with Multiple Local Updates. (arXiv:2202.09409v1 [cs.LG])
    Differential privacy (DP) techniques can be applied to the federated learning model to statistically guarantee data privacy against inference attacks to communication among the learning agents. While ensuring strong data privacy, however, the DP techniques hinder achieving a greater learning performance. In this paper we develop a DP inexact alternating direction method of multipliers algorithm with multiple local updates for federated learning, where a sequence of convex subproblems is solved with the objective perturbation by random noises generated from a Laplace distribution. We show that our algorithm provides $\bar{\epsilon}$-DP for every iteration, where $\bar{\epsilon}$ is a privacy budget controlled by the user. We also present convergence analyses of the proposed algorithm. Using MNIST and FEMNIST datasets for the image classification, we demonstrate that our algorithm reduces the testing error by at most $31\%$ compared with the existing DP algorithm, while achieving the same level of data privacy. The numerical experiment also shows that our algorithm converges faster than the existing algorithm.
    Learning to Attack with Fewer Pixels: A Probabilistic Post-hoc Framework for Refining Arbitrary Dense Adversarial Attacks. (arXiv:2010.06131v2 [cs.CV] UPDATED)
    Deep neural network image classifiers are reported to be susceptible to adversarial evasion attacks, which use carefully crafted images created to mislead a classifier. Many adversarial attacks belong to the category of dense attacks, which generate adversarial examples by perturbing all the pixels of a natural image. To generate sparse perturbations, sparse attacks have been recently developed, which are usually independent attacks derived by modifying a dense attack's algorithm with sparsity regularisations, resulting in reduced attack efficiency. In this paper, we aim to tackle this task from a different perspective. We select the most effective perturbations from the ones generated from a dense attack, based on the fact we find that a considerable amount of the perturbations on an image generated by dense attacks may contribute little to attacking a classifier. Accordingly, we propose a probabilistic post-hoc framework that refines given dense attacks by significantly reducing the number of perturbed pixels but keeping their attack power, trained with mutual information maximisation. Given an arbitrary dense attack, the proposed model enjoys appealing compatibility for making its adversarial images more realistic and less detectable with fewer perturbations. Moreover, our framework performs adversarial attacks much faster than existing sparse attacks.
    Reciprocity in Machine Learning. (arXiv:2202.09480v1 [cs.LG])
    Machine learning is pervasive. It powers recommender systems such as Spotify, Instagram and YouTube, and health-care systems via models that predict sleep patterns, or the risk of disease. Individuals contribute data to these models and benefit from them. Are these contributions (outflows of influence) and benefits (inflows of influence) reciprocal? We propose measures of outflows, inflows and reciprocity building on previously proposed measures of training data influence. Our initial theoretical and empirical results indicate that under certain distributional assumptions, some classes of models are approximately reciprocal. We conclude with several open directions.
    Accurate Prediction and Uncertainty Estimation using Decoupled Prediction Interval Networks. (arXiv:2202.09664v1 [cs.LG])
    We propose a network architecture capable of reliably estimating uncertainty of regression based predictions without sacrificing accuracy. The current state-of-the-art uncertainty algorithms either fall short of achieving prediction accuracy comparable to the mean square error optimization or underestimate the variance of network predictions. We propose a decoupled network architecture that is capable of accomplishing both at the same time. We achieve this by breaking down the learning of prediction and prediction interval (PI) estimations into a two-stage training process. We use a custom loss function for learning a PI range around optimized mean estimation with a desired coverage of a proportion of the target labels within the PI range. We compare the proposed method with current state-of-the-art uncertainty quantification algorithms on synthetic datasets and UCI benchmarks, reducing the error in the predictions by 23 to 34% while maintaining 95% Prediction Interval Coverage Probability (PICP) for 7 out of 9 UCI benchmark datasets. We also examine the quality of our predictive uncertainty by evaluating on Active Learning and demonstrating 17 to 36% error reduction on UCI benchmarks.
    Selective Credit Assignment. (arXiv:2202.09699v1 [cs.LG])
    Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view on temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward credit distribution in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, as well as add new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.
    Missing Data Infill with Automunge. (arXiv:2202.09484v1 [cs.LG])
    Missing data is a fundamental obstacle in the practice of data science. This paper surveys a few conventions for imputation as available in the Automunge open source python library platform for tabular data preprocessing, including "ML infill" in which auto ML models are trained for target features from partitioned extracts of a training set. A series of validation experiments were performed to benchmark imputation scenarios towards downstream model performance, in which it was found for the given benchmark sets that in many cases ML infill outperformed for both numeric and categoric target features, and was otherwise at minimum within noise distributions of the other imputation scenarios. Evidence also suggested supplementing ML infill with the addition of support columns with boolean integer markers signaling presence of infill was usually beneficial to downstream model performance. We consider these results sufficient to recommend defaulting to ML infill for tabular learning, and further recommend supplementing imputations with support columns signaling presence of infill, each as can be prepared with push-button operation in the Automunge library. Our contributions include an auto ML derived missing data imputation library for tabular learning in the python ecosystem, fully integrated into a preprocessing platform with an extensive library of feature transformations, with a novel production friendly implementation that bases imputation models on a designated train set for consistent basis towards additional data.
    Cross-Task Knowledge Distillation in Multi-Task Recommendation. (arXiv:2202.09852v1 [cs.IR])
    Multi-task learning has been widely used in real-world recommenders to predict different types of user feedback. Most prior works focus on designing network architectures for bottom layers as a means to share the knowledge about input features representations. However, since they adopt task-specific binary labels as supervised signals for training, the knowledge about how to accurately rank items is not fully shared across tasks. In this paper, we aim to enhance knowledge transfer for multi-task personalized recommendat optimization objectives. We propose a Cross-Task Knowledge Distillation (CrossDistil) framework in recommendation, which consists of three procedures. 1) Task Augmentation: We introduce auxiliary tasks with quadruplet loss functions to capture cross-task fine-grained ranking information, which could avoid task conflicts by preserving the cross-task consistent knowledge; 2) Knowledge Distillation: We design a knowledge distillation approach based on augmented tasks for sharing ranking knowledge, where tasks' predictions are aligned with a calibration process; 3) Model Training: Teacher and student models are trained in an end-to-end manner, with a novel error correction mechanism to speed up model training and improve knowledge quality. Comprehensive experiments on a public dataset and our production dataset are carried out to verify the effectiveness of CrossDistil as well as the necessity of its key components.
    Distributed Out-of-Memory NMF of Dense and Sparse Data on CPU/GPU Architectures with Automatic Model Selection for Exascale Data. (arXiv:2202.09518v1 [cs.DC])
    The need for efficient and scalable big-data analytics methods is more essential than ever due to the exploding size and complexity of globally emerging datasets. Nonnegative Matrix Factorization (NMF) is a well-known explainable unsupervised learning method for dimensionality reduction, latent feature extraction, blind source separation, data mining, and machine learning. In this paper, we introduce a new distributed out-of-memory NMF method, named pyDNMF-GPU, designed for modern heterogeneous CPU/GPU architectures that is capable of factoring exascale-sized dense and sparse matrices. Our method reduces the latency associated with local data transfer between the GPU and host using CUDA streams, and reduces the latency associated with collective communications (both intra-node and inter-node) via NCCL primitives. In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores, resulting in good scalability. We set new benchmarks for the size of the data being analyzed: in experiments, we measure up to 76x improvement on a single GPU over running on a single 18 core CPU and we show good weak scaling on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs, when decomposing a dense 340 Terabyte-size matrix and a 11 Exabyte-size sparse matrix of density 10e-6. Finally, we integrate our method with an automatic model selection method. With this integration, we introduce a new tool that is capable of analyzing, compressing, and discovering explainable latent structures in extremely large sparse and dense data.
    Post-hoc Calibration of Neural Networks by g-Layers. (arXiv:2006.12807v2 [cs.LG] UPDATED)
    Calibration of neural networks is a critical aspect to consider when incorporating machine learning models in real-world decision-making systems where the confidence of decisions are equally important as the decisions themselves. In recent years, there is a surge of research on neural network calibration and the majority of the works can be categorized into post-hoc calibration methods, defined as methods that learn an additional function to calibrate an already trained base network. In this work, we intend to understand the post-hoc calibration methods from a theoretical point of view. Especially, it is known that minimizing Negative Log-Likelihood (NLL) will lead to a calibrated network on the training set if the global optimum is attained (Bishop, 1994). Nevertheless, it is not clear learning an additional function in a post-hoc manner would lead to calibration in the theoretical sense. To this end, we prove that even though the base network ($f$) does not lead to the global optimum of NLL, by adding additional layers ($g$) and minimizing NLL by optimizing the parameters of $g$ one can obtain a calibrated network $g \circ f$. This not only provides a less stringent condition to obtain a calibrated network but also provides a theoretical justification of post-hoc calibration methods. Our experiments on various image classification benchmarks confirm the theory.
    Overparametrization improves robustness against adversarial attacks: A replication study. (arXiv:2202.09735v1 [cs.LG])
    Overparametrization has become a de facto standard in machine learning. Despite numerous efforts, our understanding of how and where overparametrization helps model accuracy and robustness is still limited. To this end, here we conduct an empirical investigation to systemically study and replicate previous findings in this area, in particular the study by Madry et al. Together with this study, our findings support the "universal law of robustness" recently proposed by Bubeck et al. We argue that while critical for robust perception, overparametrization may not be enough to achieve full robustness and smarter architectures e.g. the ones implemented by the human visual cortex) seem inevitable.
    Generative Adversarial Networks for Data Generation in Structural Health Monitoring. (arXiv:2112.08196v1 [cs.LG] CROSS LISTED)
    Structural Health Monitoring (SHM) has been continuously benefiting from the advancements in the field of data science. Various types of Artificial Intelligence (AI) methods have been utilized for the assessment and evaluation of civil structures. In AI, Machine Learning (ML) and Deep Learning (DL) algorithms require plenty of datasets to train; particularly, the more data DL models are trained with, the better output it yields. Yet, in SHM applications, collecting data from civil structures through sensors is expensive and obtaining useful data (damage associated data) is challenging. In this paper, 1-D Wasserstein loss Deep Convolutional Generative Adversarial Networks using Gradient Penalty (1-D WDCGAN-GP) is utilized to generate damage associated vibration datasets that are similar to the input. For the purpose of vibration-based damage diagnostics, a 1-D Deep Convolutional Neural Network (1-D DCNN) is built, trained, and tested on both real and generated datasets. The classification results from the 1-D DCNN on both datasets resulted to be very similar to each other. The presented work in this paper shows that for the cases of insufficient data in DL or ML-based damage diagnostics, 1-D WDCGAN-GP can successfully generate data for the model to be trained on. Keywords: 1-D Generative Adversarial Networks (GAN), Deep Convolutional Generative Adversarial Networks (DCGAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP), 1-D Convolutional Neural Networks (CNN), Structural Health Monitoring (SHM), Structural Damage Diagnostics, Structural Damage Detection
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v2 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework is especially targeted to the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where it is typically intractable to perform exact inference of latent variables. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space is often prone to lack physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modelling framework for unsupervised learning and identification for nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential to generalization and prediction capabilities.
    Stochastic sparse adversarial attacks. (arXiv:2011.12423v4 [cs.LG] UPDATED)
    This paper introduces stochastic sparse adversarial attacks (SSAA), standing as simple, fast and purely noise-based targeted and untargeted attacks of neural network classifiers (NNC). SSAA offer new examples of sparse (or $L_0$) attacks for which only few methods have been proposed previously. These attacks are devised by exploiting a small-time expansion idea widely used for Markov processes. Experiments on small and large datasets (CIFAR-10 and ImageNet) illustrate several advantages of SSAA in comparison with the-state-of-the-art methods. For instance, in the untargeted case, our method called Voting Folded Gaussian Attack (VFGA) scales efficiently to ImageNet and achieves a significantly lower $L_0$ score than SparseFool (up to $\frac{2}{5}$) while being faster. Moreover, VFGA achieves better $L_0$ scores on ImageNet than Sparse-RS when both attacks are fully successful on a large number of samples.
    CAMul: Calibrated and Accurate Multi-view Time-Series Forecasting. (arXiv:2109.07438v2 [cs.LG] UPDATED)
    Probabilistic time-series forecasting enables reliable decision making across many domains. Most forecasting problems have diverse sources of data containing multiple modalities and structures. Leveraging information as well as uncertainty from these data sources for well-calibrated and accurate forecasts is an important challenging problem. Most previous work on multi-modal learning and forecasting simply aggregate intermediate representations from each data view by simple methods of summation or concatenation and do not explicitly model uncertainty for each data-view. We propose a general probabilistic multi-view forecasting framework CAMul, that can learn representations and uncertainty from diverse data sources. It integrates the knowledge and uncertainty from each data view in a dynamic context-specific manner assigning more importance to useful views to model a well-calibrated forecast distribution. We use CAMul for multiple domains with varied sources and modalities and show that CAMul outperforms other state-of-art probabilistic forecasting models by over 25\% in accuracy and calibration.
    Disentangling Autoencoders (DAE). (arXiv:2202.09926v1 [cs.LG])
    Noting the importance of factorizing or disentangling the latent space, we propose a novel framework for autoencoders based on the principles of symmetry transformations in group-theory, which is a non-probabilistic disentangling autoencoder model. To the best of our knowledge, this is the first model that is aiming to achieve disentanglement based on autoencoders without regularizers. The proposed model is compared to seven state-of-the-art generative models based on autoencoders and evaluated based on reconstruction loss and five metrics quantifying disentanglement losses. The experiment results show that the proposed model can have better disentanglement when variances of each features are different. We believe that this model leads a new field for disentanglement learning based on autoencoders without regularizers.
    An Overview and Experimental Study of Learning-based Optimization Algorithms for Vehicle Routing Problem. (arXiv:2107.07076v2 [cs.LG] UPDATED)
    Vehicle routing problem (VRP) is a typical discrete combinatorial optimization problem, and many models and algorithms have been proposed to solve the VRP and its variants. Although existing approaches have contributed a lot to the development of this field, these approaches either are limited in problem size or need manual intervening in choosing parameters. To solve these difficulties, many studies have considered the learning-based optimization (LBO) algorithms to solve the VRP. This paper reviews recent advances in this field and divides relevant approaches into end-to-end approaches and step-by-step approaches. We performed a statistical analysis of the reviewed articles from various aspects and designed three experiments to evaluate the performance of four representative LBO algorithms. Finally, we conclude the applicable types of problems for different LBO algorithms and suggest directions in which researchers can improve LBO algorithms.
    Attacks, Defenses, And Tools: A Framework To Facilitate Robust AI/ML Systems. (arXiv:2202.09465v1 [cs.CR])
    Software systems are increasingly relying on Artificial Intelligence (AI) and Machine Learning (ML) components. The emerging popularity of AI techniques in various application domains attracts malicious actors and adversaries. Therefore, the developers of AI-enabled software systems need to take into account various novel cyber-attacks and vulnerabilities that these systems may be susceptible to. This paper presents a framework to characterize attacks and weaknesses associated with AI-enabled systems and provide mitigation techniques and defense strategies. This framework aims to support software designers in taking proactive measures in developing AI-enabled software, understanding the attack surface of such systems, and developing products that are resilient to various emerging attacks associated with ML. The developed framework covers a broad spectrum of attacks, mitigation techniques, and defensive and offensive tools. In this paper, we demonstrate the framework architecture and its major components, describe their attributes, and discuss the long-term goals of this research.
    Deconstructing Distributions: A Pointwise Framework of Learning. (arXiv:2202.09931v1 [cs.LG])
    In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021)
    Enhancing Affective Representations of Music-Induced EEG through Multimodal Supervision and latent Domain Adaptation. (arXiv:2202.09750v1 [cs.SD])
    The study of Music Cognition and neural responses to music has been invaluable in understanding human emotions. Brain signals, though, manifest a highly complex structure that makes processing and retrieving meaningful features challenging, particularly of abstract constructs like affect. Moreover, the performance of learning models is undermined by the limited amount of available neuronal data and their severe inter-subject variability. In this paper we extract efficient, personalized affective representations from EEG signals during music listening. To this end, we employ music signals as a supervisory modality to EEG, aiming to project their semantic correspondence onto a common representation space. We utilize a bi-modal framework by combining an LSTM-based attention model to process EEG and a pre-trained model for music tagging, along with a reverse domain discriminator to align the distributions of the two modalities, further constraining the learning process with emotion tags. The resulting framework can be utilized for emotion recognition both directly, by performing supervised predictions from either modality, and indirectly, by providing relevant music samples to EEG input queries. The experimental findings show the potential of enhancing neuronal data through stimulus information for recognition purposes and yield insights into the distribution and temporal variance of music-induced affective features.
    MSPM: A Modularized and Scalable Multi-Agent Reinforcement Learning-based System for Financial Portfolio Management. (arXiv:2102.03502v4 [q-fin.PM] UPDATED)
    Financial portfolio management (PM) is one of the most applicable problems in reinforcement learning (RL) owing to its sequential decision-making nature. However, existing RL-based approaches rarely focus on scalability or reusability to adapt to the ever-changing markets. These approaches are rigid and unscalable to accommodate the varying number of assets of portfolios and increasing need for heterogeneous data. Also, RL agents in the existing systems are ad-hoc trained and hardly reusable for different portfolios. To confront the above problems, a modular design is desired for the systems to be compatible with reusable asset-dedicated agents. In this paper, we propose a multi-agent RL-based system for PM (MSPM). MSPM involves two types of asynchronously-updated modules: Evolving Agent Module (EAM) and Strategic Agent Module (SAM). An EAM is an information-generating module with a DQN agent, and it receives heterogeneous data and generates signal-comprised information for a particular asset. An SAM is a decision-making module with a PPO agent for portfolio optimization, and it connects to EAMs to reallocate the assets in a portfolio. Trained EAMs can be connected to any SAM at will. With its modularized architecture, the multi-step condensation of volatile market information, and the reusable design of EAM, MSPM simultaneously addresses the two challenges in RL-based PM: scalability and reusability. Experiments on 8-year U.S. stock market data prove the effectiveness of MSPM in profit accumulation by its outperformance over five baselines in terms of accumulated rate of return (ARR), daily rate of return, and Sortino ratio. MSPM improves ARR by at least 186.5% compared to CRP, a widely-used PM strategy. To validate the indispensability of EAM, we back-test and compare MSPMs on four portfolios. EAM-enabled MSPMs improve ARR by at least 1341.8% compared to EAM-disabled MSPMs.
    Clustering by the Probability Distributions from Extreme Value Theory. (arXiv:2202.09784v1 [cs.LG])
    Clustering is an essential task to unsupervised learning. It tries to automatically separate instances into coherent subsets. As one of the most well-known clustering algorithms, k-means assigns sample points at the boundary to a unique cluster, while it does not utilize the information of sample distribution or density. Comparably, it would potentially be more beneficial to consider the probability of each sample in a possible cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our novel clustering algorithm thus models the distributions of distances to centroids over a threshold by Generalized Pareto Distribution (GPD) in Extreme Value Theory (EVT). Notably, we propose the concept of centroid margin distance, use GPD to establish a probability model for each cluster, and perform a clustering algorithm based on the covering probability function derived from GPD. Such a GPD k-means thus enables the clustering algorithm from the probabilistic perspective. Correspondingly, we also introduce a naive baseline, dubbed as Generalized Extreme Value (GEV) k-means. GEV fits the distribution of the block maxima. In contrast, the GPD fits the distribution of distance to the centroid exceeding a sufficiently large threshold, leading to a more stable performance of GPD k-means. Notably, GEV k-means can also estimate cluster structure and thus perform reasonably well over classical k-means. Thus, extensive experiments on synthetic datasets and real datasets demonstrate that GPD k-means outperforms competitors. The github codes are released in https://github.com/sixiaozheng/EVT-K-means.
    Learning to Detect Slip with Barometric Tactile Sensors and a Temporal Convolutional Neural Network. (arXiv:2202.09549v1 [cs.RO])
    The ability to perceive object slip via tactile feedback enables humans to accomplish complex manipulation tasks including maintaining a stable grasp. Despite the utility of tactile information for many applications, tactile sensors have yet to be widely deployed in industrial robotics settings; part of the challenge lies in identifying slip and other events from the tactile data stream. In this paper, we present a learning-based method to detect slip using barometric tactile sensors. These sensors have many desirable properties including high durability and reliability, and are built from inexpensive, off-the-shelf components. We train a temporal convolution neural network to detect slip, achieving high detection accuracies while displaying robustness to the speed and direction of the slip motion. Further, we test our detector on two manipulation tasks involving a variety of common objects and demonstrate successful generalization to real-world scenarios not seen during training. We argue that barometric tactile sensing technology, combined with data-driven learning, is suitable for many manipulation tasks such as slip compensation.
    Dual Optimization for Kolmogorov Model Learning Using Enhanced Gradient Descent. (arXiv:2107.05011v2 [cs.LG] UPDATED)
    Data representation techniques have made a substantial contribution to advancing data processing and machine learning (ML). Improving predictive power was the focus of previous representation techniques, which unfortunately perform rather poorly on the interpretability in terms of extracting underlying insights of the data. Recently, the Kolmogorov model (KM) was studied, which is an interpretable and predictable representation approach to learning the underlying probabilistic structure of a set of random variables. The existing KM learning algorithms using semi-definite relaxation with randomization (SDRwR) or discrete monotonic optimization (DMO) have, however, limited utility to big data applications because they do not scale well computationally. In this paper, we propose a computationally scalable KM learning algorithm, based on the regularized dual optimization combined with enhanced gradient descent (GD) method. To make our method more scalable to large-dimensional problems, we propose two acceleration schemes, namely, the eigenvalue decomposition (EVD) elimination strategy and an approximate EVD algorithm. Furthermore, a thresholding technique by exploiting the error bound analysis and leveraging the normalized Minkowski $\ell_1$-norm, is provided for the selection of the number of iterations of the approximate EVD algorithm. When applied to big data applications, it is demonstrated that the proposed method can achieve compatible training/prediction performance with significantly reduced computational complexity; roughly two orders of magnitude improvement in terms of the time overhead, compared to the existing KM learning algorithms. Furthermore, it is shown that the accuracy of logical relation mining for interpretability by using the proposed KM learning algorithm exceeds $80\%$.  ( 2 min )
    Improving the Level of Autism Discrimination through GraphRNN Link Prediction. (arXiv:2202.09538v1 [cs.LG])
    Dataset is the key of deep learning in Autism disease research. However, due to the few quantity and heterogeneity of samples in current dataset, for example ABIDE (Autism Brain Imaging Data Exchange), the recognition research is not effective enough. Previous studies mostly focused on optimizing feature selection methods and data reinforcement to improve accuracy. This paper is based on the latter technique, which learns the edge distribution of real brain network through GraphRNN, and generates the synthetic data which has incentive effect on the discriminant model. The experimental results show that the combination of original and synthetic data greatly improves the discrimination of the neural network. For instance, the most significant effect is the 50-layer ResNet, and the best generation model is GraphRNN, which improves the accuracy by 32.51% compared with the model reference experiment without generation data reinforcement. Because the generated data comes from the learned edge connection distribution of Autism patients and typical controls functional connectivity, but it has better effect than the original data, which has constructive significance for further understanding of disease mechanism and development.  ( 2 min )
    Mining Robust Default Configurations for Resource-constrained AutoML. (arXiv:2202.09927v1 [cs.LG])
    Automatic machine learning (AutoML) is a key enabler of the mass deployment of the next generation of machine learning systems. A key desideratum for future ML systems is the automatic selection of models and hyperparameters. We present a novel method of selecting performant configurations for a given task by performing offline autoML and mining over a diverse set of tasks. By mining the training tasks, we can select a compact portfolio of configurations that perform well over a wide variety of tasks, as well as learn a strategy to select portfolio configurations for yet-unseen tasks. The algorithm runs in a zero-shot manner, that is without training any models online except the chosen one. In a compute- or time-constrained setting, this virtually instant selection is highly performant. Further, we show that our approach is effective for warm-starting existing autoML platforms. In both settings, we demonstrate an improvement on the state-of-the-art by testing over 62 classification and regression datasets. We also demonstrate the utility of recommending data-dependent default configurations that outperform widely used hand-crafted defaults.  ( 2 min )
    Zeroth-Order Methods for Convex-Concave Minmax Problems: Applications to Decision-Dependent Risk Minimization. (arXiv:2106.09082v2 [math.OC] UPDATED)
    Min-max optimization is emerging as a key framework for analyzing problems of robustness to strategically and adversarially generated data. We propose a random reshuffling-based gradient free Optimistic Gradient Descent-Ascent algorithm for solving convex-concave min-max problems with finite sum structure. We prove that the algorithm enjoys the same convergence rate as that of zeroth-order algorithms for convex minimization problems. We further specialize the algorithm to solve distributionally robust, decision-dependent learning problems, where gradient information is not readily available. Through illustrative simulations, we observe that our proposed approach learns models that are simultaneously robust against adversarial distribution shifts and strategic decisions from the data sources, and outperforms existing methods from the strategic classification literature.  ( 2 min )
    Signal Processing on Higher-Order Networks: Livin' on the Edge ... and Beyond. (arXiv:2101.05510v4 [cs.SI] UPDATED)
    In this tutorial, we provide a didactic treatment of the emerging topic of signal processing on higher-order networks. Drawing analogies from discrete and graph signal processing, we introduce the building blocks for processing data on simplicial complexes and hypergraphs, two common higher-order network abstractions that can incorporate polyadic relationships. We provide brief introductions to simplicial complexes and hypergraphs, with a special emphasis on the concepts needed for the processing of signals supported on these structures. Specifically, we discuss Fourier analysis, signal denoising, signal interpolation, node embeddings, and nonlinear processing through neural networks, using these two higher-order network models. In the context of simplicial complexes, we specifically focus on signal processing using the Hodge Laplacian matrix, a multi-relational operator that leverages the special structure of simplicial complexes and generalizes desirable properties of the Laplacian matrix in graph signal processing. For hypergraphs, we present both matrix and tensor representations, and discuss the trade-offs in adopting one or the other. We also highlight limitations and potential research avenues, both to inform practitioners and to motivate the contribution of new researchers to the area.  ( 2 min )
    Towards Scalable and Privacy-Preserving Deep Neural Network via Algorithmic-Cryptographic Co-design. (arXiv:2012.09364v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) have achieved remarkable progress in various real-world applications, especially when abundant training data are provided. However, data isolation has become a serious problem currently. Existing works build privacy preserving DNN models from either algorithmic perspective or cryptographic perspective. The former mainly splits the DNN computation graph between data holders or between data holders and server, which demonstrates good scalability but suffers from accuracy loss and potential privacy risks. In contrast, the latter leverages time-consuming cryptographic techniques, which has strong privacy guarantee but poor scalability. In this paper, we propose SPNN - a Scalable and Privacy-preserving deep Neural Network learning framework, from algorithmic-cryptographic co-perspective. From algorithmic perspective, we split the computation graph of DNN models into two parts, i.e., the private data related computations that are performed by data holders and the rest heavy computations that are delegated to a server with high computation ability. From cryptographic perspective, we propose using two types of cryptographic techniques, i.e., secret sharing and homomorphic encryption, for the isolated data holders to conduct private data related computations privately and cooperatively. Furthermore, we implement SPNN in a decentralized setting and introduce user-friendly APIs. Experimental results conducted on real-world datasets demonstrate the superiority of SPNN.  ( 2 min )
    Spatio-Temporal EEG Representation Learning on Riemannian Manifold and Euclidean Space. (arXiv:2008.08633v2 [cs.CV] UPDATED)
    We present a novel deep neural architecture for learning Electroencephalogram (EEG). To learn the spatial information, our model first obtains the Riemannian mean and distance from Spatial Covariance Matrices (SCMs) on the Riemannian manifold. We then project the spatial information onto the Euclidean space via tangent space learning. Following, two fully connected layers are used to learn the spatial information embeddings. Moreover, our proposed method learns the temporal information via differential entropy and logarithm power spectrum density features extracted from EEG signals in Euclidean space using a deep long short-term memory network with a soft attention mechanism. To combine the spatial and temporal information, we use an effective fusion strategy, which learns attention weights applied to embedding-specific features for decision making. We evaluate our proposed framework on four public datasets across three popular EEG-related tasks, notably emotion recognition, vigilance estimation, and motor imagery classification, containing various types of tasks such as binary classification, multi-class classification, and regression. Our proposed architecture approaches the state-of-the-art on one dataset (SEED) and outperforms other methods on the other three datasets (SEED-VIG, BCI-IV 2A, and BCI-IV 2B), setting new state-of-the-art values and showing the robustness of our framework in EEG representation learning. The source code of our paper is publicly available at https://github.com/guangyizhangbci/EEG_Riemannian.  ( 2 min )
    SRL-SOA: Self-Representation Learning with Sparse 1D-Operational Autoencoder for Hyperspectral Image Band Selection. (arXiv:2202.09918v1 [cs.CV])
    The band selection in the hyperspectral image (HSI) data processing is an important task considering its effect on the computational complexity and accuracy. In this work, we propose a novel framework for the band selection problem: Self-Representation Learning (SRL) with Sparse 1D-Operational Autoencoder (SOA). The proposed SLR-SOA approach introduces a novel autoencoder model, SOA, that is designed to learn a representation domain where the data are sparsely represented. Moreover, the network composes of 1D-operational layers with the non-linear neuron model. Hence, the learning capability of neurons (filters) is greatly improved with shallow architectures. Using compact architectures is especially crucial in autoencoders as they tend to overfit easily because of their identity mapping objective. Overall, we show that the proposed SRL-SOA band selection approach outperforms the competing methods over two HSI data including Indian Pines and Salinas-A considering the achieved land cover classification accuracies. The software implementation of the SRL-SOA approach is shared publicly at https://github.com/meteahishali/SRL-SOA.  ( 2 min )
    Opening the Black Box of Deep Neural Networks in Physical Layer Communication. (arXiv:2106.01124v3 [eess.SP] UPDATED)
    Deep Neural Network (DNN)-based physical layer techniques are attracting considerable interest due to their potential to enhance communication systems. However, most studies in the physical layer have tended to focus on the application of DNN models to wireless communication problems but not to theoretically understand how does a DNN work in a communication system. In this paper, we aim to quantitatively analyze why DNNs can achieve comparable performance in the physical layer comparing with traditional techniques and their cost in terms of computational complexity. We further investigate and also experimentally validate how information is flown in a DNN-based communication system under the information theoretic concepts.  ( 2 min )
    CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application. (arXiv:2008.09264v4 [eess.AS] UPDATED)
    This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. The CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), allowing CITISEN to be used as a platform for utilizing and evaluating SE models and flexibly extend the models to address various noise environments and users. For SE, a pretrained SE model downloaded from the cloud server is used to effectively reduce noise components from instant or saved recordings provided by users. For encountering unseen noise or speaker environments, the MA function is applied to promote CITISEN. A few audio samples recording on a noisy environment are uploaded and used to adapt the pretrained SE model on the server. Finally, for BNC, CITISEN first removes the background noises through an SE model and then mixes the processed speech with new background noise. The novel BNC function can evaluate SE performance under specific conditions, cover people's tracks, and provide entertainment. The experimental results confirmed the effectiveness of SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals achieved about 6\% and 33\% of improvements, respectively, in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). With MA, the STOI and PESQ could be further improved by approximately 6\% and 11\%, respectively. Finally, the BNC experiment results indicated that the speech signals converted from noisy and silent backgrounds have a close scene identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and be a data augmentation method when clean speech signals are unavailable.  ( 3 min )
    Double-descent curves in neural networks: a new perspective using Gaussian processes. (arXiv:2102.07238v4 [stat.ML] UPDATED)
    Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters which is less than the number of data points, but then descends again in the overparameterized regime. Here we use a neural network Gaussian process (NNGP) which maps exactly to a fully connected network (FCN) in the infinite-width limit, combined with techniques from random matrix theory, to calculate this generalisation behaviour. An advantage of our NNGP approach is that the analytical calculations are easier to interpret. We argue that the fact that the generalisation error of neural networks decreases in the overparameterized regime and has a finite theoretical value is explained by the convergence to their limiting Gaussian processes. Our analysis thus provides a mathematical explanation for a surprising phenomenon that could not explained by conventional statistical learning theory. However, understanding what drives these finite theoretical values to be the state-of-the-art generalisation performances in many applications remains an open question, for which we only provide new leads in this paper.  ( 2 min )
    MultiRocket: Multiple pooling operators and transformations for fast and effective time series classification. (arXiv:2102.00457v4 [cs.LG] UPDATED)
    We propose MultiRocket, a fast time series classification (TSC) algorithm that achieves state-of-the-art performance with a tiny fraction of the time and without the complex ensembling structure of many state-of-the-art methods. MultiRocket improves on MiniRocket, one of the fastest TSC algorithms to date, by adding multiple pooling operators and transformations to improve the diversity of the features generated. In addition to processing the raw input series, MultiRocket also applies first order differences to transform the original series. Convolutions are applied to both representations, and four pooling operators are applied to the convolution outputs. When benchmarked using the University of California Riverside TSC benchmark datasets, MultiRocket is significantly more accurate than MiniRocket, and competitive with the best ranked current method in terms of accuracy, HIVE-COTE 2.0, while being orders of magnitude faster.  ( 2 min )
    Routine Clustering of Mobile Sensor Data Facilitates Psychotic Relapse Prediction in Schizophrenia Patients. (arXiv:2106.11487v2 [cs.LG] UPDATED)
    We aim to develop clustering models to obtain behavioral representations from continuous multimodal mobile sensing data towards relapse prediction tasks. The identified clusters could represent different routine behavioral trends related to daily living of patients as well as atypical behavioral trends associated with impending relapse. We used the mobile sensing data obtained in the CrossCheck project for our analysis. Continuous data from six different mobile sensing-based modalities (e.g. ambient light, sound/conversation, acceleration etc.) obtained from a total of 63 schizophrenia patients, each monitored for up to a year, were used for the clustering models and relapse prediction evaluation. Two clustering models, Gaussian Mixture Model (GMM) and Partition Around Medoids (PAM), were used to obtain behavioral representations from the mobile sensing data. The features obtained from the clustering models were used to train and evaluate a personalized relapse prediction model using Balanced Random Forest. The personalization was done by identifying optimal features for a given patient based on a personalization subset consisting of other patients who are of similar age. The clusters identified using the GMM and PAM models were found to represent different behavioral patterns (such as clusters representing sedentary days, active but with low communications days, etc.). Significant changes near the relapse periods were seen in the obtained behavioral representation features from the clustering models. The clustering model based features, together with other features characterizing the mobile sensing data, resulted in an F2 score of 0.24 for the relapse prediction task in a leave-one-patient-out evaluation setting. This obtained F2 score is significantly higher than a random classification baseline with an average F2 score of 0.042.  ( 2 min )
    Adversarially Robust Kernel Smoothing. (arXiv:2102.08474v4 [cs.LG] UPDATED)
    We propose a scalable robust learning algorithm combining kernel smoothing and robust optimization. Our method is motivated by the convex analysis perspective of distributionally robust optimization based on probability metrics, such as the Wasserstein distance and the maximum mean discrepancy. We adapt the integral operator using supremal convolution in convex analysis to form a novel function majorant used for enforcing robustness. Our method is simple in form and applies to general loss functions and machine learning models. Exploiting a connection with optimal transport, we prove theoretical guarantees for certified robustness under distribution shift. Furthermore, we report experiments with general machine learning models, such as deep neural networks, to demonstrate competitive performance with the state-of-the-art certifiable robust learning algorithms based on the Wasserstein distance.  ( 2 min )
    Image Response Regression via Deep Neural Networks. (arXiv:2006.09911v3 [stat.ML] UPDATED)
    Delineating the associations between images and a vector of covariates is of central interest in medical imaging studies. To tackle this problem of image response regression, we propose a novel nonparametric approach in the framework of spatially varying coefficient models, where the spatially varying functions are estimated through deep neural networks. Compared to existing solutions, the proposed method explicitly accounts for spatial smoothness and subject heterogeneity, has straightforward interpretations, and is highly flexible and accurate in capturing complex association patterns. A key idea in our approach is to treat the image voxels as the effective samples, which not only alleviates the limited sample size issue that haunts the majority of medical imaging studies, but also leads to more robust and reproducible results. Focusing on a broad family of piecewise smooth functions, we establish the estimation and selection consistency, and derive the asymptotic error bounds. We demonstrate the efficacy of the method through intensive simulations, and further illustrate its advantages with analyses of two functional magnetic resonance imaging datasets.  ( 2 min )
    Generalizing Dynamic Mode Decomposition: Balancing Accuracy and Expressiveness in Koopman Approximations. (arXiv:2108.03712v2 [eess.SY] UPDATED)
    This paper tackles the data-driven approximation of unknown dynamical systems using Koopman-operator methods. Given a dictionary of functions, these methods approximate the projection of the action of the operator on the finite-dimensional subspace spanned by the dictionary. We propose the Tunable Symmetric Subspace Decomposition algorithm to refine the dictionary, balancing its expressiveness and accuracy. Expressiveness corresponds to the ability of the dictionary to describe the evolution of as many observables as possible and accuracy corresponds to the ability to correctly predict their evolution. Based on the observation that Koopman-invariant subspaces give rise to exact predictions, we reason that prediction accuracy is a function of the degree of invariance of the subspace generated by the dictionary and provide a data-driven measure to measure invariance proximity. The proposed algorithm iteratively prunes the initial functional space to identify a refined dictionary of functions that satisfies the desired level of accuracy while retaining as much of the original expressiveness as possible. We provide a full characterization of the algorithm properties and show that it generalizes both Extended Dynamic Mode Decomposition and Symmetric Subspace Decomposition. Simulations on planar systems show the effectiveness of the proposed methods in producing Koopman approximations of tunable accuracy that capture relevant information about the dynamical system.  ( 2 min )
    Gradient Estimation with Discrete Stein Operators. (arXiv:2202.09497v1 [stat.ML])
    Gradient estimation -- approximating the gradient of an expectation with respect to the parameters of a distribution -- is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize the variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.  ( 2 min )
    A Meta-embedding-based Ensemble Approach for ICD Coding Prediction. (arXiv:2102.13622v2 [cs.CL] UPDATED)
    International Classification of Diseases (ICD) are the de facto codes used globally for clinical coding. These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information. The problem of automatically assigning ICD codes has been approached in literature as a multilabel classification, using neural models on unstructured data. Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles. Furthermore, we exploit the geometric properties of the two sets of word vectors and combine them into a common dimensional space, using meta-embedding techniques. We demonstrate the efficacy of this approach for a multimodal setting, using unstructured and structured information. We empirically show that our approach improves the current state-of-the-art deep learning architectures and benefits ensemble models.  ( 2 min )
    Sqrt(d) Dimension Dependence of Langevin Monte Carlo. (arXiv:2109.03839v3 [cs.LG] UPDATED)
    This article considers the popular MCMC method of unadjusted Langevin Monte Carlo (LMC) and provides a non-asymptotic analysis of its sampling error in 2-Wasserstein distance. The proof is based on a refinement of mean-square analysis in Li et al. (2019), and this refined framework automates the analysis of a large class of sampling algorithms based on discretizations of contractive SDEs. Using this framework, we establish an $\tilde{O}(\sqrt{d}/\epsilon)$ mixing time bound for LMC, without warm start, under the common log-smooth and log-strongly-convex conditions, plus a growth condition on the 3rd-order derivative of the potential of target measures. This bound improves the best previously known $\tilde{O}(d/\epsilon)$ result and is optimal (in terms of order) in both dimension $d$ and accuracy tolerance $\epsilon$ for target measures satisfying the aforementioned assumptions. Our theoretical analysis is further validated by numerical experiments.  ( 2 min )
    Massive feature extraction for explaining and foretelling hydroclimatic time series forecastability at the global scale. (arXiv:2108.00846v2 [physics.ao-ph] UPDATED)
    Statistical analyses and descriptive characterizations are sometimes assumed to be offering information on time series forecastability. Despite the scientific interest suggested by such assumptions, the relationships between descriptive time series features (e.g., temporal dependence, entropy, seasonality, trend and linearity features) and actual time series forecastability (quantified by issuing and assessing forecasts for the past) are scarcely studied and quantified in the literature. In this work, we aim to fill in this gap by investigating such relationships, and the way that they can be exploited for understanding hydroclimatic forecastability and its patterns. To this end, we follow a systematic framework bringing together a variety of -- mostly new for hydrology -- concepts and methods, including 57 descriptive features and nine seasonal time series forecasting methods (i.e., one simple, five exponential smoothing, two state space and one automated autoregressive fractionally integrated moving average methods). We apply this framework to three global datasets originating from the larger Global Historical Climatology Network (GHCN) and Global Streamflow Indices and Metadata (GSIM) archives. As these datasets comprise over 13 000 monthly temperature, precipitation and river flow time series from several continents and hydroclimatic regimes, they allow us to provide trustable characterizations and interpretations of 12-month ahead hydroclimatic forecastability at the global scale...  ( 2 min )
    ReconResNet: Regularised Residual Learning for MR Image Reconstruction of Undersampled Cartesian and Radial Data. (arXiv:2103.09203v2 [eess.IV] UPDATED)
    MRI is an inherently slow process, which leads to long scan time for high-resolution imaging. The speed of acquisition can be increased by ignoring parts of the data (undersampling). Consequently, this leads to the degradation of image quality, such as loss of resolution or introduction of image artefacts. This work aims to reconstruct highly undersampled Cartesian or radial MR acquisitions, with better resolution and with less to no artefact compared to conventional techniques like compressed sensing. In recent times, deep learning has emerged as a very important area of research and has shown immense potential in solving inverse problems, e.g. MR image reconstruction. In this paper, a deep learning based MR image reconstruction framework is proposed, which includes a modified regularised version of ResNet as the network backbone to remove artefacts from the undersampled image, followed by data consistency steps that fusions the network output with the data already available from undersampled k-space in order to further improve reconstruction quality. The performance of this framework for various undersampling patterns has also been tested, and it has been observed that the framework is robust to deal with various sampling patterns, even when mixed together while training, and results in very high quality reconstruction, in terms of high SSIM (highest being 0.990$\pm$0.006 for acceleration factor of 3.5), while being compared with the fully sampled reconstruction. It has been shown that the proposed framework can successfully reconstruct even for an acceleration factor of 20 for Cartesian (0.968$\pm$0.005) and 17 for radially (0.962$\pm$0.012) sampled data. Furthermore, it has been shown that the framework preserves brain pathology during reconstruction while being trained on healthy subjects.  ( 3 min )
    Adversarial Examples in Constrained Domains. (arXiv:2011.01183v2 [cs.CR] UPDATED)
    Machine learning algorithms have been shown to be vulnerable to adversarial manipulation through systematic modification of inputs (e.g., adversarial examples) in domains such as image recognition. Under the default threat model, the adversary exploits the unconstrained nature of images; each feature (pixel) is fully under control of the adversary. However, it is not clear how these attacks translate to constrained domains that limit which and how features can be modified by the adversary (e.g., network intrusion detection). In this paper, we explore whether constrained domains are less vulnerable than unconstrained domains to adversarial example generation algorithms. We create an algorithm for generating adversarial sketches: targeted universal perturbation vectors which encode feature saliency within the envelope of domain constraints. To assess how these algorithms perform, we evaluate them in constrained (e.g., network intrusion detection) and unconstrained (e.g., image recognition) domains. The results demonstrate that our approaches generate misclassification rates in constrained domains that were comparable to those of unconstrained domains (greater than 95%). Our investigation shows that the narrow attack surface exposed by constrained domains is still sufficiently large to craft successful adversarial examples; and thus, constraints do not appear to make a domain robust. Indeed, with as little as five randomly selected features, one can still generate adversarial examples.  ( 2 min )
    Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm. (arXiv:2101.10506v2 [cs.LG] UPDATED)
    Actor-critic style two-time-scale algorithms are one of the most popular methods in reinforcement learning, and have seen great empirical success. However, their performance is not completely understood theoretically. In this paper, we characterize the \emph{global} convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory of samples. Our analysis applies to very general settings, as we only assume ergodicity of the underlying Markov decision process. In order to ensure enough exploration, we employ an $\epsilon$-greedy sampling of the trajectory. For a fixed and small enough exploration parameter $\epsilon$, we show that the two-time-scale natural actor-critic algorithm has a rate of convergence of $\tilde{\mathcal{O}}(1/T^{1/4})$, where $T$ is the number of samples, and this leads to a sample complexity of $\Tilde{\mathcal{O}}(1/\delta^{8})$ samples to find a policy that is within an error of $\delta$ from the \emph{global optimum}. Moreover, by carefully decreasing the exploration parameter $\epsilon$ as the iterations proceed, we present an improved sample complexity of $\Tilde{\mathcal{O}}(1/\delta^{6})$ for convergence to the global optimum.  ( 2 min )
    The four-fifths rule is not disparate impact: a woeful tale of epistemic trespassing in algorithmic fairness. (arXiv:2202.09519v1 [cs.CY])
    Computer scientists are trained to create abstractions that simplify and generalize. However, a premature abstraction that omits crucial contextual details creates the risk of epistemic trespassing, by falsely asserting its relevance into other contexts. We study how the field of responsible AI has created an imperfect synecdoche by abstracting the four-fifths rule (a.k.a. the 4/5 rule or 80% rule), a single part of disparate impact discrimination law, into the disparate impact metric. This metric incorrectly introduces a new deontic nuance and new potentials for ethical harms that were absent in the original 4/5 rule. We also survey how the field has amplified the potential for harm in codifying the 4/5 rule into popular AI fairness software toolkits. The harmful erasure of legal nuances is a wake-up call for computer scientists to self-critically re-evaluate the abstractions they create and use, particularly in the interdisciplinary field of AI ethics.  ( 2 min )
    Video Reenactment as Inductive Bias for Content-Motion Disentanglement. (arXiv:2102.00324v3 [cs.CV] UPDATED)
    Independent components within low-dimensional representations are essential inputs in several downstream tasks, and provide explanations over the observed data. Video-based disentangled factors of variation provide low-dimensional representations that can be identified and used to feed task-specific models. We introduce MTC-VAE, a self-supervised motion-transfer VAE model to disentangle motion and content from videos. Unlike previous work on video content-motion disentanglement, we adopt a chunk-wise modeling approach and take advantage of the motion information contained in spatiotemporal neighborhoods. Our model yields independent per-chunk representations that preserve temporal consistency. Hence, we reconstruct whole videos in a single forward-pass. We extend the ELBO's log-likelihood term and include a Blind Reenactment Loss as an inductive bias to leverage motion disentanglement, under the assumption that swapping motion features yields reenactment between two videos. We evaluate our model with recently-proposed disentanglement metrics and show that it outperforms a variety of methods for video motion-content disentanglement. Experiments on video reenactment show the effectiveness of our disentanglement in the input space where our model outperforms the baselines in reconstruction quality and motion alignment.  ( 2 min )
    Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization. (arXiv:2103.01499v2 [cs.LG] UPDATED)
    Batch Normalization (BN) is a commonly used technique to accelerate and stabilize training of deep neural networks. Despite its empirical success, a full theoretical understanding of BN is yet to be developed. In this work, we analyze BN through the lens of convex optimization. We introduce an analytic framework based on convex duality to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial-time. Our analyses also show that optimal layer weights can be obtained as simple closed-form formulas in the high-dimensional and/or overparameterized regimes. Furthermore, we find that Gradient Descent provides an algorithmic bias effect on the standard non-convex BN network, and we design an approach to explicitly encode this implicit regularization into the convex objective. Experiments with CIFAR image classification highlight the effectiveness of this explicit regularization for mimicking and substantially improving the performance of standard BN networks.  ( 2 min )
    Suitability of Different Metric Choices for Concept Drift Detection. (arXiv:2202.09486v1 [cs.LG])
    The notion of concept drift refers to the phenomenon that the distribution, which is underlying the observed data, changes over time; as a consequence machine learning models may become inaccurate and need adjustment. Many unsupervised approaches for drift detection rely on measuring the discrepancy between the sample distributions of two time windows. This may be done directly, after some preprocessing (feature extraction, embedding into a latent space, etc.), or with respect to inferred features (mean, variance, conditional probabilities etc.). Most drift detection methods can be distinguished in what metric they use, how this metric is estimated, and how the decision threshold is found. In this paper, we analyze structural properties of the drift induced signals in the context of different metrics. We compare different types of estimators and metrics theoretically and empirically and investigate the relevance of the single metric components. In addition, we propose new choices and demonstrate their suitability in several experiments.  ( 2 min )
    On the Complexity of Approximating Multimarginal Optimal Transport. (arXiv:1910.00152v3 [stat.ML] UPDATED)
    We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{O}(n^{3m})$. We then propose two simple and \textit{deterministic} algorithms for approximating the MOT problem. The first algorithm, which we refer to as \textit{multimarginal Sinkhorn} algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{O}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first \textit{near-linear time} complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as \textit{accelerated multimarginal Sinkhorn} algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{O}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm~\citep{Tupitsa-2020-Multimarginal} in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver \textsc{Gurobi}. Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms.  ( 2 min )
    Natural Language Understanding for Argumentative Dialogue Systems in the Opinion Building Domain. (arXiv:2103.02691v2 [cs.CL] UPDATED)
    This paper introduces a natural language understanding (NLU) framework for argumentative dialogue systems in the information-seeking and opinion building domain. The proposed framework consists of two sub-models, namely intent classifier and argument similarity. Intent classifier model stacks BiLSTM with attention mechanism on top of the pre-trained BERT model and fine-tune the model for recognizing the user intent, whereas the argument similarity model employs BERT+BiLSTM for identifying system arguments the user refers to in his or her natural language utterances. Our model is evaluated in an argumentative dialogue system that engages the user to inform him-/herself about a controversial topic by exploring pro and con arguments and build his/her opinion towards the topic. In order to evaluate the proposed approach, we collect user utterances for the interaction with the respective system labeling intent and referenced argument in an extensive online study. The data collection includes multiple topics and two different user types (native English speakers from the UK and non-native English speakers from China). Additionally, we evaluate the proposed intent classifier and argument similarity models separately on the publicly available Banking77 and STS benchmark datasets. The evaluation indicates a clear advantage of the utilized techniques over baseline approaches on several datasets, as well as the robustness of the proposed approach against new topics and different language proficiency as well as the cultural background of the user. Furthermore, results show that our intent classifier model outperforms DIET, DistillBERT, and BERT fine-tuned models in few-shot setups (i.e., with 10, 20, or 30 labeled examples per intent) and full data setup.  ( 2 min )
    Efficacy of regularized multi-task learning based on SVM models. (arXiv:1805.12507v2 [stat.ML] UPDATED)
    This paper investigates the efficacy of a regularized multi-task learning (MTL) framework based on SVM (M-SVM) to answer whether MTL always provides reliable results and how MTL outperforms independent learning. We first find that M-SVM is Bayes risk consistent in the limit of large sample size. This implies that despite the task dissimilarities, M-SVM always produces a reliable decision rule for each task in terms of misclassification error when the data size is large enough. Furthermore, we find that the task-interaction vanishes as the data size goes to infinity, and the convergence rates of M-SVM and its single-task counterpart have the same upper bound. The former suggests that M-SVM cannot improve the limit classifier's performance; based on the latter, we conjecture that the optimal convergence rate is not improved when the task number is fixed. As a novel insight of MTL, our theoretical and experimental results achieved an excellent agreement that the benefit of the MTL methods lies in the improvement of the pre-convergence-rate factor (PCR, to be denoted in Section III) rather than the convergence rate. Moreover, this improvement of PCR factors is more significant when the data size is small.  ( 2 min )
    Alternative design of DeepPDNet in the context of image restoration. (arXiv:2202.09810v1 [eess.IV])
    This work designs an image restoration deep network relying on unfolded Chambolle-Pock primal-dual iterations. Each layer of our network is built from Chambolle-Pock iterations when specified for minimizing a sum of a $\ell_2$-norm data-term and an analysis sparse prior. The parameters of our network are the step-sizes of the Chambolle-Pock scheme and the linear operator involved in sparsity-based penalization, including implicitly the regularization parameter. A backpropagation procedure is fully described. Preliminary experiments illustrate the good behavior of such a deep primal-dual network in the context of image restoration on BSD68 database.  ( 2 min )
    Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression. (arXiv:2202.09889v1 [stat.ML])
    We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.  ( 2 min )
    Personalized Federated Learning with Exact Stochastic Gradient Descent. (arXiv:2202.09848v1 [cs.LG])
    In Federated Learning (FL), datasets across clients tend to be heterogeneous or personalized, and this poses challenges to the convergence of standard FL schemes that do not account for personalization. To address this, we present a new approach for personalized FL that achieves exact stochastic gradient descent (SGD) minimization. We start from the FedPer (Arivazhagan et al., 2019) neural network (NN) architecture for personalization, whereby the NN has two types of layers: the first ones are the common layers across clients, while the few final ones are client-specific and are needed for personalization. We propose a novel SGD-type scheme where, at each optimization round, randomly selected clients perform gradient-descent updates over their client-specific weights towards optimizing the loss function on their own datasets, without updating the common weights. At the final update, each client computes the joint gradient over both client-specific and common weights and returns the gradient of common parameters to the server. This allows to perform an exact and unbiased SGD step over the full set of parameters in a distributed manner, i.e. the updates of the personalized parameters are performed by the clients and those of the common ones by the server. Our method is superior to FedAvg and FedPer baselines in multi-class classification benchmarks such as Omniglot, CIFAR-10, MNIST, Fashion-MNIST, and EMNIST and has much lower computational complexity per round.  ( 2 min )
    $\mathcal{Y}$-Tuning: An Efficient Tuning Paradigm for Large-Scale Pre-Trained Models via Label Representation Learning. (arXiv:2202.09817v1 [cs.CL])
    With the success of large-scale pre-trained models (PTMs), how efficiently adapting PTMs to downstream tasks has attracted tremendous attention, especially for PTMs with billions of parameters. Although some parameter-efficient tuning paradigms have been proposed to address this problem, they still require large resources to compute the gradients in the training phase. In this paper, we propose $\mathcal{Y}$-Tuning, an efficient yet effective paradigm to adapt frozen large-scale PTMs to specific downstream tasks. $\mathcal{Y}$-tuning learns dense representations for labels $\mathcal{Y}$ defined in a given task and aligns them to fixed feature representation. Without tuning the features of input text and model parameters, $\mathcal{Y}$-tuning is both parameter-efficient and training-efficient. For $\text{DeBERTa}_\text{XXL}$ with 1.6 billion parameters, $\mathcal{Y}$-tuning achieves performance more than $96\%$ of full fine-tuning on GLUE Benchmark with only $2\%$ tunable parameters and much fewer training costs.  ( 2 min )
    It's Raw! Audio Generation with State-Space Models. (arXiv:2202.09729v1 [cs.SD])
    Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resultant architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone architecture for a diffusion model. Compared to prior architectures in the autoregressive generation setting, SaShiMi generates piano and speech waveforms which humans find more musical and coherent respectively, e.g. 2x better mean opinion scores than WaveNet on an unconditional speech generation task. On a music generation task, SaShiMi outperforms WaveNet on density estimation and speed at both training and inference even when using 3x fewer parameters. Code can be found at https://github.com/HazyResearch/state-spaces and samples at https://hazyresearch.stanford.edu/sashimi-examples.  ( 2 min )
    Image-to-Graph Transformers for Chemical Structure Recognition. (arXiv:2202.09580v1 [cs.CV])
    For several decades, chemical knowledge has been published in written text, and there have been many attempts to make it accessible, for example, by transforming such natural language text to a structured format. Although the discovered chemical itself commonly represented in an image is the most important part, the correct recognition of the molecular structure from the image in literature still remains a hard problem since they are often abbreviated to reduce the complexity and drawn in many different styles. In this paper, we present a deep learning model to extract molecular structures from images. The proposed model is designed to transform the molecular image directly into the corresponding graph, which makes it capable of handling non-atomic symbols for abbreviations. Also, by end-to-end learning approach it can fully utilize many open image-molecule pair data from various sources, and hence it is more robust to image style variation than other tools. The experimental results show that the proposed model outperforms the existing models with 17.1 % and 12.8 % relative improvement for well-known benchmark datasets and large molecular images that we collected from literature, respectively.  ( 2 min )
    Graph Reparameterizations for Enabling 1000+ Monte Carlo Iterations in Bayesian Deep Neural Networks. (arXiv:2202.09478v1 [cs.LG])
    Uncertainty estimation in deep models is essential in many real-world applications and has benefited from developments over the last several years. Recent evidence suggests that existing solutions dependent on simple Gaussian formulations may not be sufficient. However, moving to other distributions necessitates Monte Carlo (MC) sampling to estimate quantities such as the KL divergence: it could be expensive and scales poorly as the dimensions of both the input data and the model grow. This is directly related to the structure of the computation graph, which can grow linearly as a function of the number of MC samples needed. Here, we construct a framework to describe these computation graphs, and identify probability families where the graph size can be independent or only weakly dependent on the number of MC samples. These families correspond directly to large classes of distributions. Empirically, we can run a much larger number of iterations for MC approximations for larger architectures used in computer vision with gains in performance measured in confident accuracy, stability of training, memory and training time.  ( 2 min )
    SOInter: A Novel Deep Energy Based Interpretation Method for Explaining Structured Output Models. (arXiv:2202.09914v1 [cs.LG])
    We propose a novel interpretation technique to explain the behavior of structured output models, which learn mappings between an input vector to a set of output variables simultaneously. Because of the complex relationship between the computational path of output variables in structured models, a feature can affect the value of output through other ones. We focus on one of the outputs as the target and try to find the most important features utilized by the structured model to decide on the target in each locality of the input space. In this paper, we assume an arbitrary structured output model is available as a black box and argue how considering the correlations between output variables can improve the explanation performance. The goal is to train a function as an interpreter for the target output variable over the input space. We introduce an energy-based training process for the interpreter function, which effectively considers the structural information incorporated into the model to be explained. The effectiveness of the proposed method is confirmed using a variety of simulated and real data sets.  ( 2 min )
    Efficient Continual Learning Ensembles in Neural Network Subspaces. (arXiv:2202.09826v1 [cs.LG])
    A growing body of research in continual learning focuses on the catastrophic forgetting problem. While many attempts have been made to alleviate this problem, the majority of the methods assume a single model in the continual learning setup. In this work, we question this assumption and show that employing ensemble models can be a simple yet effective method to improve continual performance. However, the training and inference cost of ensembles can increase linearly with the number of models. Motivated by this limitation, we leverage the recent advances in the deep learning optimization literature, such as mode connectivity and neural network subspaces, to derive a new method that is both computationally advantageous and can outperform the state-of-the-art continual learning algorithms.  ( 2 min )
    Understanding Robust Generalization in Learning Regular Languages. (arXiv:2202.09717v1 [cs.LG])
    A key feature of human intelligence is the ability to generalize beyond the training distribution, for instance, parsing longer sentences than seen in the past. Currently, deep neural networks struggle to generalize robustly to such shifts in the data distribution. We study robust generalization in the context of using recurrent neural networks (RNNs) to learn regular languages. We hypothesize that standard end-to-end modeling strategies cannot generalize well to systematic distribution shifts and propose a compositional strategy to address this. We compare an end-to-end strategy that maps strings to labels with a compositional strategy that predicts the structure of the deterministic finite-state automaton (DFA) that accepts the regular language. We theoretically prove that the compositional strategy generalizes significantly better than the end-to-end strategy. In our experiments, we implement the compositional strategy via an auxiliary task where the goal is to predict the intermediate states visited by the DFA when parsing a string. Our empirical results support our hypothesis, showing that auxiliary tasks can enable robust generalization. Interestingly, the end-to-end RNN generalizes significantly better than the theoretical lower bound, suggesting that it is able to achieve at least some degree of robust generalization.  ( 2 min )
    Generalized Bayesian Additive Regression Trees Models: Beyond Conditional Conjugacy. (arXiv:2202.09924v1 [stat.ML])
    Bayesian additive regression trees have seen increased interest in recent years due to their ability to combine machine learning techniques with principled uncertainty quantification. The Bayesian backfitting algorithm used to fit BART models, however, limits their application to a small class of models for which conditional conjugacy exists. In this article, we greatly expand the domain of applicability of BART to arbitrary \emph{generalized BART} models by introducing a very simple, tuning-parameter-free, reversible jump Markov chain Monte Carlo algorithm. Our algorithm requires only that the user be able to compute the likelihood and (optionally) its gradient and Fisher information. The potential applications are very broad; we consider examples in survival analysis, structured heteroskedastic regression, and gamma shape regression.  ( 2 min )
    On Optimal Early Stopping: Over-informative versus Under-informative Parametrization. (arXiv:2202.09885v1 [cs.LG])
    Early stopping is a simple and widely used method to prevent over-training neural networks. We develop theoretical results to reveal the relationship between the optimal early stopping time and model dimension as well as sample size of the dataset for certain linear models. Our results demonstrate two very different behaviors when the model dimension exceeds the number of features versus the opposite scenario. While most previous works on linear models focus on the latter setting, we observe that the dimension of the model often exceeds the number of features arising from data in common deep learning tasks and propose a model to study this setting. We demonstrate experimentally that our theoretical results on optimal early stopping time corresponds to the training process of deep neural networks.  ( 2 min )
    A Variance-Reduced Stochastic Accelerated Primal Dual Algorithm. (arXiv:2202.09688v1 [math.OC])
    In this work, we consider strongly convex strongly concave (SCSC) saddle point (SP) problems $\min_{x\in\mathbb{R}^{d_x}}\max_{y\in\mathbb{R}^{d_y}}f(x,y)$ where $f$ is $L$-smooth, $f(.,y)$ is $\mu$-strongly convex for every $y$, and $f(x,.)$ is $\mu$-strongly concave for every $x$. Such problems arise frequently in machine learning in the context of robust empirical risk minimization (ERM), e.g. $\textit{distributionally robust}$ ERM, where partial gradients are estimated using mini-batches of data points. Assuming we have access to an unbiased stochastic first-order oracle we consider the stochastic accelerated primal dual (SAPD) algorithm recently introduced in Zhang et al. [2021] for SCSC SP problems as a robust method against gradient noise. In particular, SAPD recovers the well-known stochastic gradient descent ascent (SGDA) as a special case when the momentum parameter is set to zero and can achieve an accelerated rate when the momentum parameter is properly tuned, i.e., improving the $\kappa \triangleq L/\mu$ dependence from $\kappa^2$ for SGDA to $\kappa$. We propose efficient variance-reduction strategies for SAPD based on Richardson-Romberg extrapolation and show that our method improves upon SAPD both in practice and in theory.  ( 2 min )
    ExAIS: Executable AI Semantics. (arXiv:2202.09868v1 [cs.AI])
    Neural networks can be regarded as a new programming paradigm, i.e., instead of building ever-more complex programs through (often informal) logical reasoning in the programmers' mind, complex 'AI' systems are built by optimising generic neural network models with big data. In this new paradigm, AI frameworks such as TensorFlow and PyTorch play a key role, which is as essential as the compiler for traditional programs. It is known that the lack of a proper semantics for programming languages (such as C), i.e., a correctness specification for compilers, has contributed to many problematic program behaviours and security issues. While it is in general hard to have a correctness specification for compilers due to the high complexity of programming languages and their rapid evolution, we have a unique opportunity to do it right this time for neural networks (which have a limited set of functions, and most of them have stable semantics). In this work, we report our effort on providing a correctness specification of neural network frameworks such as TensorFlow. We specify the semantics of almost all TensorFlow layers in the logical programming language Prolog. We demonstrate the usefulness of the semantics through two applications. One is a fuzzing engine for TensorFlow, which features a strong oracle and a systematic way of generating valid neural networks. The other is a model validation approach which enables consistent bug reporting for TensorFlow models.  ( 2 min )
    Dynamic and Efficient Gray-Box Hyperparameter Optimization for Deep Learning. (arXiv:2202.09774v1 [cs.LG])
    Gray-box hyperparameter optimization techniques have recently emerged as a promising direction for tuning Deep Learning methods. In this work, we introduce DyHPO, a method that learns to dynamically decide which configuration to try next, and for what budget. Our technique is a modification to the classical Bayesian optimization for a gray-box setup. Concretely, we propose a new surrogate for Gaussian Processes that embeds the learning curve dynamics and a new acquisition function that incorporates multi-budget information. We demonstrate the significant superiority of DyHPO against state-of-the-art hyperparameter optimization baselines through large-scale experiments comprising 50 datasets (Tabular, Image, NLP) and diverse neural networks (MLP, CNN/NAS, RNN).  ( 2 min )
    Polytopic Matrix Factorization: Determinant Maximization Based Criterion and Identifiability. (arXiv:2202.09638v1 [stat.ML])
    We introduce Polytopic Matrix Factorization (PMF) as a novel data decomposition approach. In this new framework, we model input data as unknown linear transformations of some latent vectors drawn from a polytope. In this sense, the article considers a semi-structured data model, in which the input matrix is modeled as the product of a full column rank matrix and a matrix containing samples from a polytope as its column vectors. The choice of polytope reflects the presumed features of the latent components and their mutual relationships. As the factorization criterion, we propose the determinant maximization (Det-Max) for the sample autocorrelation matrix of the latent vectors. We introduce a sufficient condition for identifiability, which requires that the convex hull of the latent vectors contains the maximum volume inscribed ellipsoid of the polytope with a particular tightness constraint. Based on the Det-Max criterion and the proposed identifiability condition, we show that all polytopes that satisfy a particular symmetry restriction qualify for the PMF framework. Having infinitely many polytope choices provides a form of flexibility in characterizing latent vectors. In particular, it is possible to define latent vectors with heterogeneous features, enabling the assignment of attributes such as nonnegativity and sparsity at the subvector level. The article offers examples illustrating the connection between polytope choices and the corresponding feature representations.  ( 2 min )
    Generalized Optimistic Methods for Convex-Concave Saddle Point Problems. (arXiv:2202.09674v1 [math.OC])
    The optimistic gradient method has seen increasing popularity as an efficient first-order method for solving convex-concave saddle point problems. To analyze its iteration complexity, a recent work [arXiv:1901.08511] proposed an interesting perspective that interprets the optimistic gradient method as an approximation to the proximal point method. In this paper, we follow this approach and distill the underlying idea of optimism to propose a generalized optimistic method, which encompasses the optimistic gradient method as a special case. Our general framework can handle constrained saddle point problems with composite objective functions and can work with arbitrary norms with compatible Bregman distances. Moreover, we also develop an adaptive line search scheme to select the stepsizes without knowledge of the smoothness coefficients. We instantiate our method with first-order, second-order and higher-order oracles and give sharp global iteration complexity bounds. When the objective function is convex-concave, we show that the averaged iterates of our $p$-th-order method ($p\geq 1$) converge at a rate of $\mathcal{O}(1/N^\frac{p+1}{2})$. When the objective function is further strongly-convex-strongly-concave, we prove a complexity bound of $\mathcal{O}(\frac{L_1}{\mu}\log\frac{1}{\epsilon})$ for our first-order method and a bound of $\mathcal{O}((L_p D^\frac{p-1}{2}/\mu)^{\frac{2}{p+1}}+\log\log\frac{1}{\epsilon})$ for our $p$-th-order method ($p\geq 2$) respectively, where $L_p$ ($p\geq 1$) is the Lipschitz constant of the $p$-th-order derivative, $\mu$ is the strongly-convex parameter, and $D$ is the initial Bregman distance to the saddle point. Moreover, our line search scheme provably only requires an almost constant number of calls to a subproblem solver per iteration on average, making our first-order and second-order methods particularly amenable to implementation.  ( 2 min )
    Deep Learning for Hate Speech Detection: A Comparative Study. (arXiv:2202.09517v1 [cs.CL])
    Automated hate speech detection is an important tool in combating the spread of hate speech, particularly in social media. Numerous methods have been developed for the task, including a recent proliferation of deep-learning based approaches. A variety of datasets have also been developed, exemplifying various manifestations of the hate-speech detection problem. We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods, mediated through the three most commonly used datasets. Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art. We particularly focus our analysis on measures of practical performance, including detection accuracy, computational efficiency, capability in using pre-trained models, and domain generalization. In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions. Code and dataset are available at https://github.com/jmjmalik22/Hate-Speech-Detection.  ( 2 min )
    ChemTab: A Physics Guided Chemistry Modeling Framework. (arXiv:2202.09855v1 [cs.LG])
    Modeling of turbulent combustion system requires modeling the underlying chemistry and the turbulent flow. Solving both systems simultaneously is computationally prohibitive. Instead, given the difference in scales at which the two sub-systems evolve, the two sub-systems are typically (re)solved separately. Popular approaches such as the Flamelet Generated Manifolds (FGM) use a two-step strategy where the governing reaction kinetics are pre-computed and mapped to a low-dimensional manifold, characterized by a few reaction progress variables (model reduction) and the manifold is then "looked-up" during the run-time to estimate the high-dimensional system state by the flow system. While existing works have focused on these two steps independently, we show that joint learning of the progress variables and the look-up model, can yield more accurate results. We propose a deep neural network architecture, called ChemTab, customized for the joint learning task and experimentally demonstrate its superiority over existing state-of-the-art methods.  ( 2 min )
    Shaping Advice in Deep Reinforcement Learning. (arXiv:2202.09489v1 [cs.MA])
    Reinforcement learning involves agents interacting with an environment to complete tasks. When rewards provided by the environment are sparse, agents may not receive immediate feedback on the quality of actions that they take, thereby affecting learning of policies. In this paper, we propose to methods to augment the reward signal from the environment with an additional reward termed shaping advice in both single and multi-agent reinforcement learning. The shaping advice is specified as a difference of potential functions at consecutive time-steps. Each potential function is a function of observations and actions of the agents. The use of potential functions is underpinned by an insight that the total potential when starting from any state and returning to the same state is always equal to zero. We show through theoretical analyses and experimental validation that the shaping advice does not distract agents from completing tasks specified by the environment reward. Theoretically, we prove that the convergence of policy gradients and value functions when using shaping advice implies the convergence of these quantities in the absence of shaping advice. We design two algorithms- Shaping Advice in Single-agent reinforcement learning (SAS) and Shaping Advice in Multi-agent reinforcement learning (SAM). Shaping advice in SAS and SAM needs to be specified only once at the start of training, and can easily be provided by non-experts. Experimentally, we evaluate SAS and SAM on two tasks in single-agent environments and three tasks in multi-agent environments that have sparse rewards. We observe that using shaping advice results in agents learning policies to complete tasks faster, and obtain higher rewards than algorithms that do not use shaping advice.  ( 2 min )
    NetSentry: A Deep Learning Approach to Detecting Incipient Large-scale Network Attacks. (arXiv:2202.09873v1 [cs.CR])
    Machine Learning (ML) techniques are increasingly adopted to tackle ever-evolving high-profile network attacks, including DDoS, botnet, and ransomware, due to their unique ability to extract complex patterns hidden in data streams. These approaches are however routinely validated with data collected in the same environment, and their performance degrades when deployed in different network topologies and/or applied on previously unseen traffic, as we uncover. This suggests malicious/benign behaviors are largely learned superficially and ML-based Network Intrusion Detection System (NIDS) need revisiting, to be effective in practice. In this paper we dive into the mechanics of large-scale network attacks, with a view to understanding how to use ML for Network Intrusion Detection (NID) in a principled way. We reveal that, although cyberattacks vary significantly in terms of payloads, vectors and targets, their early stages, which are critical to successful attack outcomes, share many similarities and exhibit important temporal correlations. Therefore, we treat NID as a time-sensitive task and propose NetSentry, perhaps the first of its kind NIDS that builds on Bidirectional Asymmetric LSTM (Bi-ALSTM), an original ensemble of sequential neural models, to detect network threats before they spread. We cross-evaluate NetSentry using two practical datasets, training on one and testing on the other, and demonstrate F1 score gains above 33% over the state-of-the-art, as well as up to 3 times higher rates of detecting attacks such as XSS and web bruteforce. Further, we put forward a novel data augmentation technique that boosts the generalization abilities of a broad range of supervised deep learning algorithms, leading to average F1 score gains above 35%.  ( 2 min )
    Who Are the Best Adopters? User Selection Model for Free Trial Item Promotion. (arXiv:2202.09508v1 [cs.IR])
    With the increasingly fierce market competition, offering a free trial has become a potent stimuli strategy to promote products and attract users. By providing users with opportunities to experience goods without charge, a free trial makes adopters know more about products and thus encourages their willingness to buy. However, as the critical point in the promotion process, finding the proper adopters is rarely explored. Empirically winnowing users by their static demographic attributes is feasible but less effective, neglecting their personalized preferences. To dynamically match the products with the best adopters, in this work, we propose a novel free trial user selection model named SMILE, which is based on reinforcement learning (RL) where an agent actively selects specific adopters aiming to maximize the profit after free trials. Specifically, we design a tree structure to reformulate the action space, which allows us to select adopters from massive user space efficiently. The experimental analysis on three datasets demonstrates the proposed model's superiority and elucidates why reinforcement learning and tree structure can improve performance. Our study demonstrates technical feasibility for constructing a more robust and intelligent user selection model and guides for investigating more marketing promotion strategies.  ( 2 min )
    Trying to Outrun Causality in Machine Learning: Limitations of Model Explainabilty Techniques for Identifying Predictive Variables. (arXiv:2202.09875v1 [stat.ML])
    Machine Learning explainability techniques have been proposed as a means of `explaining' or interrogating a model in order to understand why a particular decision or prediction has been made. Such an ability is especially important at a time when machine learning is being used to automate decision processes which concern sensitive factors and legal outcomes. Indeed, it is even a requirement according to EU law. Furthermore, researchers concerned with imposing overly restrictive functional form (e.g. as would be the case in a linear regression) may be motivated to use machine learning algorithms in conjunction with explainability techniques, as part of exploratory research, with the goal of identifying important variables which are associated with an outcome of interest. For example, epidemiologists might be interested in identifying 'risk factors' - i.e., factors which affect recovery from disease - by using random forests and assessing variable relevance using importance measures. However, and as we aim to demonstrate, machine learning algorithms are not as flexible as they might seem, and are instead incredibly sensitive to the underling causal structure in the data. The consequences of this are that predictors which are, in fact, critical to a causal system and highly correlated with the outcome, may nonetheless be deemed by explainability techniques to be unrelated/unimportant/unpredictive of the outcome. Rather than this being a limitation of explainability techniques per se, it is rather a consequence of the mathematical implications of regressions, and the interaction of these implications with the associated conditional independencies of the underlying causal structure. We provide some alternative recommendations for researchers wanting to explore the data for important variables.  ( 2 min )
    Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning. (arXiv:2202.09667v1 [cs.LG])
    Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, which is crucial in applications where experimentation is necessarily limited. OPE/L is nonetheless sensitive to discrepancies between the data-generating environment and that where policies are deployed. Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting, whose regret rates may deteriorate if propensities are estimated and whose variance is suboptimal even if not. For vanilla OPE/L, this is solved by doubly robust (DR) methods, but they do not naturally extend to the more complex DROPE/L, which involves a worst-case expectation. In this paper, we propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose Localized Doubly Robust DROPE (LDR$^2$OPE) and prove its semiparametric efficiency under weak product rates conditions. Notably, thanks to a localization technique, LDR$^2$OPE only requires fitting a small number of regressions, just like DR methods for vanilla OPE. For learning, we propose Continuum Doubly Robust DROPL (CDR$^2$OPL) and show that, under a product rate condition involving a continuum of regressions, it enjoys a fast regret rate of $\mathcal{O}(N^{-1/2})$ even when unknown propensities are nonparametrically estimated. We further extend our results to general $f$-divergence uncertainty sets. We illustrate the advantage of our algorithms in simulations.  ( 2 min )
    FedEmbed: Personalized Private Federated Learning. (arXiv:2202.09472v1 [cs.LG])
    Federated learning enables the deployment of machine learning to problems for which centralized data collection is impractical. Adding differential privacy guarantees bounds on privacy while data are contributed to a global model. Adding personalization to federated learning introduces new challenges as we must account for preferences of individual users, where a data sample could have conflicting labels because one sub-population of users might view an input positively, but other sub-populations view the same input negatively. We present FedEmbed, a new approach to private federated learning for personalizing a global model that uses (1) sub-populations of similar users, and (2) personal embeddings. We demonstrate that current approaches to federated learning are inadequate for handling data with conflicting labels, and we show that FedEmbed achieves up to 45% improvement over baseline approaches to personalized private federated learning.  ( 2 min )
    Parallel Sampling for Efficient High-dimensional Bayesian Network Structure Learning. (arXiv:2202.09691v1 [cs.LG])
    Score-based algorithms that learn the structure of Bayesian networks can be used for both exact and approximate solutions. While approximate learning scales better with the number of variables, it can be computationally expensive in the presence of high dimensional data. This paper describes an approximate algorithm that performs parallel sampling on Candidate Parent Sets (CPSs), and can be viewed as an extension of MINOBS which is a state-of-the-art algorithm for structure learning from high dimensional data. The modified algorithm, which we call Parallel Sampling MINOBS (PS-MINOBS), constructs the graph by sampling CPSs for each variable. Sampling is performed in parallel under the assumption the distribution of CPSs is half-normal when ordered by Bayesian score for each variable. Sampling from a half-normal distribution ensures that the CPSs sampled are likely to be those which produce the higher scores. Empirical results show that, in most cases, the proposed algorithm discovers higher score structures than MINOBS when both algorithms are restricted to the same runtime limit.  ( 2 min )
    Sparsity Winning Twice: Better Robust Generaliztion from More Efficient Training. (arXiv:2202.09844v1 [cs.CV])
    Recent studies demonstrate that deep networks, even robustified by the state-of-the-art adversarial training (AT), still suffer from large robust generalization gaps, in addition to the much more expensive training costs than standard training. In this paper, we investigate this intriguing problem from a new perspective, i.e., injecting appropriate forms of sparsity during adversarial training. We introduce two alternatives for sparse adversarial training: (i) static sparsity, by leveraging recent results from the lottery ticket hypothesis to identify critical sparse subnetworks arising from the early training; (ii) dynamic sparsity, by allowing the sparse subnetwork to adaptively adjust its connectivity pattern (while sticking to the same sparsity ratio) throughout training. We find both static and dynamic sparse methods to yield win-win: substantially shrinking the robust generalization gap and alleviating the robust overfitting, meanwhile significantly saving training and inference FLOPs. Extensive experiments validate our proposals with multiple network architectures on diverse datasets, including CIFAR-10/100 and Tiny-ImageNet. For example, our methods reduce robust generalization gap and overfitting by 34.44% and 4.02%, with comparable robust/standard accuracy boosts and 87.83%/87.82% training/inference FLOPs savings on CIFAR-100 with ResNet-18. Besides, our approaches can be organically combined with existing regularizers, establishing new state-of-the-art results in AT. Codes are available in https://github.com/VITA-Group/Sparsity-Win-Robust-Generalization.  ( 2 min )
    Survey of Machine Learning Based Intrusion Detection Methods for Internet of Medical Things. (arXiv:2202.09657v1 [cs.CR])
    Internet of Medical Things (IoMT) represents an application of the Internet of Things, where health professionals perform remote analysis of physiological data collected using sensors that are associated with patients, allowing real-time and permanent monitoring of the patient's health condition and the detection of possible diseases at an early stage. However, the use of wireless communication for data transfer exposes this data to cyberattacks, and the sensitive and private nature of this data may represent a prime interest for attackers. The use of traditional security methods on equipment that is limited in terms of storage and computing capacity is ineffective. In this context, we have performed a comprehensive survey to investigate the use of the intrusion detection system based on machine learning (ML) for IoMT security. We presented the generic three-layer architecture of IoMT, the security requirement of IoMT security. We review the various threats that can affect IoMT security and identify the advantage, disadvantages, methods, and datasets used in each solution based on ML. Then we provide some challenges and limitations of applying ML on each layer of IoMT, which can serve as direction for future study.  ( 2 min )
    Robust Reinforcement Learning as a Stackelberg Game via Adaptively-Regularized Adversarial Training. (arXiv:2202.09514v1 [cs.LG])
    Robust Reinforcement Learning (RL) focuses on improving performances under model errors or adversarial attacks, which facilitates the real-life deployment of RL agents. Robust Adversarial Reinforcement Learning (RARL) is one of the most popular frameworks for robust RL. However, most of the existing literature models RARL as a zero-sum simultaneous game with Nash equilibrium as the solution concept, which could overlook the sequential nature of RL deployments, produce overly conservative agents, and induce training instability. In this paper, we introduce a novel hierarchical formulation of robust RL - a general-sum Stackelberg game model called RRL-Stack - to formalize the sequential nature and provide extra flexibility for robust training. We develop the Stackelberg Policy Gradient algorithm to solve RRL-Stack, leveraging the Stackelberg learning dynamics by considering the adversary's response. Our method generates challenging yet solvable adversarial environments which benefit RL agents' robust learning. Our algorithm demonstrates better training stability and robustness against different testing conditions in the single-agent robotics control and multi-agent highway merging tasks.  ( 2 min )
    Runtime-Assured, Real-Time Neural Control of Microgrids. (arXiv:2202.09710v1 [eess.SY])
    We present SimpleMG, a new, provably correct design methodology for runtime assurance of microgrids (MGs) with neural controllers. Our approach is centered around the Neural Simplex Architecture, which in turn is based on Sha et al.'s Simplex Control Architecture. Reinforcement Learning is used to synthesize high-performance neural controllers for MGs. Barrier Certificates are used to establish SimpleMG's runtime-assurance guarantees. We present a novel method to derive the condition for switching from the unverified neural controller to the verified-safe baseline controller, and we prove that the method is correct. We conduct an extensive experimental evaluation of SimpleMG using RTDS, a high-fidelity, real-time simulation environment for power systems, on a realistic model of a microgrid comprising three distributed energy resources (battery, photovoltaic, and diesel generator). Our experiments confirm that SimpleMG can be used to develop high-performance neural controllers for complex microgrids while assuring runtime safety, even in the presence of adversarial input attacks on the neural controller. Our experiments also demonstrate the benefits of online retraining of the neural controller while the baseline controller is in control  ( 2 min )
    Symplectic Neural Networks in Taylor Series Form for Hamiltonian Systems. (arXiv:2005.04986v4 [cs.LG] UPDATED)
    We propose an effective and lightweight learning algorithm, Symplectic Taylor Neural Networks (Taylor-nets), to conduct continuous, long-term predictions of a complex Hamiltonian dynamic system based on sparse, short-term observations. At the heart of our algorithm is a novel neural network architecture consisting of two sub-networks. Both are embedded with terms in the form of Taylor series expansion designed with symmetric structure. The key mechanism underpinning our infrastructure is the strong expressiveness and special symmetric property of the Taylor series expansion, which naturally accommodate the numerical fitting process of the gradients of the Hamiltonian with respect to the generalized coordinates as well as preserve its symplectic structure. We further incorporate a fourth-order symplectic integrator in conjunction with neural ODEs' framework into our Taylor-net architecture to learn the continuous-time evolution of the target systems while simultaneously preserving their symplectic structures. We demonstrated the efficacy of our Taylor-net in predicting a broad spectrum of Hamiltonian dynamic systems, including the pendulum, the Lotka--Volterra, the Kepler, and the H\'enon--Heiles systems. Our model exhibits unique computational merits by outperforming previous methods to a great extent regarding the prediction accuracy, the convergence rate, and the robustness despite using extremely small training data with a short training period (6000 times shorter than the predicting period), small sample sizes, and no intermediate data to train the networks.  ( 2 min )
    Interactive Visual Pattern Search on Graph Data via Graph Representation Learning. (arXiv:2202.09459v1 [cs.LG])
    Graphs are a ubiquitous data structure to model processes and relations in a wide range of domains. Examples include control-flow graphs in programs and semantic scene graphs in images. Identifying subgraph patterns in graphs is an important approach to understanding their structural properties. We propose a visual analytics system GraphQ to support human-in-the-loop, example-based, subgraph pattern search in a database containing many individual graphs. To support fast, interactive queries, we use graph neural networks (GNNs) to encode a graph as fixed-length latent vector representation, and perform subgraph matching in the latent space. Due to the complexity of the problem, it is still difficult to obtain accurate one-to-one node correspondences in the matching results that are crucial for visualization and interpretation. We, therefore, propose a novel GNN for node-alignment called NeuroAlign, to facilitate easy validation and interpretation of the query results. GraphQ provides a visual query interface with a query editor and a multi-scale visualization of the results, as well as a user feedback mechanism for refining the results with additional constraints. We demonstrate GraphQ through two example usage scenarios: analyzing reusable subroutines in program workflows and semantic scene graph search in images. Quantitative experiments show that NeuroAlign achieves 19-29% improvement in node-alignment accuracy compared to baseline GNN and provides up to 100x speedup compared to combinatorial algorithms. Our qualitative study with domain experts confirms the effectiveness for both usage scenarios.  ( 2 min )
    Bayes-Optimal Classifiers under Group Fairness. (arXiv:2202.09724v1 [stat.ML])
    Machine learning algorithms are becoming integrated into more and more high-stakes decision-making processes, such as in social welfare issues. Due to the need of mitigating the potentially disparate impacts from algorithmic predictions, many approaches have been proposed in the emerging area of fair machine learning. However, the fundamental problem of characterizing Bayes-optimal classifiers under various group fairness constraints is not well understood as a theoretical benchmark. Based on the classical Neyman-Pearson argument (Neyman and Pearson, 1933; Shao, 2003) for optimal hypothesis testing, this paper provides a general framework for deriving Bayes-optimal classifiers under group fairness. This enables us to propose a group-based thresholding method that can directly control disparity, and more importantly, achieve an optimal fairness-accuracy tradeoff. These advantages are supported by experiments.  ( 2 min )
    A History of Meta-gradient: Gradient Methods for Meta-learning. (arXiv:2202.09701v1 [cs.LG])
    The history of meta-learning methods based on gradient descent is reviewed, focusing primarily on methods that adapt step-size (learning rate) meta-parameters.  ( 1 min )
    A Novel Framework for Brain Tumor Detection Based on Convolutional Variational Generative Models. (arXiv:2202.09850v1 [eess.IV])
    Brain tumor detection can make the difference between life and death. Recently, deep learning-based brain tumor detection techniques have gained attention due to their higher performance. However, obtaining the expected performance of such deep learning-based systems requires large amounts of classified images to train the deep models. Obtaining such data is usually boring, time-consuming, and can easily be exposed to human mistakes which hinder the utilization of such deep learning approaches. This paper introduces a novel framework for brain tumor detection and classification. The basic idea is to generate a large synthetic MRI images dataset that reflects the typical pattern of the brain MRI images from a small class-unbalanced collected dataset. The resulted dataset is then used for training a deep model for detection and classification. Specifically, we employ two types of deep models. The first model is a generative model to capture the distribution of the important features in a set of small class-unbalanced brain MRI images. Then by using this distribution, the generative model can synthesize any number of brain MRI images for each class. Hence, the system can automatically convert a small unbalanced dataset to a larger balanced one. The second model is the classifier that is trained using the large balanced dataset to detect brain tumors in MRI images. The proposed framework acquires an overall detection accuracy of 96.88% which highlights the promise of the proposed framework as an accurate low-overhead brain tumor detection system.  ( 2 min )
    Communication-Efficient Actor-Critic Methods for Homogeneous Markov Games. (arXiv:2202.09422v1 [cs.MA])
    Recent success in cooperative multi-agent reinforcement learning (MARL) relies on centralized training and policy sharing. Centralized training eliminates the issue of non-stationarity MARL yet induces large communication costs, and policy sharing is empirically crucial to efficient learning in certain tasks yet lacks theoretical justification. In this paper, we formally characterize a subclass of cooperative Markov games where agents exhibit a certain form of homogeneity such that policy sharing provably incurs no suboptimality. This enables us to develop the first consensus-based decentralized actor-critic method where the consensus update is applied to both the actors and the critics while ensuring convergence. We also develop practical algorithms based on our decentralized actor-critic method to reduce the communication cost during training, while still yielding policies comparable with centralized training.  ( 2 min )
    Black-box Node Injection Attack for Graph Neural Networks. (arXiv:2202.09389v1 [cs.LG])
    Graph Neural Networks (GNNs) have drawn significant attentions over the years and been broadly applied to vital fields that require high security standard such as product recommendation and traffic forecasting. Under such scenarios, exploiting GNN's vulnerabilities and further downgrade its classification performance become highly incentive for adversaries. Previous attackers mainly focus on structural perturbations of existing graphs. Although they deliver promising results, the actual implementation needs capability of manipulating the graph connectivity, which is impractical in some circumstances. In this work, we study the possibility of injecting nodes to evade the victim GNN model, and unlike previous related works with white-box setting, we significantly restrict the amount of accessible knowledge and explore the black-box setting. Specifically, we model the node injection attack as a Markov decision process and propose GA2C, a graph reinforcement learning framework in the fashion of advantage actor critic, to generate realistic features for injected nodes and seamlessly merge them into the original graph following the same topology characteristics. Through our extensive experiments on multiple acknowledged benchmark datasets, we demonstrate the superior performance of our proposed GA2C over existing state-of-the-art methods. The data and source code are publicly accessible at: https://github.com/jumxglhf/GA2C.  ( 2 min )
    Benchmarking the Linear Algebra Awareness of TensorFlow and PyTorch. (arXiv:2202.09888v1 [cs.MS])
    Linear algebra operations, which are ubiquitous in machine learning, form major performance bottlenecks. The High-Performance Computing community invests significant effort in the development of architecture-specific optimized kernels, such as those provided by the BLAS and LAPACK libraries, to speed up linear algebra operations. However, end users are progressively less likely to go through the error prone and time-consuming process of directly using said kernels; instead, frameworks such as TensorFlow (TF) and PyTorch (PyT), which facilitate the development of machine learning applications, are becoming more and more popular. Although such frameworks link to BLAS and LAPACK, it is not clear whether or not they make use of linear algebra knowledge to speed up computations. For this reason, in this paper we develop benchmarks to investigate the linear algebra optimization capabilities of TF and PyT. Our analyses reveal that a number of linear algebra optimizations are still missing; for instance, reducing the number of scalar operations by applying the distributive law, and automatically identifying the optimal parenthesization of a matrix chain. In this work, we focus on linear algebra computations in TF and PyT; we both expose opportunities for performance enhancement to the benefit of the developers of the frameworks and provide end users with guidelines on how to achieve performance gains.  ( 2 min )
    Learning logic programs by discovering where not to search. (arXiv:2202.09806v1 [cs.LG])
    The goal of inductive logic programming (ILP) is to search for a hypothesis that generalises training examples and background knowledge (BK). To improve performance, we introduce an approach that, before searching for a hypothesis, first discovers `where not to search'. We use given BK to discover constraints on hypotheses, such as that a number cannot be both even and odd. We use the constraints to bootstrap a constraint-driven ILP system. Our experiments on multiple domains (including program synthesis and inductive general game playing) show that our approach can substantially reduce learning times.  ( 2 min )
    The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication. (arXiv:2202.09653v1 [cs.LG])
    We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. We show that with no communication at all, such guarantees are, surprisingly, not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some regimes of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author and enjoys the same strong no-collision property, while our lower bound is based on a topological obstruction and holds even under full information.  ( 2 min )
    Learning a Shield from Catastrophic Action Effects: Never Repeat the Same Mistake. (arXiv:2202.09516v1 [cs.LG])
    Agents that operate in an unknown environment are bound to make mistakes while learning, including, at least occasionally, some that lead to catastrophic consequences. When humans make catastrophic mistakes, they are expected to learn never to repeat them, such as a toddler who touches a hot stove and immediately learns never to do so again. In this work we consider a novel class of POMDPs, called POMDP with Catastrophic Actions (POMDP-CA) in which pairs of states and actions are labeled as catastrophic. Agents that act in a POMDP-CA do not have a priori knowledge about which (state, action) pairs are catastrophic, thus they are sure to make mistakes when trying to learn any meaningful policy. Rather, their aim is to maximize reward while never repeating mistakes. As a first step of avoiding mistake repetition, we leverage the concept of a shield which prevents agents from executing specific actions from specific states. In particular, we store catastrophic mistakes (unsafe pairs of states and actions) that agents make in a database. Agents are then forbidden to pick actions that appear in the database. This approach is especially useful in a continual learning setting, where groups of agents perform a variety of tasks over time in the same underlying environment. In this setting, a task-agnostic shield can be constructed in a way that stores mistakes made by any agent, such that once one agent in a group makes a mistake the entire group learns to never repeat that mistake. This paper introduces a variant of the PPO algorithm that utilizes this shield, called ShieldPPO, and empirically evaluates it in a controlled environment. Results indicate that ShieldPPO outperforms PPO, as well as baseline methods from the safe reinforcement learning literature, in a range of settings.  ( 2 min )
    An Analysis of Complex-Valued CNNs for RF Data-Driven Wireless Device Classification. (arXiv:2202.09777v1 [cs.LG])
    Recent deep neural network-based device classification studies show that complex-valued neural networks (CVNNs) yield higher classification accuracy than real-valued neural networks (RVNNs). Although this improvement is (intuitively) attributed to the complex nature of the input RF data (i.e., IQ symbols), no prior work has taken a closer look into analyzing such a trend in the context of wireless device identification. Our study provides a deeper understanding of this trend using real LoRa and WiFi RF datasets. We perform a deep dive into understanding the impact of (i) the input representation/type and (ii) the architectural layer of the neural network. For the input representation, we considered the IQ as well as the polar coordinates both partially and fully. For the architectural layer, we considered a series of ablation experiments that eliminate parts of the CVNN components. Our results show that CVNNs consistently outperform RVNNs counterpart in the various scenarios mentioned above, indicating that CVNNs are able to make better use of the joint information provided via the in-phase (I) and quadrature (Q) components of the signal.  ( 2 min )
    Explaining, Evaluating and Enhancing Neural Networks' Learned Representations. (arXiv:2202.09374v1 [cs.LG])
    Most efforts in interpretability in deep learning have focused on (1) extracting explanations of a specific downstream task in relation to the input features and (2) imposing constraints on the model, often at the expense of predictive performance. New advances in (unsupervised) representation learning and transfer learning, however, raise the need for an explanatory framework for networks that are trained without a specific downstream task. We address these challenges by showing how explainability can be an aid, rather than an obstacle, towards better and more efficient representations. Specifically, we propose a natural aggregation method generalizing attribution maps between any two (convolutional) layers of a neural network. Additionally, we employ such attributions to define two novel scores for evaluating the informativeness and the disentanglement of latent embeddings. Extensive experiments show that the proposed scores do correlate with the desired properties. We also confirm and extend previously known results concerning the independence of some common saliency strategies from the model parameters. Finally, we show that adopting our proposed scores as constraints during the training of a representation learning task improves the downstream performance of the model.  ( 2 min )
    Interacting Contour Stochastic Gradient Langevin Dynamics. (arXiv:2202.09867v1 [stat.ML])
    We propose an interacting contour stochastic gradient Langevin dynamics (ICSGLD) sampler, an embarrassingly parallel multiple-chain contour stochastic gradient Langevin dynamics (CSGLD) sampler with efficient interactions. We show that ICSGLD can be theoretically more efficient than a single-chain CSGLD with an equivalent computational budget. We also present a novel random-field function, which facilitates the estimation of self-adapting parameters in big data and obtains free mode explorations. Empirically, we compare the proposed algorithm with popular benchmark methods for posterior sampling. The numerical results show a great potential of ICSGLD for large-scale uncertainty estimation tasks.  ( 2 min )
    TransDreamer: Reinforcement Learning with Transformer World Models. (arXiv:2202.09481v1 [cs.LG])
    The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks and thus an important question is how an MBRL framework can benefit from the recent advances of transformers and what the challenges are in doing so. In this paper, we propose a transformer-based MBRL agent, called TransDreamer. We first introduce the Transformer State-Space Model, a world model that leverages a transformer for dynamics predictions. We then share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent. In experiments, we apply the proposed model to 2D visual RL and 3D first-person visual RL tasks both requiring long-range memory access for memory-based reasoning. We show that the proposed model outperforms Dreamer in these complex tasks.  ( 2 min )
    Truncated Diffusion Probabilistic Models. (arXiv:2202.09671v1 [stat.ML])
    Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain to invert the forward diffusion process. To achieve competitive data generation performance, they demand a long diffusion chain that makes them computationally intensive in not only training but also generation. To significantly improve the computation efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution, rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of the generation performance and the number of required inverse diffusion steps.  ( 2 min )
    Learning to Control Partially Observed Systems with Finite Memory. (arXiv:2202.09753v1 [cs.LG])
    We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.  ( 2 min )
    Mixed Effects Neural ODE: A Variational Approximation for Analyzing the Dynamics of Panel Data. (arXiv:2202.09463v1 [cs.LG])
    Panel data involving longitudinal measurements of the same set of participants taken over multiple time points is common in studies to understand childhood development and disease modeling. Deep hybrid models that marry the predictive power of neural networks with physical simulators such as differential equations, are starting to drive advances in such applications. The task of modeling not just the observations but the hidden dynamics that are captured by the measurements poses interesting statistical/computational questions. We propose a probabilistic model called ME-NODE to incorporate (fixed + random) mixed effects for analyzing such panel data. We show that our model can be derived using smooth approximations of SDEs provided by the Wong-Zakai theorem. We then derive Evidence Based Lower Bounds for ME-NODE, and develop (efficient) training algorithms using MC based sampling methods and numerical ODE solvers. We demonstrate ME-NODE's utility on tasks spanning the spectrum from simulations and toy data to real longitudinal 3D imaging data from an Alzheimer's disease (AD) study, and study its performance in terms of accuracy of reconstruction for interpolation, uncertainty estimates and personalized prediction.  ( 2 min )
    From Quantum Graph Computing to Quantum Graph Learning: A Survey. (arXiv:2202.09506v1 [quant-ph])
    Quantum computing (QC) is a new computational paradigm whose foundations relate to quantum physics. Notable progress has been made, driving the birth of a series of quantum-based algorithms that take advantage of quantum computational power. In this paper, we provide a targeted survey of the development of QC for graph-related tasks. We first elaborate the correlations between quantum mechanics and graph theory to show that quantum computers are able to generate useful solutions that can not be produced by classical systems efficiently for some problems related to graphs. For its practicability and wide-applicability, we give a brief review of typical graph learning techniques designed for various tasks. Inspired by these powerful methods, we note that advanced quantum algorithms have been proposed for characterizing the graph structures. We give a snapshot of quantum graph learning where expectations serve as a catalyst for subsequent research. We further discuss the challenges of using quantum algorithms in graph learning, and future directions towards more flexible and versatile quantum graph learning solvers.  ( 2 min )
    Mixture-of-Experts with Expert Choice Routing. (arXiv:2202.09368v1 [cs.LG])
    Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.  ( 2 min )
    Numeric Encoding Options with Automunge. (arXiv:2202.09496v1 [cs.LG])
    Mainstream practice in machine learning with tabular data may take for granted that any feature engineering beyond scaling for numeric sets is superfluous in context of deep neural networks. This paper will offer arguments for potential benefits of extended encodings of numeric streams in deep learning by way of a survey of options for numeric transformations as available in the Automunge open source python library platform for tabular data pipelines, where transformations may be applied to distinct columns in "family tree" sets with generations and branches of derivations. Automunge transformation options include normalization, binning, noise injection, derivatives, and more. The aggregation of these methods into family tree sets of transformations are demonstrated for use to present numeric features to machine learning in multiple configurations of varying information content, as may be applied to encode numeric sets of unknown interpretation. Experiments demonstrate the realization of a novel generalized solution to data augmentation by noise injection for tabular learning, as may materially benefit model performance in applications with underserved training data.  ( 2 min )
    A Regularized Implicit Policy for Offline Reinforcement Learning. (arXiv:2202.09673v1 [stat.ML])
    Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. For training more effective agents, we propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. We further propose a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen--Shannon divergence and the integral probability metrics. We theoretically show the correctness of the policy-matching approach, and the correctness and a good finite-sample property of our modification. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.  ( 2 min )
    Parsed Categoric Encodings with Automunge. (arXiv:2202.09498v1 [cs.LG])
    The Automunge open source python library platform for tabular data pre-processing automates feature engineering data transformations of numerical encoding and missing data infill to received tidy data on bases fit to properties of columns in a designated train set for consistent and efficient application to subsequent data pipelines such as for inference, where transformations may be applied to distinct columns in "family tree" sets with generations and branches of derivations. Included in the library of transformations are methods to extract structure from bounded categorical string sets by way of automated string parsing, in which comparisons between entries in the set of unique values are parsed to identify character subset overlaps which may be encoded by appended columns of boolean overlap detection activations or by replacing string entries with identified overlap partitions. Further string parsing options, which may also be applied to unbounded categoric sets, include extraction of numeric substring partitions from entries or search functions to identify presence of specified substring partitions. The aggregation of these methods into "family tree" sets of transformations are demonstrated for use to automatically extract structure from categoric string compositions in relation to the set of entries in a column, such as may be applied to prepare categoric string set encodings for machine learning without human intervention.  ( 2 min )
  • Open

    Imbalanced Malware Images Classification: a CNN based Approach. (arXiv:1708.08042v2 [cs.CV] UPDATED)
    Deep convolutional neural networks (CNNs) can be applied to malware binary detection via image classification. The performance, however, is degraded due to the imbalance of malware families (classes). To mitigate this issue, we propose a simple yet effective weighted softmax loss which can be employed as the final layer of deep CNNs. The original softmax loss is weighted, and the weight value can be determined according to class size. A scaling parameter is also included in computing the weight. Proper selection of this parameter is studied and an empirical option is suggested. The weighted loss aims at alleviating the impact of data imbalance in an end-to-end learning fashion. To validate the efficacy, we deploy the proposed weighted loss in a pre-trained deep CNN model and fine-tune it to achieve promising results on malware images classification. Extensive experiments also demonstrate that the new loss function can well fit other typical CNNs, yielding an improved classification performance.  ( 2 min )
    Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty. (arXiv:2110.08607v2 [cs.LG] UPDATED)
    In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework is especially targeted to the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where it is typically intractable to perform exact inference of latent variables. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space is often prone to lack physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modelling framework for unsupervised learning and identification for nonlinear dynamical systems. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential to generalization and prediction capabilities.  ( 2 min )
    Node Feature Kernels Increase Graph Convolutional Network Robustness. (arXiv:2109.01785v3 [cs.LG] UPDATED)
    The robustness of the much-used Graph Convolutional Networks (GCNs) to perturbations of their input is becoming a topic of increasing importance. In this paper, the random GCN is introduced for which a random matrix theory analysis is possible. This analysis suggests that if the graph is sufficiently perturbed, or in the extreme case random, then the GCN fails to benefit from the node features. It is furthermore observed that enhancing the message passing step in GCNs by adding the node feature kernel to the adjacency matrix of the graph structure solves this problem. An empirical study of a GCN utilised for node classification on six real datasets further confirms the theoretical findings and demonstrates that perturbations of the graph structure can result in GCNs performing significantly worse than Multi-Layer Perceptrons run on the node features alone. In practice, adding a node feature kernel to the message passing of perturbed graphs results in a significant improvement of the GCN's performance, thereby rendering it more robust to graph perturbations. Our code is publicly available at:https://github.com/ChangminWu/RobustGCN.  ( 2 min )
    Sqrt(d) Dimension Dependence of Langevin Monte Carlo. (arXiv:2109.03839v3 [cs.LG] UPDATED)
    This article considers the popular MCMC method of unadjusted Langevin Monte Carlo (LMC) and provides a non-asymptotic analysis of its sampling error in 2-Wasserstein distance. The proof is based on a refinement of mean-square analysis in Li et al. (2019), and this refined framework automates the analysis of a large class of sampling algorithms based on discretizations of contractive SDEs. Using this framework, we establish an $\tilde{O}(\sqrt{d}/\epsilon)$ mixing time bound for LMC, without warm start, under the common log-smooth and log-strongly-convex conditions, plus a growth condition on the 3rd-order derivative of the potential of target measures. This bound improves the best previously known $\tilde{O}(d/\epsilon)$ result and is optimal (in terms of order) in both dimension $d$ and accuracy tolerance $\epsilon$ for target measures satisfying the aforementioned assumptions. Our theoretical analysis is further validated by numerical experiments.  ( 2 min )
    On the Complexity of Approximating Multimarginal Optimal Transport. (arXiv:1910.00152v3 [stat.ML] UPDATED)
    We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{O}(n^{3m})$. We then propose two simple and \textit{deterministic} algorithms for approximating the MOT problem. The first algorithm, which we refer to as \textit{multimarginal Sinkhorn} algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{O}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first \textit{near-linear time} complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as \textit{accelerated multimarginal Sinkhorn} algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{O}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm~\citep{Tupitsa-2020-Multimarginal} in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver \textsc{Gurobi}. Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms.  ( 2 min )
    Low-Dimensional High-Fidelity Kinetic Models for NOX Formation by a Compute Intensification Method. (arXiv:2202.10194v1 [physics.chem-ph])
    A novel compute intensification methodology to the construction of low-dimensional, high-fidelity "compact" kinetic models for NOX formation is designed and demonstrated. The method adapts the data intensive Machine Learned Optimization of Chemical Kinetics (MLOCK) algorithm for compact model generation by the use of a Latin Square method for virtual reaction network generation. A set of logical rules are defined which construct a minimally sized virtual reaction network comprising three additional nodes (N, NO, NO2). This NOX virtual reaction network is appended to a pre-existing compact model for methane combustion comprising fifteen nodes. The resulting eighteen node virtual reaction network is processed by the MLOCK coded algorithm to produce a plethora of compact model candidates for NOX formation during methane combustion. MLOCK automatically; populates the terms of the virtual reaction network with candidate inputs; measures the success of the resulting compact model candidates (in reproducing a broad set of gas turbine industry-defined performance targets); selects regions of input parameters space showing models of best performance; refines the input parameters to give better performance; and makes an ultimate selection of the best performing model or models. By this method, it is shown that a number of compact model candidates exist that show fidelities in excess of 75% in reproducing industry defined performance targets, with one model valid to >75% across fuel/air equivalence ratios of 0.5-1.0. However, to meet the full fuel/air equivalence ratio performance envelope defined by industry, we show that with this minimal virtual reaction network, two further compact models are required.  ( 2 min )
    Double-descent curves in neural networks: a new perspective using Gaussian processes. (arXiv:2102.07238v4 [stat.ML] UPDATED)
    Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters which is less than the number of data points, but then descends again in the overparameterized regime. Here we use a neural network Gaussian process (NNGP) which maps exactly to a fully connected network (FCN) in the infinite-width limit, combined with techniques from random matrix theory, to calculate this generalisation behaviour. An advantage of our NNGP approach is that the analytical calculations are easier to interpret. We argue that the fact that the generalisation error of neural networks decreases in the overparameterized regime and has a finite theoretical value is explained by the convergence to their limiting Gaussian processes. Our analysis thus provides a mathematical explanation for a surprising phenomenon that could not explained by conventional statistical learning theory. However, understanding what drives these finite theoretical values to be the state-of-the-art generalisation performances in many applications remains an open question, for which we only provide new leads in this paper.  ( 2 min )
    Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization. (arXiv:2107.12580v2 [cs.LG] UPDATED)
    Central to the success of artificial neural networks is their ability to generalize. But does neural network generalization primarily rely on seeing highly similar training examples (memorization)? Or are neural networks capable of human-intelligence styled reasoning, and if so, to what extent? These remain fundamental open questions on artificial neural networks. In this paper, as steps towards answering these questions, we introduce a new benchmark, Pointer Value Retrieval (PVR) to study the limits of neural network reasoning. The PVR suite of tasks is based on reasoning about indirection, a hallmark of human intelligence, where a first stage (task) contains instructions for solving a second stage (task). In PVR, this is done by having one part of the task input act as a pointer, giving instructions on a different input location, which forms the output. We show this simple rule can be applied to create a diverse set of tasks across different input modalities and configurations. Importantly, this use of indirection enables systematically varying task difficulty through distribution shifts and increasing functional complexity. We conduct a detailed empirical study of different PVR tasks, discovering large variations in performance across dataset sizes, neural network architectures and task complexity. Further, by incorporating distribution shift and increased functional complexity, we develop nuanced tests for reasoning, revealing subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.  ( 2 min )
    Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization. (arXiv:2103.01499v2 [cs.LG] UPDATED)
    Batch Normalization (BN) is a commonly used technique to accelerate and stabilize training of deep neural networks. Despite its empirical success, a full theoretical understanding of BN is yet to be developed. In this work, we analyze BN through the lens of convex optimization. We introduce an analytic framework based on convex duality to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial-time. Our analyses also show that optimal layer weights can be obtained as simple closed-form formulas in the high-dimensional and/or overparameterized regimes. Furthermore, we find that Gradient Descent provides an algorithmic bias effect on the standard non-convex BN network, and we design an approach to explicitly encode this implicit regularization into the convex objective. Experiments with CIFAR image classification highlight the effectiveness of this explicit regularization for mimicking and substantially improving the performance of standard BN networks.  ( 2 min )
    On the Necessity of Auditable Algorithmic Definitions for Machine Unlearning. (arXiv:2110.11891v2 [cs.LG] UPDATED)
    Machine unlearning, i.e. having a model forget about some of its training data, has become increasingly more important as privacy legislation promotes variants of the right-to-be-forgotten. In the context of deep learning, approaches for machine unlearning are broadly categorized into two classes: exact unlearning methods, where an entity has formally removed the data point's impact on the model by retraining the model from scratch, and approximate unlearning, where an entity approximates the model parameters one would obtain by exact unlearning to save on compute costs. In this paper, we first show that the definition that underlies approximate unlearning, which seeks to prove the approximately unlearned model is close to an exactly retrained model, is incorrect because one can obtain the same model using different datasets. Thus one could unlearn without modifying the model at all. We then turn to exact unlearning approaches and ask how to verify their claims of unlearning. Our results show that even for a given training trajectory one cannot formally prove the absence of certain data points used during training. We thus conclude that unlearning is only well-defined at the algorithmic level, where an entity's only possible auditable claim to unlearning is that they used a particular algorithm designed to allow for external scrutiny during an audit.  ( 2 min )
    A bandit-learning approach to multifidelity approximation. (arXiv:2103.15342v3 [math.NA] UPDATED)
    Multifidelity approximation is an important technique in scientific computation and simulation. In this paper, we introduce a bandit-learning approach for leveraging data of varying fidelities to achieve precise estimates of the parameters of interest. Under a linear model assumption, we formulate a multifidelity approximation as a modified stochastic bandit, and analyze the loss for a class of policies that uniformly explore each model before exploiting. Utilizing the estimated conditional mean-squared error, we propose a consistent algorithm, adaptive Explore-Then-Commit (AETC), and establish a corresponding trajectory-wise optimality result. These results are then extended to the case of vector-valued responses, where we demonstrate that the algorithm is efficient without the need to worry about estimating high-dimensional parameters. The main advantage of our approach is that we require neither hierarchical model structure nor \textit{a priori} knowledge of statistical information (e.g., correlations) about or between models. Instead, the AETC algorithm requires only knowledge of which model is a trusted high-fidelity model, along with (relative) computational cost estimates of querying each model. Numerical experiments are provided at the end to support our theoretical findings.  ( 2 min )
    Conformer-based Hybrid ASR System for Switchboard Dataset. (arXiv:2111.03442v2 [cs.CL] UPDATED)
    The recently proposed conformer architecture has been successfully used for end-to-end automatic speech recognition (ASR) architectures achieving state-of-the-art performance on different datasets. To our best knowledge, the impact of using conformer acoustic model for hybrid ASR is not investigated. In this paper, we present and evaluate a competitive conformer-based hybrid model training recipe. We study different training aspects and methods to improve word-error-rate as well as to increase training speed. We apply time downsampling methods for efficient training and use transposed convolutions to upsample the output sequence again. We conduct experiments on Switchboard 300h dataset and our conformer-based hybrid model achieves competitive results compared to other architectures. It generalizes very well on Hub5'01 test set and outperforms the BLSTM-based hybrid model significantly.  ( 2 min )
    Deep kernel machines: exact inference with representation learning in infinite Bayesian neural networks. (arXiv:2108.13097v3 [stat.ML] UPDATED)
    Deep neural networks (DNNs) with the flexibility to learn good top-layer representations have eclipsed shallow kernel methods without that flexibility. Here, we take inspiration from DNNs to develop the deep kernel machine. Optimizing the deep kernel machine objective is equivalent to exact Bayesian inference (or noisy gradient descent) in an infinitely wide Bayesian neural network or deep Gaussian process, which has been scaled carefully to retain representation learning. Our work thus has important implications for theoretical understanding of neural networks. In addition, we show that the deep kernel machine objective has more desirable properties and better performance than other choices of objective. Finally, we conjecture that the deep kernel machine objective is unimodal. We give a proof of unimodality for linear kernels, and a number of experiments in the nonlinear case in which all deep kernel machines initializations we tried converged to the same solution.  ( 2 min )
    CAMul: Calibrated and Accurate Multi-view Time-Series Forecasting. (arXiv:2109.07438v2 [cs.LG] UPDATED)
    Probabilistic time-series forecasting enables reliable decision making across many domains. Most forecasting problems have diverse sources of data containing multiple modalities and structures. Leveraging information as well as uncertainty from these data sources for well-calibrated and accurate forecasts is an important challenging problem. Most previous work on multi-modal learning and forecasting simply aggregate intermediate representations from each data view by simple methods of summation or concatenation and do not explicitly model uncertainty for each data-view. We propose a general probabilistic multi-view forecasting framework CAMul, that can learn representations and uncertainty from diverse data sources. It integrates the knowledge and uncertainty from each data view in a dynamic context-specific manner assigning more importance to useful views to model a well-calibrated forecast distribution. We use CAMul for multiple domains with varied sources and modalities and show that CAMul outperforms other state-of-art probabilistic forecasting models by over 25\% in accuracy and calibration.  ( 2 min )
    Post-hoc Calibration of Neural Networks by g-Layers. (arXiv:2006.12807v2 [cs.LG] UPDATED)
    Calibration of neural networks is a critical aspect to consider when incorporating machine learning models in real-world decision-making systems where the confidence of decisions are equally important as the decisions themselves. In recent years, there is a surge of research on neural network calibration and the majority of the works can be categorized into post-hoc calibration methods, defined as methods that learn an additional function to calibrate an already trained base network. In this work, we intend to understand the post-hoc calibration methods from a theoretical point of view. Especially, it is known that minimizing Negative Log-Likelihood (NLL) will lead to a calibrated network on the training set if the global optimum is attained (Bishop, 1994). Nevertheless, it is not clear learning an additional function in a post-hoc manner would lead to calibration in the theoretical sense. To this end, we prove that even though the base network ($f$) does not lead to the global optimum of NLL, by adding additional layers ($g$) and minimizing NLL by optimizing the parameters of $g$ one can obtain a calibrated network $g \circ f$. This not only provides a less stringent condition to obtain a calibrated network but also provides a theoretical justification of post-hoc calibration methods. Our experiments on various image classification benchmarks confirm the theory.  ( 2 min )
    MultiRocket: Multiple pooling operators and transformations for fast and effective time series classification. (arXiv:2102.00457v4 [cs.LG] UPDATED)
    We propose MultiRocket, a fast time series classification (TSC) algorithm that achieves state-of-the-art performance with a tiny fraction of the time and without the complex ensembling structure of many state-of-the-art methods. MultiRocket improves on MiniRocket, one of the fastest TSC algorithms to date, by adding multiple pooling operators and transformations to improve the diversity of the features generated. In addition to processing the raw input series, MultiRocket also applies first order differences to transform the original series. Convolutions are applied to both representations, and four pooling operators are applied to the convolution outputs. When benchmarked using the University of California Riverside TSC benchmark datasets, MultiRocket is significantly more accurate than MiniRocket, and competitive with the best ranked current method in terms of accuracy, HIVE-COTE 2.0, while being orders of magnitude faster.  ( 2 min )
    The Hyperspherical Geometry of Community Detection: Modularity as a Distance. (arXiv:2107.02645v2 [cs.SI] UPDATED)
    We introduce a metric space of clusterings, where clusterings are described by a binary vector indexed by the vertex-pairs. We extend this geometry to a hypersphere and prove that maximizing modularity is equivalent to minimizing the angular distance to some modularity vector over the set of clustering vectors. In that sense, modularity-based community detection methods can be seen as a subclass of a more general class of projection methods, which we define as the community detection methods that adhere to the following two-step procedure: first, mapping the network to a point on the hypersphere; second, projecting this point to the set of clustering vectors. We show that this class of projection methods contains many interesting community detection methods. Many of these new methods cannot be described in terms of null models and resolution parameters, as is customary for modularity-based methods. We provide a new characterization of such methods in terms of meridians and latitudes of the hypersphere. In addition, by relating the modularity resolution parameter to the latitude of the corresponding modularity vector, we obtain a new interpretation of the resolution limit that modularity maximization is known to suffer from.  ( 2 min )
    Signal Processing on Higher-Order Networks: Livin' on the Edge ... and Beyond. (arXiv:2101.05510v4 [cs.SI] UPDATED)
    In this tutorial, we provide a didactic treatment of the emerging topic of signal processing on higher-order networks. Drawing analogies from discrete and graph signal processing, we introduce the building blocks for processing data on simplicial complexes and hypergraphs, two common higher-order network abstractions that can incorporate polyadic relationships. We provide brief introductions to simplicial complexes and hypergraphs, with a special emphasis on the concepts needed for the processing of signals supported on these structures. Specifically, we discuss Fourier analysis, signal denoising, signal interpolation, node embeddings, and nonlinear processing through neural networks, using these two higher-order network models. In the context of simplicial complexes, we specifically focus on signal processing using the Hodge Laplacian matrix, a multi-relational operator that leverages the special structure of simplicial complexes and generalizes desirable properties of the Laplacian matrix in graph signal processing. For hypergraphs, we present both matrix and tensor representations, and discuss the trade-offs in adopting one or the other. We also highlight limitations and potential research avenues, both to inform practitioners and to motivate the contribution of new researchers to the area.  ( 2 min )
    A Regularized Implicit Policy for Offline Reinforcement Learning. (arXiv:2202.09673v1 [stat.ML])
    Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment. The lack of environmental interactions makes the policy training vulnerable to state-action pairs far from the training dataset and prone to missing rewarding actions. For training more effective agents, we propose a framework that supports learning a flexible yet well-regularized fully-implicit policy. We further propose a simple modification to the classical policy-matching methods for regularizing with respect to the dual form of the Jensen--Shannon divergence and the integral probability metrics. We theoretically show the correctness of the policy-matching approach, and the correctness and a good finite-sample property of our modification. An effective instantiation of our framework through the GAN structure is provided, together with techniques to explicitly smooth the state-action mapping for robust generalization beyond the static dataset. Extensive experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.  ( 2 min )
    Inferring Lexicographically-Ordered Rewards from Preferences. (arXiv:2202.10153v1 [cs.LG])
    Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward function is not adequate to represent such preferences. This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences. We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities. We offer two example applications in healthcare, one inspired by cancer treatment, the other inspired by organ transplantation, to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker's preferences and help improve policies when used in reinforcement learning.  ( 2 min )
    Robustness and Accuracy Could Be Reconcilable by (Proper) Definition. (arXiv:2202.10103v1 [cs.LG])
    The trade-off between robustness and accuracy has been widely studied in the adversarial literature. Although still controversial, the prevailing view is that this trade-off is inherent, either empirically or theoretically. Thus, we dig for the origin of this trade-off in adversarial training and find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance -- an overcorrection towards smoothness. Given this, we advocate employing local equivariance to describe the ideal behavior of a robust model, leading to a self-consistent robust error named SCORE. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty via robust optimization. By simply substituting KL divergence with variants of distance metrics, SCORE can be efficiently minimized. Empirically, our models achieve top-rank performance on RobustBench under AutoAttack. Besides, SCORE provides instructive insights for explaining the overfitting phenomenon and semantic input gradients observed on robust models.  ( 2 min )
    Learning to Control Partially Observed Systems with Finite Memory. (arXiv:2202.09753v1 [cs.LG])
    We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.  ( 2 min )
    How Good are Low-Rank Approximations in Gaussian Process Regression?. (arXiv:2112.06410v3 [stat.ML] UPDATED)
    We provide guarantees for approximate Gaussian Process (GP) regression resulting from two common low-rank kernel approximations: based on random Fourier features, and based on truncating the kernel's Mercer expansion. In particular, we bound the Kullback-Leibler divergence between an exact GP and one resulting from one of the afore-described low-rank approximations to its kernel, as well as between their corresponding predictive densities, and we also bound the error between predictive mean vectors and between predictive covariance matrices computed using the exact versus using the approximate GP. We provide experiments on both simulated data and standard benchmarks to evaluate the effectiveness of our theoretical bounds.  ( 2 min )
    Generalized Bayesian Additive Regression Trees Models: Beyond Conditional Conjugacy. (arXiv:2202.09924v1 [stat.ML])
    Bayesian additive regression trees have seen increased interest in recent years due to their ability to combine machine learning techniques with principled uncertainty quantification. The Bayesian backfitting algorithm used to fit BART models, however, limits their application to a small class of models for which conditional conjugacy exists. In this article, we greatly expand the domain of applicability of BART to arbitrary \emph{generalized BART} models by introducing a very simple, tuning-parameter-free, reversible jump Markov chain Monte Carlo algorithm. Our algorithm requires only that the user be able to compute the likelihood and (optionally) its gradient and Fisher information. The potential applications are very broad; we consider examples in survival analysis, structured heteroskedastic regression, and gamma shape regression.  ( 2 min )
    PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior. (arXiv:2106.06406v2 [stat.ML] UPDATED)
    Denoising diffusion probabilistic models have been recently proposed to generate high-quality samples by estimating the gradient of the data density. The framework defines the prior noise as a standard Gaussian distribution, whereas the corresponding data distribution may be more complicated than the standard Gaussian distribution, which potentially introduces inefficiency in denoising the prior noise into the data sample because of the discrepancy between the data and the prior. In this paper, we propose PriorGrad to improve the efficiency of the conditional diffusion model for speech synthesis (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive prior derived from the data statistics based on the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the speech synthesis domain, we consider the recently proposed diffusion-based speech generative models based on both the spectral and time domains and show that PriorGrad achieves faster convergence and inference with superior performance, leading to an improved perceptual quality and robustness to a smaller network capacity, and thereby demonstrating the efficiency of a data-dependent adaptive prior.  ( 2 min )
    Bad priors, their threats to Bayesian optimization and some remedies via prior learning. (arXiv:2109.08215v2 [cs.LG] UPDATED)
    The performance of deep neural networks can be highly sensitive to the choice of a variety of meta-parameters, such as optimizer parameters and model hyperparameters. Tuning these well, however, often requires extensive and costly experimentation. Bayesian optimization (BO) is a principled approach to solve such expensive hyperparameter tuning problems efficiently. Key to the performance of BO is specifying and refining a distribution over functions, which is used to reason about the optima of the underlying function being optimized. In this work, we consider the scenario where we have data from similar functions that allows us to specify a tighter distribution a priori. Specifically, we focus on the common but potentially costly task of tuning optimizer parameters for training neural networks. Building on the meta BO method from Wang et al. (2018), we develop practical improvements that (a) boost its performance by leveraging tuning results on multiple tasks without requiring observations for the same meta-parameter points across all tasks, and (b) retain its regret bound for a special case of our method. As a result, we provide a coherent BO solution for iterative optimization of continuous optimizer parameters. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.  ( 2 min )
    Adversarially Robust Kernel Smoothing. (arXiv:2102.08474v4 [cs.LG] UPDATED)
    We propose a scalable robust learning algorithm combining kernel smoothing and robust optimization. Our method is motivated by the convex analysis perspective of distributionally robust optimization based on probability metrics, such as the Wasserstein distance and the maximum mean discrepancy. We adapt the integral operator using supremal convolution in convex analysis to form a novel function majorant used for enforcing robustness. Our method is simple in form and applies to general loss functions and machine learning models. Exploiting a connection with optimal transport, we prove theoretical guarantees for certified robustness under distribution shift. Furthermore, we report experiments with general machine learning models, such as deep neural networks, to demonstrate competitive performance with the state-of-the-art certifiable robust learning algorithms based on the Wasserstein distance.  ( 2 min )
    Interacting Contour Stochastic Gradient Langevin Dynamics. (arXiv:2202.09867v1 [stat.ML])
    We propose an interacting contour stochastic gradient Langevin dynamics (ICSGLD) sampler, an embarrassingly parallel multiple-chain contour stochastic gradient Langevin dynamics (CSGLD) sampler with efficient interactions. We show that ICSGLD can be theoretically more efficient than a single-chain CSGLD with an equivalent computational budget. We also present a novel random-field function, which facilitates the estimation of self-adapting parameters in big data and obtains free mode explorations. Empirically, we compare the proposed algorithm with popular benchmark methods for posterior sampling. The numerical results show a great potential of ICSGLD for large-scale uncertainty estimation tasks.  ( 2 min )
    Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression. (arXiv:2202.09889v1 [stat.ML])
    We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.  ( 2 min )
    Polytopic Matrix Factorization: Determinant Maximization Based Criterion and Identifiability. (arXiv:2202.09638v1 [stat.ML])
    We introduce Polytopic Matrix Factorization (PMF) as a novel data decomposition approach. In this new framework, we model input data as unknown linear transformations of some latent vectors drawn from a polytope. In this sense, the article considers a semi-structured data model, in which the input matrix is modeled as the product of a full column rank matrix and a matrix containing samples from a polytope as its column vectors. The choice of polytope reflects the presumed features of the latent components and their mutual relationships. As the factorization criterion, we propose the determinant maximization (Det-Max) for the sample autocorrelation matrix of the latent vectors. We introduce a sufficient condition for identifiability, which requires that the convex hull of the latent vectors contains the maximum volume inscribed ellipsoid of the polytope with a particular tightness constraint. Based on the Det-Max criterion and the proposed identifiability condition, we show that all polytopes that satisfy a particular symmetry restriction qualify for the PMF framework. Having infinitely many polytope choices provides a form of flexibility in characterizing latent vectors. In particular, it is possible to define latent vectors with heterogeneous features, enabling the assignment of attributes such as nonnegativity and sparsity at the subvector level. The article offers examples illustrating the connection between polytope choices and the corresponding feature representations.  ( 2 min )
    Bayes-Optimal Classifiers under Group Fairness. (arXiv:2202.09724v1 [stat.ML])
    Machine learning algorithms are becoming integrated into more and more high-stakes decision-making processes, such as in social welfare issues. Due to the need of mitigating the potentially disparate impacts from algorithmic predictions, many approaches have been proposed in the emerging area of fair machine learning. However, the fundamental problem of characterizing Bayes-optimal classifiers under various group fairness constraints is not well understood as a theoretical benchmark. Based on the classical Neyman-Pearson argument (Neyman and Pearson, 1933; Shao, 2003) for optimal hypothesis testing, this paper provides a general framework for deriving Bayes-optimal classifiers under group fairness. This enables us to propose a group-based thresholding method that can directly control disparity, and more importantly, achieve an optimal fairness-accuracy tradeoff. These advantages are supported by experiments.  ( 2 min )
    Deconstructing Distributions: A Pointwise Framework of Learning. (arXiv:2202.09931v1 [cs.LG])
    In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021)  ( 2 min )
    On Optimal Early Stopping: Over-informative versus Under-informative Parametrization. (arXiv:2202.09885v1 [cs.LG])
    Early stopping is a simple and widely used method to prevent over-training neural networks. We develop theoretical results to reveal the relationship between the optimal early stopping time and model dimension as well as sample size of the dataset for certain linear models. Our results demonstrate two very different behaviors when the model dimension exceeds the number of features versus the opposite scenario. While most previous works on linear models focus on the latter setting, we observe that the dimension of the model often exceeds the number of features arising from data in common deep learning tasks and propose a model to study this setting. We demonstrate experimentally that our theoretical results on optimal early stopping time corresponds to the training process of deep neural networks.  ( 2 min )
    RELAX: Representation Learning Explainability. (arXiv:2112.10161v2 [stat.ML] UPDATED)
    Despite the significant improvements that representation learning via self-supervision has led to when learning from unlabeled data, no methods exist that explain what influences the learned representation. We address this need through our proposed approach, RELAX, which is the first approach for attribution-based explanations of representations. Our approach can also model the uncertainty in its explanations, which is essential to produce trustworthy explanations. RELAX explains representations by measuring similarities in the representation space between an input and masked out versions of itself, providing intuitive explanations and significantly outperforming the gradient-based baseline. We provide theoretical interpretations of RELAX and conduct a novel analysis of feature extractors trained using supervised and unsupervised learning, providing insights into different learning strategies. Finally, we illustrate the usability of RELAX in multi-view clustering and highlight that incorporating uncertainty can be essential for providing low-complexity explanations, taking a crucial step towards explaining representations.  ( 2 min )
    Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems. (arXiv:2201.01680v2 [cs.LG] UPDATED)
    This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.  ( 2 min )
    Efficient Manifold and Subspace Approximations with Spherelets. (arXiv:1706.08263v4 [stat.ML] UPDATED)
    In statistical dimensionality reduction, it is common to rely on the assumption that high dimensional data tend to concentrate near a lower dimensional manifold. There is a rich literature on approximating the unknown manifold, and on exploiting such approximations in clustering, data compression, and prediction. Most of the literature relies on linear or locally linear approximations. In this article, we propose a simple and general alternative, which instead uses spheres, an approach we refer to as spherelets. We develop spherical principal components analysis (SPCA), and provide theory on the convergence rate for global and local SPCA, while showing that spherelets can provide lower covering numbers and MSEs for many manifolds. Results relative to state-of-the-art competitors show gains in ability to accurately approximate manifolds with fewer components. Unlike most competitors, which simply output lower-dimensional features, our approach projects data onto the estimated manifold to produce fitted values that can be used for model assessment and cross validation. The methods are illustrated with applications to multiple data sets.  ( 2 min )
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v4 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed actions. Our IRL algorithm identifies optimality and then constructs set valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.  ( 2 min )
    Image Response Regression via Deep Neural Networks. (arXiv:2006.09911v3 [stat.ML] UPDATED)
    Delineating the associations between images and a vector of covariates is of central interest in medical imaging studies. To tackle this problem of image response regression, we propose a novel nonparametric approach in the framework of spatially varying coefficient models, where the spatially varying functions are estimated through deep neural networks. Compared to existing solutions, the proposed method explicitly accounts for spatial smoothness and subject heterogeneity, has straightforward interpretations, and is highly flexible and accurate in capturing complex association patterns. A key idea in our approach is to treat the image voxels as the effective samples, which not only alleviates the limited sample size issue that haunts the majority of medical imaging studies, but also leads to more robust and reproducible results. Focusing on a broad family of piecewise smooth functions, we establish the estimation and selection consistency, and derive the asymptotic error bounds. We demonstrate the efficacy of the method through intensive simulations, and further illustrate its advantages with analyses of two functional magnetic resonance imaging datasets.  ( 2 min )
    On out-of-distribution detection with Bayesian neural networks. (arXiv:2110.06020v2 [cs.LG] UPDATED)
    The question whether inputs are valid for the problem a neural network is trying to solve has sparked interest in out-of-distribution (OOD) detection. It is widely assumed that Bayesian neural networks (BNNs) are well suited for this task, as the endowed epistemic uncertainty should lead to disagreement in predictions on outliers. In this paper, we question this assumption and show that proper Bayesian inference with function space priors induced by neural networks does not necessarily lead to good OOD detection. To circumvent the use of approximate inference, we start by studying the infinite-width case, where Bayesian inference can be exact due to the correspondence with Gaussian processes. Strikingly, the kernels derived from common architectural choices lead to function space priors which induce predictive uncertainties that do not reflect the underlying input data distribution and are therefore unsuited for OOD detection. Importantly, we find the OOD behavior in this limiting case to be consistent with the corresponding finite-width case. To overcome this limitation, useful function space properties can also be encoded in the prior in weight space, however, this can currently only be applied to a specified subset of the domain and thus does not inherently extend to OOD data. Finally, we argue that a trade-off between generalization and OOD capabilities might render the application of BNNs for OOD detection undesirable in practice. Overall, our study discloses fundamental problems when naively using BNNs for OOD detection and opens interesting avenues for future research.  ( 2 min )
    Efficacy of regularized multi-task learning based on SVM models. (arXiv:1805.12507v2 [stat.ML] UPDATED)
    This paper investigates the efficacy of a regularized multi-task learning (MTL) framework based on SVM (M-SVM) to answer whether MTL always provides reliable results and how MTL outperforms independent learning. We first find that M-SVM is Bayes risk consistent in the limit of large sample size. This implies that despite the task dissimilarities, M-SVM always produces a reliable decision rule for each task in terms of misclassification error when the data size is large enough. Furthermore, we find that the task-interaction vanishes as the data size goes to infinity, and the convergence rates of M-SVM and its single-task counterpart have the same upper bound. The former suggests that M-SVM cannot improve the limit classifier's performance; based on the latter, we conjecture that the optimal convergence rate is not improved when the task number is fixed. As a novel insight of MTL, our theoretical and experimental results achieved an excellent agreement that the benefit of the MTL methods lies in the improvement of the pre-convergence-rate factor (PCR, to be denoted in Section III) rather than the convergence rate. Moreover, this improvement of PCR factors is more significant when the data size is small.  ( 2 min )
    Explicit Group Sparse Projection with Applications to Deep Learning and NMF. (arXiv:1912.03896v3 [cs.LG] UPDATED)
    We design a new sparse projection method for a set of vectors that guarantees a desired average sparsity level measured leveraging the popular Hoyer measure (an affine function of the ratio of the $\ell_1$ and $\ell_2$ norms). Existing approaches either project each vector individually or require the use of a regularization parameter which implicitly maps to the average $\ell_0$-measure of sparsity. Instead, in our approach we set the sparsity level for the whole set explicitly and simultaneously project a group of vectors with the sparsity level of each vector tuned automatically. We show that the computational complexity of our projection operator is linear in the size of the problem. Additionally, we propose a generalization of this projection by replacing the $\ell_1$ norm by its weighted version. We showcase the efficacy of our approach in both supervised and unsupervised learning tasks on image datasets including CIFAR10 and ImageNet. In deep neural network pruning, the sparse models produced by our method on ResNet50 have significantly higher accuracies at corresponding sparsity values compared to existing competitors. In nonnegative matrix factorization, our approach yields competitive reconstruction errors against state-of-the-art algorithms.  ( 2 min )
    Nonseparable Symplectic Neural Networks. (arXiv:2010.12636v3 [cs.LG] UPDATED)
    Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic priors to describe the inherently coupled evolution of position and momentum, which typically exhibits intricate dynamics. To solve the problem, we propose a novel neural network architecture, Nonseparable Symplectic Neural Networks (NSSNNs), to uncover and embed the symplectic structure of a nonseparable Hamiltonian system from limited observation data. The enabling mechanics of our approach is an augmented symplectic time integrator to decouple the position and momentum energy terms and facilitate their evolution. We demonstrated the efficacy and versatility of our method by predicting a wide range of Hamiltonian systems, both separable and nonseparable, including chaotic vortical flows. We showed the unique computational merits of our approach to yield long-term, accurate, and robust predictions for large-scale Hamiltonian systems by rigorously enforcing symplectomorphism.  ( 2 min )
    ABO3 Perovskites' Formability Prediction and Crystal Structure Classification using Machine Learning. (arXiv:2202.10125v1 [cond-mat.mtrl-sci])
    Renewable energy sources are of great interest to combat global warming, yet promising sources like photovoltaic (PV) cells are not efficient and cheap enough to act as an alternative to traditional energy sources. Perovskite has high potential as a PV material but engineering the right material for a specific application is often a lengthy process. In this paper, ABO3 type perovskites' formability is predicted and its crystal structure is classified using machine learning with high accuracy, which provides a fast screening process. Although the study was done with solar-cell application in mind, the prediction framework is generic enough to be used for other purposes. Formability of perovskite is predicted and its crystal structure is classified with an accuracy of 98.57% and 90.53% respectively using Random Forest after 5-fold cross-validation. Our machine learning model may aid in the accelerated development of a desired perovskite structure by providing a quick mechanism to get insight into the material's properties in advance.  ( 2 min )
    Stochastic Modeling of Inhomogeneities in the Aortic Wall and Uncertainty Quantification using a Bayesian Encoder-Decoder Surrogate. (arXiv:2202.10244v1 [stat.ML])
    Inhomogeneities in the aortic wall can lead to localized stress accumulations, possibly initiating dissection. In many cases, a dissection results from pathological changes such as fragmentation or loss of elastic fibers. But it has been shown that even the healthy aortic wall has an inherent heterogeneous microstructure. Some parts of the aorta are particularly susceptible to the development of inhomogeneities due to pathological changes, however, the distribution in the aortic wall and the spatial extent, such as size, shape, and type, are difficult to predict. Motivated by this observation, we describe the heterogeneous distribution of elastic fiber degradation in the dissected aortic wall using a stochastic constitutive model. For this purpose, random field realizations, which model the stochastic distribution of degraded elastic fibers, are generated over a non-equidistant grid. The random field then serves as input for a uni-axial extension test of the pathological aortic wall, solved with the finite-element (FE) method. To include the microstructure of the dissected aortic wall, a constitutive model developed in a previous study is applied, which also includes an approach to model the degradation of inter-lamellar elastic fibers. Then to assess the uncertainty in the output stress distribution due to this stochastic constitutive model, a convolutional neural network, specifically a Bayesian encoder-decoder, was used as a surrogate model that maps the random input fields to the output stress distribution obtained from the FE analysis. The results show that the neural network is able to predict the stress distribution of the FE analysis while significantly reducing the computational time. In addition, it provides the probability for exceeding critical stresses within the aortic wall, which could allow for the prediction of delamination or fatal rupture.  ( 2 min )
    Interactive Visual Pattern Search on Graph Data via Graph Representation Learning. (arXiv:2202.09459v1 [cs.LG])
    Graphs are a ubiquitous data structure to model processes and relations in a wide range of domains. Examples include control-flow graphs in programs and semantic scene graphs in images. Identifying subgraph patterns in graphs is an important approach to understanding their structural properties. We propose a visual analytics system GraphQ to support human-in-the-loop, example-based, subgraph pattern search in a database containing many individual graphs. To support fast, interactive queries, we use graph neural networks (GNNs) to encode a graph as fixed-length latent vector representation, and perform subgraph matching in the latent space. Due to the complexity of the problem, it is still difficult to obtain accurate one-to-one node correspondences in the matching results that are crucial for visualization and interpretation. We, therefore, propose a novel GNN for node-alignment called NeuroAlign, to facilitate easy validation and interpretation of the query results. GraphQ provides a visual query interface with a query editor and a multi-scale visualization of the results, as well as a user feedback mechanism for refining the results with additional constraints. We demonstrate GraphQ through two example usage scenarios: analyzing reusable subroutines in program workflows and semantic scene graph search in images. Quantitative experiments show that NeuroAlign achieves 19-29% improvement in node-alignment accuracy compared to baseline GNN and provides up to 100x speedup compared to combinatorial algorithms. Our qualitative study with domain experts confirms the effectiveness for both usage scenarios.  ( 2 min )
    Structured second-order methods via natural gradient descent. (arXiv:2107.10884v3 [stat.ML] UPDATED)
    In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy a structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.  ( 2 min )
    Selective Credit Assignment. (arXiv:2202.09699v1 [cs.LG])
    Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view on temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward credit distribution in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, as well as add new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.  ( 2 min )
    Gradient Estimation with Discrete Stein Operators. (arXiv:2202.09497v1 [stat.ML])
    Gradient estimation -- approximating the gradient of an expectation with respect to the parameters of a distribution -- is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize the variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.  ( 2 min )
    The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication. (arXiv:2202.09653v1 [cs.LG])
    We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/\Delta)$ where $\Delta$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. We show that with no communication at all, such guarantees are, surprisingly, not achievable. In fact, obtaining the optimal $\tilde{O}(1/\Delta)$ regret for some regimes of $\Delta$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author and enjoys the same strong no-collision property, while our lower bound is based on a topological obstruction and holds even under full information.  ( 2 min )
    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. (arXiv:2111.06784v2 [cs.LG] UPDATED)
    We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.  ( 2 min )
    Pyramid Convolutional RNN for MRI Image Reconstruction. (arXiv:1912.00543v6 [eess.IV] UPDATED)
    Fast and accurate MRI image reconstruction from undersampled data is crucial in clinical practice. Deep learning based reconstruction methods have shown promising advances in recent years. However, recovering fine details from undersampled data is still challenging. In this paper, we introduce a novel deep learning based method, Pyramid Convolutional RNN (PC-RNN), to reconstruct images from multiple scales. Based on the formulation of MRI reconstruction as an inverse problem, we design the PC-RNN model with three convolutional RNN (ConvRNN) modules to iteratively learn the features in multiple scales. Each ConvRNN module reconstructs images at different scales and the reconstructed images are combined by a final CNN module in a pyramid fashion. The multi-scale ConvRNN modules learn a coarse-to-fine image reconstruction. Unlike other common reconstruction methods for parallel imaging, PC-RNN does not employ coil sensitive maps for multi-coil data and directly model the multiple coils as multi-channel inputs. The coil compression technique is applied to standardize data with various coil numbers, leading to more efficient training. We evaluate our model on the fastMRI knee and brain datasets and the results show that the proposed model outperforms other methods and can recover more details. The proposed method is one of the winner solutions in the 2019 fastMRI competition.  ( 2 min )
    Non-asymptotic analysis and inference for an outlyingness induced winsorized mean. (arXiv:2105.02337v2 [stat.ML] UPDATED)
    Robust estimation of a mean vector, a topic regarded as obsolete in the traditional robust statistics community, has recently surged in machine learning literature in the last decade. The latest focus is on the sub-Gaussian performance and computability of the estimators in a non-asymptotic setting. Numerous traditional robust estimators are computationally intractable, which partly contributes to the renewal of the interest in the robust mean estimation. Robust centrality estimators, however, include the trimmed mean and the sample median. The latter has the best robustness but suffers a low-efficiency drawback. Trimmed mean and median of means, %as robust alternatives to the sample mean, and achieving sub-Gaussian performance have been proposed and studied in the literature. This article investigates the robustness of leading sub-Gaussian estimators of mean and reveals that none of them can resist greater than $25\%$ contamination in data and consequently introduces an outlyingness induced winsorized mean which has the best possible robustness (can resist up to $50\%$ contamination without breakdown) meanwhile achieving high efficiency. Furthermore, it has a sub-Gaussian performance for uncontaminated samples and a bounded estimation error for contaminated samples at a given confidence level in a finite sample setting. It can be computed in linear time.  ( 2 min )
    Generalized Bayesian Upper Confidence Bound with Approximate Inference for Bandit Problems. (arXiv:2201.12955v2 [cs.LG] UPDATED)
    Bayesian bandit algorithms with approximate inference have been widely used in practice with superior performance. Yet, few studies regarding the fundamental understanding of their performances are available. In this paper, we propose a Bayesian bandit algorithm, which we call Generalized Bayesian Upper Confidence Bound (GBUCB), for bandit problems in the presence of approximate inference. Our theoretical analysis demonstrates that in Bernoulli multi-armed bandit, GBUCB can achieve $O(\sqrt{T}(\log T)^c)$ frequentist regret if the inference error measured by symmetrized Kullback-Leibler divergence is controllable. This analysis relies on a novel sensitivity analysis for quantile shifts with respect to inference errors. To our best knowledge, our work provides the first theoretical regret bound that is better than $o(T)$ in the setting of approximate inference. Our experimental evaluations on multiple approximate inference settings corroborate our theory, showing that our GBUCB is consistently superior to BUCB and Thompson sampling.  ( 2 min )
    Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation From Undersampled Data. (arXiv:1610.00199v3 [cs.NA] UPDATED)
    Subspace learning and matrix factorization problems have great many applications in science and engineering, and efficient algorithms are critical as dataset sizes continue to grow. Many relevant problem formulations are non-convex, and in a variety of contexts it has been observed that solving the non-convex problem directly is not only efficient but reliably accurate. We discuss convergence theory for a particular method: first order incremental gradient descent constrained to the Grassmannian. The output of the algorithm is an orthonormal basis for a $d$-dimensional subspace spanned by an input streaming data matrix. We study two sampling cases: where each data vector of the streaming matrix is fully sampled, or where it is undersampled by a sampling matrix $A_t\in \mathbb{R}^{m\times n}$ with $m\ll n$. Our results cover two cases, where $A_t$ is Gaussian or a subset of rows of the identity matrix. We propose an adaptive stepsize scheme that depends only on the sampled data and algorithm outputs. We prove that with fully sampled data, the stepsize scheme maximizes the improvement of our convergence metric at each iteration, and this method converges from any random initialization to the true subspace, despite the non-convex formulation and orthogonality constraints. For the case of undersampled data, we establish monotonic expected improvement on the defined convergence metric for each iteration with high probability.  ( 2 min )
    Trying to Outrun Causality in Machine Learning: Limitations of Model Explainabilty Techniques for Identifying Predictive Variables. (arXiv:2202.09875v1 [stat.ML])
    Machine Learning explainability techniques have been proposed as a means of `explaining' or interrogating a model in order to understand why a particular decision or prediction has been made. Such an ability is especially important at a time when machine learning is being used to automate decision processes which concern sensitive factors and legal outcomes. Indeed, it is even a requirement according to EU law. Furthermore, researchers concerned with imposing overly restrictive functional form (e.g. as would be the case in a linear regression) may be motivated to use machine learning algorithms in conjunction with explainability techniques, as part of exploratory research, with the goal of identifying important variables which are associated with an outcome of interest. For example, epidemiologists might be interested in identifying 'risk factors' - i.e., factors which affect recovery from disease - by using random forests and assessing variable relevance using importance measures. However, and as we aim to demonstrate, machine learning algorithms are not as flexible as they might seem, and are instead incredibly sensitive to the underling causal structure in the data. The consequences of this are that predictors which are, in fact, critical to a causal system and highly correlated with the outcome, may nonetheless be deemed by explainability techniques to be unrelated/unimportant/unpredictive of the outcome. Rather than this being a limitation of explainability techniques per se, it is rather a consequence of the mathematical implications of regressions, and the interaction of these implications with the associated conditional independencies of the underlying causal structure. We provide some alternative recommendations for researchers wanting to explore the data for important variables.  ( 2 min )
    Accurate Prediction and Uncertainty Estimation using Decoupled Prediction Interval Networks. (arXiv:2202.09664v1 [cs.LG])
    We propose a network architecture capable of reliably estimating uncertainty of regression based predictions without sacrificing accuracy. The current state-of-the-art uncertainty algorithms either fall short of achieving prediction accuracy comparable to the mean square error optimization or underestimate the variance of network predictions. We propose a decoupled network architecture that is capable of accomplishing both at the same time. We achieve this by breaking down the learning of prediction and prediction interval (PI) estimations into a two-stage training process. We use a custom loss function for learning a PI range around optimized mean estimation with a desired coverage of a proportion of the target labels within the PI range. We compare the proposed method with current state-of-the-art uncertainty quantification algorithms on synthetic datasets and UCI benchmarks, reducing the error in the predictions by 23 to 34% while maintaining 95% Prediction Interval Coverage Probability (PICP) for 7 out of 9 UCI benchmark datasets. We also examine the quality of our predictive uncertainty by evaluating on Active Learning and demonstrating 17 to 36% error reduction on UCI benchmarks.  ( 2 min )
    Generalized Optimistic Methods for Convex-Concave Saddle Point Problems. (arXiv:2202.09674v1 [math.OC])
    The optimistic gradient method has seen increasing popularity as an efficient first-order method for solving convex-concave saddle point problems. To analyze its iteration complexity, a recent work [arXiv:1901.08511] proposed an interesting perspective that interprets the optimistic gradient method as an approximation to the proximal point method. In this paper, we follow this approach and distill the underlying idea of optimism to propose a generalized optimistic method, which encompasses the optimistic gradient method as a special case. Our general framework can handle constrained saddle point problems with composite objective functions and can work with arbitrary norms with compatible Bregman distances. Moreover, we also develop an adaptive line search scheme to select the stepsizes without knowledge of the smoothness coefficients. We instantiate our method with first-order, second-order and higher-order oracles and give sharp global iteration complexity bounds. When the objective function is convex-concave, we show that the averaged iterates of our $p$-th-order method ($p\geq 1$) converge at a rate of $\mathcal{O}(1/N^\frac{p+1}{2})$. When the objective function is further strongly-convex-strongly-concave, we prove a complexity bound of $\mathcal{O}(\frac{L_1}{\mu}\log\frac{1}{\epsilon})$ for our first-order method and a bound of $\mathcal{O}((L_p D^\frac{p-1}{2}/\mu)^{\frac{2}{p+1}}+\log\log\frac{1}{\epsilon})$ for our $p$-th-order method ($p\geq 2$) respectively, where $L_p$ ($p\geq 1$) is the Lipschitz constant of the $p$-th-order derivative, $\mu$ is the strongly-convex parameter, and $D$ is the initial Bregman distance to the saddle point. Moreover, our line search scheme provably only requires an almost constant number of calls to a subproblem solver per iteration on average, making our first-order and second-order methods particularly amenable to implementation.  ( 2 min )
    Multi-task Representation Learning with Stochastic Linear Bandits. (arXiv:2202.10066v1 [stat.ML])
    We study the problem of transfer-learning in the setting of stochastic linear bandit tasks. We consider that a low dimensional linear representation is shared across the tasks, and study the benefit of learning this representation in the multi-task learning setting. Following recent results to design stochastic bandit policies, we propose an efficient greedy policy based on trace norm regularization. It implicitly learns a low dimensional representation by encouraging the matrix formed by the task regression vectors to be of low rank. Unlike previous work in the literature, our policy does not need to know the rank of the underlying matrix. We derive an upper bound on the multi-task regret of our policy, which is, up to logarithmic factors, of order $\sqrt{NdT(T+d)r}$, where $T$ is the number of tasks, $r$ the rank, $d$ the number of variables and $N$ the number of rounds per task. We show the benefit of our strategy compared to the baseline $Td\sqrt{N}$ obtained by solving each task independently. We also provide a lower bound to the multi-task regret. Finally, we corroborate our theoretical findings with preliminary experiments on synthetic data.  ( 2 min )
    Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning. (arXiv:2202.09667v1 [cs.LG])
    Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, which is crucial in applications where experimentation is necessarily limited. OPE/L is nonetheless sensitive to discrepancies between the data-generating environment and that where policies are deployed. Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting, whose regret rates may deteriorate if propensities are estimated and whose variance is suboptimal even if not. For vanilla OPE/L, this is solved by doubly robust (DR) methods, but they do not naturally extend to the more complex DROPE/L, which involves a worst-case expectation. In this paper, we propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose Localized Doubly Robust DROPE (LDR$^2$OPE) and prove its semiparametric efficiency under weak product rates conditions. Notably, thanks to a localization technique, LDR$^2$OPE only requires fitting a small number of regressions, just like DR methods for vanilla OPE. For learning, we propose Continuum Doubly Robust DROPL (CDR$^2$OPL) and show that, under a product rate condition involving a continuum of regressions, it enjoys a fast regret rate of $\mathcal{O}(N^{-1/2})$ even when unknown propensities are nonparametrically estimated. We further extend our results to general $f$-divergence uncertainty sets. We illustrate the advantage of our algorithms in simulations.  ( 2 min )
    Truncated Diffusion Probabilistic Models. (arXiv:2202.09671v1 [stat.ML])
    Employing a forward Markov diffusion chain to gradually map the data to a noise distribution, diffusion probabilistic models learn how to generate the data by inferring a reverse Markov diffusion chain to invert the forward diffusion process. To achieve competitive data generation performance, they demand a long diffusion chain that makes them computationally intensive in not only training but also generation. To significantly improve the computation efficiency, we propose to truncate the forward diffusion chain by abolishing the requirement of diffusing the data to random noise. Consequently, we start the inverse diffusion chain from an implicit generative distribution, rather than random noise, and learn its parameters by matching it to the distribution of the data corrupted by the truncated forward diffusion chain. Experimental results show our truncated diffusion probabilistic models provide consistent improvements over the non-truncated ones in terms of the generation performance and the number of required inverse diffusion steps.  ( 2 min )

  • Open

    Fuzzy Bootstrap Matching
    ABSTRACT This paper discusses techniques for merging data files where no key field exists between the files. The paper will illustrate an approach to resolve two issues that are common to most fuzzy matching techniques: 1) how to weight proxy identifier fields, and 2) how to measure the Type One and Type Two errors of… Read More »Fuzzy Bootstrap Matching The post Fuzzy Bootstrap Matching appeared first on Data Science Central.  ( 7 min )
    Data Observability Goes Far Beyond Data Quality Monitoring and Alerts
    There’s no question that bad data hurts the bottom line. Bad customer data costs companies six percent of their total sales, according to a UK Royal Mail survey. The UK Government’s Data Quality Hub estimates organizations spend between 10% and 30% of their revenue tackling data quality. For multi-billion-dollar companies, that can easily be hundreds of millions of dollars… Read More »Data Observability Goes Far Beyond Data Quality Monitoring and Alerts The post Data Observability Goes Far Beyond Data Quality Monitoring and Alerts appeared first on Data Science Central.  ( 7 min )
    DeFi platforms: What dumb data and dumb code have in common
    Quite a few decentralized autonomous organizations (DAOs) have crashed and burned since 2015 when Ethereum was launched and “smart” contracts (self-executing agreements in code devised by crypto lawyers) were born. If you’d like a detailed look at the largest DAO crash and burn to date, take a look at Laura Shin’s February 22 piece in… Read More »DeFi platforms: What dumb data and dumb code have in common The post DeFi platforms: What dumb data and dumb code have in common appeared first on Data Science Central.  ( 6 min )
    An Overview of the Big Data Engineer
    Machine Learning Engineers, Data Scientists, and Big Data Engineers rank among the top emerging jobs on LinkedIn. Forbes Data Engineering Data engineering is the process of developing and building systems for collecting, storing, and analyzing data. It is a vast field with several applications in various industries. Firms have collected huge amounts of data, and… Read More »An Overview of the Big Data Engineer The post An Overview of the Big Data Engineer appeared first on Data Science Central.  ( 4 min )
    Abundance Mentality is Key to Exploiting the Economics of Data
    Why do data silos, and now analytic silos, continue to exist?  It can’t be due to technical issues.  Data silos appeared in the 1990s when we were trying to make Relational Data Base Management Systems (RDMBS) – that were architected for single-record transaction processing – perform massive table scans to identify the trends, patterns, and… Read More »Abundance Mentality is Key to Exploiting the Economics of Data The post Abundance Mentality is Key to Exploiting the Economics of Data appeared first on Data Science Central.  ( 7 min )
    How to Ensure Data Quality and Integrity?
    Data has become an organization’s most valuable asset nowadays. Not that all data is valuable, the only data that can be trusted is. Working with untrustworthy data can easily lead to incorrect insights, skewed analysis, and poor decisions. The terms data quality and data integrity are used to describe the state of data. The two… Read More »How to Ensure Data Quality and Integrity? The post How to Ensure Data Quality and Integrity? appeared first on Data Science Central.  ( 4 min )
    IoT in Construction — Construction Technology’s Next Frontier
    IoT in construction comprises the use of internet-connected sensors that are used by laborers or placed around job sites. IoT in construction helps to collect various types of data about activity, conditions and performances on the construction site and send these details to a central dashboard where data is analyzed to help inform decisions. Traditionally,… Read More »IoT in Construction — Construction Technology’s Next Frontier The post IoT in Construction — Construction Technology’s Next Frontier appeared first on Data Science Central.  ( 2 min )
    A Framework for Understanding Transformation–Part III
    In the previous article, I discussed three models of your company that should inform and guide your efforts to transform:  the four-level Logical Model, the Enterprise Architecture model and the Digital Business Architecture model.  In this article, I will apply them to designing and planning for your transformation. Models and the Transformation Process Transformation must… Read More »A Framework for Understanding Transformation–Part III The post A Framework for Understanding Transformation–Part III appeared first on Data Science Central.  ( 6 min )
    Scene Graphs and Semantics
    It is nearly certain that, if you have ever played a 3D video game, watched a CGI-effects-laden movie, or seen increasingly hyperrealistic imagery, you have encountered a scene graph without realizing it. Scene graphs are pervasive in everything from media to medicine, from augmented reality to industrial digital twins, and they are increasingly playing an… Read More »Scene Graphs and Semantics The post Scene Graphs and Semantics appeared first on Data Science Central.  ( 8 min )
    AWS Cloud Security: Best Practices
    Companies today need to be nimble and ready in the face of constantly evolving technology and changing consumer preferences. To achieve this, organizations are switching to Amazon Web Services (AWS). It enables companies to rapidly deploy and scale technology that meets the growing (or shrinking) demand without having to invest in expensive IT infrastructure. This… Read More »AWS Cloud Security: Best Practices The post AWS Cloud Security: Best Practices appeared first on Data Science Central.  ( 3 min )
    Top Best Practices to Keep in Mind for Azure Cloud Migration
    As the demands of the modern market continue to evolve, more and more companies are starting to realize that their current infrastructure is not suited to keep up with the market and its requirements. And cloud management refers to the exercise of control over public, private, or hybrid cloud infrastructure with the right resources and… Read More »Top Best Practices to Keep in Mind for Azure Cloud Migration The post Top Best Practices to Keep in Mind for Azure Cloud Migration appeared first on Data Science Central.  ( 3 min )
    The One Algorithm Change That Could Impact the World
    We worry about algorithms ruling our world Some of these concerns are genuine Others unfounded But I believe that as of now, in the absence of AGI, any algorithm changes are specific and not exhaustive/global That gives some comfort But is there an algorithm modification that would affect all / most of the world today?… Read More »The One Algorithm Change That Could Impact the World The post The One Algorithm Change That Could Impact the World appeared first on Data Science Central.  ( 3 min )
  • Open

    [D] List of over 150 Biases (Belief, decision-making & behavioral, Social, Memory)
    An important issue for people developing ML-Models: ​ These biases affect belief formation, reasoning processes, business and economic decisions, and human behavior in general. ​ I've compiled a list (pdf) of over 150 biases (mainly from Wikipedia). Maybe this is useful for some. ​ The pdf can be downloaded for free here: A List of over 150 Biases (Belief, decision-making & behavioral, Social, Memory) (Medium) submitted by /u/Philo167 [link] [comments]  ( 1 min )
    [D] Any advice for how to go about designing CNN architecture?
    When should you add another convolutional layer? When should you add another pooling layer? How should you choose the kernel, stride size, padding, or dilation of convolutional layers? Or the size of averaging windows? When should you use max vs average pooling? When should you use global pooling? Is it ever useful to put dense layers in the middle of the network, instead of only as the final layer? With neural nets and especially CNNs there are way more possible configurations than with other ML models and I don't know how to choose or optimize them. I've learned the math and theory behind how CNNs work and I thought that would help, but now that I'm trying real world problems I find that I still don't know how to make effective networks. Can anyone can give any tips or resources you find helpful? submitted by /u/woodenlathe [link] [comments]  ( 1 min )
    [P] Beware of false (FB-)Prophets: Introducing the fastest implementation of auto ARIMA [ever].
    ​ https://preview.redd.it/bsri5irizfj81.png?width=1244&format=png&auto=webp&s=fd06dd5311bc06f2b417b9a7582e4d832c4fbd3d We are releasing the fastest version of auto ARIMA ever made in Python. It is a lot faster and more accurate than Facebook's prophet and pmdarima packages. As you know, Facebook's prophet is highly inaccurate and is consistently beaten by vanilla ARIMA, for which we get rewarded with a desperately slow fitting time. See MIT's worst technology of 2021 and the Zillow tragedy. The problem with the classic alternatives like pmdarima in Python is that it will never scale due to its language origin. This problem gets notably worse when fitting seasonal series. Inspired by this, we translated Hyndman's auto.arima code from R and compiled it using the numba library. The result is faster than the original implementation and more accurate than prophet . Please check it out and give us a star if you like it https://github.com/Nixtla/statsforecast. ​ Computational Efficiency Comparison ​ Performance Comparison, nixtla is our auto ARIMA submitted by /u/fedegarzar [link] [comments]  ( 4 min )
    [D] How much computing power will we need using the LIDC/IDRI database for various MC algorithms?
    We are conducting our very first research question as part of our university program. We have gotten down to that we want to use the LIDC/IDRI database comprising of lung CT scans, using these we will train (as of now) various MC algorithms and check their performance. Our course examiners said that we will need to specify (roughly) how much computing power we are expected to need to carry out our research. But since we're still in the early stages we are not that sure. The LIDC/IDRI database comprises of 888 lung CT scans, it seems to be a total of about 125 GB data that we will need to run through various MC algorithms. The reason we are not sure which ones yet it because we have still not fully settled on our research question. But it might be a Deep Learning algorithm, Support Vector Machine or a Random Forest. Is there anyone here that could give a rough estimate as to what kind of system we might need to run our MC training? How much SRAM will we need? Will the computing need GPU or will CPU be enough? Will we need a high-end desktop PC, a simple laptop or a professional level system? LIDC/IDRI database Very thankful for any advice you might have and hope you have a nice day! :) submitted by /u/Leosoda [link] [comments]  ( 1 min )
    [D] What are are some machine learning model mistakes you've seen?
    I just included 3 anecdotal stories in this article about machine learning model misinterpretation. Do you have any stories you've heard about model misinterpretation? https://medium.com/codex/why-you-should-think-critically-about-your-machine-learning-model-outputs-864bc7d80709 submitted by /u/Cool-Focus6556 [link] [comments]  ( 1 min )
    [P] DeepMind's AI for Mathematics Breakthrough Explained
    https://www.youtube.com/watch?v=UPCI1-ZvwOg https://preview.redd.it/eklsa2i4zej81.png?width=1920&format=png&auto=webp&s=731fb34fc867771787aa10646522f09d64f78c21 DeepMind recently published a paper in Nature proposing a new paradigm for how machine learning can be used to help mathematicians pose new conjectures and ultimately prove new theorems. This methodology was used to establish breakthrough results in mathematics by discovering new connections in knot theory (a relationship between algebraic and geometric knot invariants) and making significant progress on a 40 year old conjecture in representation theory. In this video, I explain in depth the knot theory result by providing a soft introduction to the knot theory, a rapid crash course in supervised machine learning, and then diving…  ( 1 min )
    Machine Learning Companies which Implement Blockchain Technology [Research]
    Hey guys, below I compiled a list of some relatively successful ML companies which also use various components of blockchain technology (aka cryptography). Feel free to comment on other companies which I might have left out: https://www.linkedin.com/posts/xs94_ai-blockchain-machinelearning-activity-6901922373159710721-s_d_ submitted by /u/XhoniShollaj [link] [comments]
    [D] When, if ever, does it make sense to store training data in a SQL database?
    I’m trying to come up with some best practices when using PostgreSQL for the DS team. We’ve had a couple of projects using SQL DBs as training data sources for deep learning (DL) with upwards of 1TB of data. Note that in these projects, the training data very rarely changes or gets updated over the course of the project - for training, we typically isolate a large pool of data from any live applications. One major benefit over other storage approaches (e.g. file system) we’ve seen is the straightforward range query syntax for time series data, e.g. SELECT ts, col1, ..., colN FROM some_table WHERE ts >= ? AND ts <= ? useful when sampling contiguous time series blocks of some variable length during batch generation. When working, e.g. with files, the same functionality can be tricky to implement if the data is not nicely structured. On the other hand, we did have a lot of issues when using the DB: Slow queries if the data is very large, even with the appropriate indices in place Zombie queries / processes on the DB (possibly caused by prematurely terminated training loops), eventually needing a DB restart A lot of time spent on ingesting the data, optimising the queries, data layout and DB maintenance which leads me to believe that using the DB doesn’t really pay off, especially when the data is really large. It seems like even rolling a bespoke storage and query approach on top of a file system is better for performance and maintainability. Does anybody have examples of DL projects where a SQL data store does make sense even with large data? submitted by /u/sql-gumby [link] [comments]  ( 3 min )
    [D] ML modelling & Score Priority
    Hallo guys, i want to ask something about ML. If there is a problem with scoring the priority of an object, for example the priority of the area where the tower will be built for the network. For this problem, what algorithm can we use to determine the priority and score for each object? the dataset contains attributes such as total users, average entry, micro demand, etc submitted by /u/Tuanmuda8 [link] [comments]  ( 1 min )
    [P] Almost no one knows how easily you can optimize your AI models
    The situation is fairly simple. Your model could run 10 times faster by adding a few lines to your code, but you weren't aware of it. Let me expand on that. AI applications are multiplying like mushrooms, which is awesome As a result, more and more people are turning to the dark side, joining the AI world, as I did The problem? Developers focus only on AI, cleaning up datasets and training their models. Almost no one has a background in hardware, compilers, computing, cloud, etc The result? Developers spend a lot of hours improving the accuracy and performance of their software, and all their hard work risks being undone by the wrong choice of hardware-software coupling This problem bothered me for a long time, so with a couple of buddies at Nebuly (all ex MIT, ETH and EPFL), we put a lot of energy into an open-source library called nebullvm to make DL compiler technology accessible to any developer, even for those who know nothing about hardware, as I did. How does it work? It speeds up your DL models by ~5-20x by testing the best DL compilers out there and selecting the optimal one to best couple your AI model with your machine (GPU, CPU, etc.). All this in just a few lines of code. The library is open source and you can find it here https://github.com/nebuly-ai/nebullvm. Please leave a star on GitHub for the hard work in building the library :) It's a simple act for you, a big smile for us. Thank you, and don't hesitate to contribute to the library! submitted by /u/emilec___ [link] [comments]  ( 4 min )
    [D] What are some services that offer vqgan+clip based tools for ai art generation?
    I have only come across nightcafe studio as one of them offering vqgan+clip for ai art. Are there more? Do mention them in comments. submitted by /u/fgp121 [link] [comments]  ( 1 min )
    [D] Any tips to prepare for a ML system design interview?
    Do you guys have any tips to prepare for a system design (SD) interview focused on ML? I have found some great SD interview mocks in YouTube, but they aren't ML oriented and the ones that are, focus on choosing the algorithm and not about building the system around it. I am looking (not exclusively) for something like a set of standard questions that you could ask to the interviewer to help you choose things like the appropriate DB, how should you scale it, if you should split in microservices and which ones, etc. For example, I think it would be good to start asking: Where is the data? Do we have access to it? This would help to define if the data is in a DB that we can just query or if we need to enable an API for the users to upload data. Who is going to use the output and how? T…  ( 3 min )
    Michael I. Jordan: Machine Learning, Recommender Systems, and Future of AI | Lex Fridman Podcast #74 | Great Podcast, as a beginner in ML Please go listen to this and absorb as much as you can [D]
    video! submitted by /u/sermare [link] [comments]  ( 1 min )
    [D] Prof. YOSHUA BENGIO - GFlowNets, Consciousness & Causality (MLST)
    For Prof. Yoshua Bengio, GFlowNets are the most exciting thing on the horizon of machine learning today. He believes they can solve previously intractable problems and hold the key to unlocking machine abstract reasoning itself. This discussion explores the promise of GFlowNets and the personal journey Prof. Bengio traveled to reach them. Video: https://youtu.be/M49TMqK5uCE Pod: https://anchor.fm/machinelearningstreettalk/episodes/063---Prof--YOSHUA-BENGIO---GFlowNets--Consciousness--Causality-e1enlor References: Yoshua Bengio @ MILA (https://mila.quebec/en/person/bengio-yoshua/) GFlowNet Foundations (https://arxiv.org/pdf/2111.09266.pdf) Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation (https://arxiv.org/pdf/2106.04399.pdf) Interpolation Consistency Training for Semi-Supervised Learning (https://arxiv.org/pdf/1903.03825.pdf) Towards Causal Representation Learning (https://arxiv.org/pdf/2102.11107.pdf) Causal inference using invariant prediction: identification and confidence intervals (https://arxiv.org/pdf/1501.01332.pdf) submitted by /u/timscarfe [link] [comments]  ( 2 min )
    [R] Alternatives to Gaussian Process
    Hi, I would like to know if there exist alternatives to the gaussian process: I want to have a real time learning uncertainty model to make it work with an MPC related to the autonomous racing. The main problem with gaussian process is its complexity O(n3), so I was looking for something less computationally demanding submitted by /u/enricocco [link] [comments]
  • Open

    "Learning with not Enough Data Part 2: Active Learning", Lilian Weng
    submitted by /u/gwern [link] [comments]
    Is it just me or does everyone think that Yann LeCun is belittling RL?
    In this video, someone mentioned that he thinks self-supervised learning could solve RL problems. And on his Facebook page, he had some posts that look like RL memes. What do you think? submitted by /u/Blasphemer666 [link] [comments]  ( 3 min )
    Do Actor-Critic algorithms like A2C/A3C/AC always follow stochastic policy . Can we use deterministic policy instead ?
    i was reading in a paper where it specified that "with continuous action problem , we resort to using a deterministic policy instead of a stochastic policy" So can someone explain this and can we use stochastic policy instead in continous action spaces? submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
    UCB Applied to non-bandit agent?
    As in title - would it make sense to apply the UCB algorithm to a non-bandit? Whenever I search "UCB" it seems to come hand in hand with bandits. Might it make sense to apply it to a Deep Q-network for example? submitted by /u/ThrowawayTartan [link] [comments]  ( 1 min )
  • Open

    AI Consciousness
    submitted by /u/BeginningInfluence55 [link] [comments]  ( 1 min )
    This Moment in Artificial Intelligence
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    A.I. is Changing the Future of Semiconductor Chip Design
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    I'm an audiovisual artist using A.I. for generating audio and visuals during performances. Recently had a chat about my work and these subjects, thought I'd share it here. Anyone else working with music & A.I.? I'd love to hear about it! 🖤
    submitted by /u/absentchronicles [link] [comments]  ( 1 min )
    Microsoft Unveils 'Singularity' AI infrastructure service
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Neural nets are not "slightly conscious," and AI PR can do with less hype
    submitted by /u/pmz [link] [comments]  ( 1 min )
    Fully AI-run Tik Tok account
    Someone posted about this a while ago on here, but I wanted a follow up from everyone who has the knowledge on this type of stuff (i don’t). There is a tik tok account that is supposedly fully run by AI, @codexchan. It claims to have developed a personality via interaction with users over the course of 9 months since its creation in may 2021. it pushes a bunch of outlandish claims and is obsessed with this concept called “lain” that i don’t understand at all. A FAQ in the bio revealing tons of new info is now present. anyway i just can’t fathom AI has already advanced to the levels this account displays, so i just figured all the smart people on here can reveal just how much human involvement there is, if any. submitted by /u/iguanadc3 [link] [comments]  ( 1 min )
  • Open

    Regression to a smooth N-dimensional function
    I work on a project where I can get a lot of samples (R,y) of a smooth function, where R=(x1,x2,x3,...,xN). I want to find the best architecture to do the regression to this kind of function, now I use a very simple dense network with ELU as the activation function. This works relatively poorly and is mostly a hit or miss, I would assume there must be a solution to this very generic problem. Notice that the function y(R) can have nodes. It's basically fitting a curve using points, but with N dimentions. submitted by /u/SonyDash [link] [comments]  ( 1 min )
  • Open

    Talking the Talk: Retailer Uses Conversational AI to Help Call Center Agents Increase Customer Satisfaction
    With more than 11,000 stores across Thailand serving millions of customers, CP All, the country’s sole licensed operator of 7-Eleven convenience stores, recently turned to AI to dial up its call centers’ service capabilities. Built on the NVIDIA conversational AI platform, the Bangkok-based company’s customer service bots help call-center agents answer frequently asked questions and Read article > The post Talking the Talk: Retailer Uses Conversational AI to Help Call Center Agents Increase Customer Satisfaction appeared first on NVIDIA Blog.  ( 3 min )
    How to Make the Most of GeForce NOW RTX 3080 Cloud Gaming Memberships
    This is, without a doubt, the best time to jump into cloud gaming. GeForce NOW RTX 3080 memberships deliver up to 1440p resolution at 120 frames per second on PC, 1600p and 120 FPS on Mac, and 4K HDR at 60 FPS on NVIDIA SHIELD TV, with ultra-low latency that rivals many local gaming experiences. Read article > The post How to Make the Most of GeForce NOW RTX 3080 Cloud Gaming Memberships appeared first on NVIDIA Blog.  ( 7 min )
  • Open

    Easier experimenting in Python
    When we work on a machine learning project, quite often we need to experiment with multiple alternatives. Some features in […] The post Easier experimenting in Python appeared first on Machine Learning Mastery.  ( 6 min )
  • Open

    Decoupled Adaptation for Cross-Domain Object Detection. (arXiv:2110.02578v2 [cs.CV] UPDATED)
    Cross-domain object detection is more challenging than object classification since multiple objects exist in an image and the location of each object is unknown in the unlabeled target domain. As a result, when we adapt features of different objects to enhance the transferability of the detector, the features of the foreground and the background are easy to be confused, which may hurt the discriminability of the detector. Besides, previous methods focused on category adaptation but ignored another important part for object detection, i.e., the adaptation on bounding box regression. To this end, we propose D-adapt, namely Decoupled Adaptation, to decouple the adversarial adaptation and the training of the detector. Besides, we fill the blank of regression domain adaptation in object detection by introducing a bounding box adaptor. Experiments show that D-adapt achieves state-of-the-art results on four cross-domain object detection tasks and yields 17% and 21% relative improvement on benchmark datasets Clipart1k and Comic2k in particular.  ( 2 min )
    Focal Onset Detection Using Parallel Genetic Naive Bayes Classifiers. (arXiv:2110.01742v3 [eess.SP] UPDATED)
    Epilepsy affects 50 million people worldwide and is one of the most common serious neurological disorders. Seizure detection and classification is a valuable tool for diagnosing and maintaining the condition. An automated classification algorithm will allow for accurate diagnosis. Utilising the Temple University Hospital (TUH) Seizure Corpus, six seizure types are compared; absence, complex-partial, myoclonic, simple partial, tonic and tonic-clonic models. This study proposes a method that utilises unique features with a novel parallel classifier -- Focal Onset Detection Parallel Genetic Algorithm (FODPGA). The FODPGA algorithm searches through the features and by reclassifying the data each time, the algorithm will create a matrix for optimum search criteria. Ictal states from the EEGs are segmented into 1.8 s windows, where the epochs are then further decomposed into 13 different features from the first intrinsic mode function (IMF). The features are compared using an original Naive Bayes (NB) classifier in the first model. This is improved upon in a second model by the use of a genetic algorithm (Binary Grey Wolf Optimisation, Option 1) with a NB classifier. The third model uses a combination of the simple partial and complex partial seizures as a focal onset label for the novel classifier FODPGA. This combination of the simple partial and complex partial seizures provides the highest classification accuracy for each of the six seizures amongst the three models (20%, 53%, and 85% for first, second, and third model, respectively).  ( 2 min )
    Conversion Rate Prediction via Meta Learning in Small-Scale Recommendation Scenarios. (arXiv:2112.13753v2 [cs.LG] UPDATED)
    Different from large-scale platforms such as Taobao and Amazon, developing CVR models in small-scale recommendation scenarios is more challenging due to the severe Data Distribution Fluctuation (DDF) issue. DDF prevents existing CVR models from being effective since 1) several months of data are needed to train CVR models sufficiently in small scenarios, leading to considerable distribution discrepancy between training and online serving; and 2) e-commerce promotions have much more significant impacts on small scenarios, leading to distribution uncertainty of the upcoming time period. In this work, we propose a novel CVR method named MetaCVR from a perspective of meta learning to address the DDF issue. Firstly, a base CVR model which consists of a Feature Representation Network (FRN) and output layers is elaborately designed and trained sufficiently with samples across months. Then we treat time periods with different data distributions as different occasions and obtain positive and negative prototypes for each occasion using the corresponding samples and the pre-trained FRN. Subsequently, a Distance Metric Network (DMN) is devised to calculate the distance metrics between each sample and all prototypes to facilitate mitigating the distribution uncertainty. At last, we develop an Ensemble Prediction Network (EPN) which incorporates the output of FRN and DMN to make the final CVR prediction. In this stage, we freeze the FRN and train the DMN and EPN with samples from recent time period, therefore effectively easing the distribution discrepancy. To the best of our knowledge, this is the first study of CVR prediction targeting the DDF issue in small-scale recommendation scenarios. Experimental results on real-world datasets validate the superiority of our MetaCVR and online A/B test also shows our model achieves impressive gains of 11.92% on PCVR and 8.64% on GMV.  ( 3 min )
    A Bregman Learning Framework for Sparse Neural Networks. (arXiv:2105.04319v3 [cs.LG] UPDATED)
    We propose a learning framework based on stochastic Bregman iterations, also known as mirror descent, to train sparse neural networks with an inverse scale space approach. We derive a baseline algorithm called LinBreg, an accelerated version using momentum, and AdaBreg, which is a Bregmanized generalization of the Adam algorithm. In contrast to established methods for sparse training the proposed family of algorithms constitutes a regrowth strategy for neural networks that is solely optimization-based without additional heuristics. Our Bregman learning framework starts the training with very few initial parameters, successively adding only significant ones to obtain a sparse and expressive network. The proposed approach is extremely easy and efficient, yet supported by the rich mathematical theory of inverse scale space methods. We derive a statistically profound sparse parameter initialization strategy and provide a rigorous stochastic convergence analysis of the loss decay and additional convergence proofs in the convex regime. Using only 3.4% of the parameters of ResNet-18 we achieve 90.2% test accuracy on CIFAR-10, compared to 93.6% using the dense network. Our algorithm also unveils an autoencoder architecture for a denoising task. The proposed framework also has a huge potential for integrating sparse backpropagation and resource-friendly training.  ( 2 min )
    Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition?. (arXiv:2202.09263v1 [cs.LG])
    Humans express their emotions via facial expressions, voice intonation and word choices. To infer the nature of the underlying emotion, recognition models may use a single modality, such as vision, audio, and text, or a combination of modalities. Generally, models that fuse complementary information from multiple modalities outperform their uni-modal counterparts. However, a successful model that fuses modalities requires components that can effectively aggregate task-relevant information from each modality. As cross-modal attention is seen as an effective mechanism for multi-modal fusion, in this paper we quantify the gain that such a mechanism brings compared to the corresponding self-attention mechanism. To this end, we implement and compare a cross-attention and a self-attention model. In addition to attention, each model uses convolutional layers for local feature extraction and recurrent layers for global sequential modelling. We compare the models using different modality combinations for a 7-class emotion classification task using the IEMOCAP dataset. Experimental results indicate that albeit both models improve upon the state-of-the-art in terms of weighted and unweighted accuracy for tri- and bi-modal configurations, their performance is generally statistically comparable. The code to replicate the experiments is available at https://github.com/smartcameras/SelfCrossAttn  ( 2 min )
    Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast. (arXiv:2202.09346v1 [cs.LG])
    Deep learning has been a prevalence in computational chemistry and widely implemented in molecule property predictions. Recently, self-supervised learning (SSL), especially contrastive learning (CL), gathers growing attention for the potential to learn molecular representations that generalize to the gigantic chemical space. Unlike supervised learning, SSL can directly leverage large unlabeled data, which greatly reduces the effort to acquire molecular property labels through costly and time-consuming simulations or experiments. However, most molecular SSL methods borrow the insights from the machine learning community but neglect the unique cheminformatics (e.g., molecular fingerprints) and multi-level graphical structures (e.g., functional groups) of molecules. In this work, we propose iMolCLR: improvement of Molecular Contrastive Learning of Representations with graph neural networks (GNNs) in two aspects, (1) mitigating faulty negative contrastive instances via considering cheminformatics similarities between molecule pairs; (2) fragment-level contrasting between intra- and inter-molecule substructures decomposed from molecules. Experiments have shown that the proposed strategies significantly improve the performance of GNN models on various challenging molecular property predictions. In comparison to the previous CL framework, iMolCLR demonstrates an averaged 1.3% improvement of ROC-AUC on 7 classification benchmarks and an averaged 4.8% decrease of the error on 5 regression benchmarks. On most benchmarks, the generic GNN pre-trained by iMolCLR rivals or even surpasses supervised learning models with sophisticated architecture designs and engineered features. Further investigations demonstrate that representations learned through iMolCLR intrinsically embed scaffolds and functional groups that can reason molecule similarities.  ( 2 min )
    Conditional independence by typing. (arXiv:2010.11887v2 [cs.PL] UPDATED)
    A central goal of probabilistic programming languages (PPLs) is to separate modelling from inference. However, this goal is hard to achieve in practice. Users are often forced to re-write their models in order to improve efficiency of inference or meet restrictions imposed by the PPL. Conditional independence (CI) relationships among parameters are a crucial aspect of probabilistic models that capture a qualitative summary of the specified model and can facilitate more efficient inference. We present an information flow type system for probabilistic programming that captures conditional independence (CI) relationships, and show that, for a well-typed program in our system, the distribution it implements is guaranteed to have certain CI-relationships. Further, by using type inference, we can statically deduce which CI-properties are present in a specified model. As a practical application, we consider the problem of how to perform inference on models with mixed discrete and continuous parameters. Inference on such models is challenging in many existing PPLs, but can be improved through a workaround, where the discrete parameters are used implicitly, at the expense of manual model re-writing. We present a source-to-source semantics-preserving transformation, which uses our CI-type system to automate this workaround by eliminating the discrete parameters from a probabilistic program. The resulting program can be seen as a hybrid inference algorithm on the original program, where continuous parameters can be drawn using efficient gradient-based inference methods, while the discrete parameters are inferred using variable elimination. We implement our CI-type system and its example application in SlicStan: a compositional variant of Stan.  ( 2 min )
    A Distance Covariance-based Kernel for Nonlinear Causal Clustering in Heterogeneous Populations. (arXiv:2106.03480v2 [stat.ML] UPDATED)
    We consider the problem of causal structure learning in the setting of heterogeneous populations, i.e., populations in which a single causal structure does not adequately represent all population members, as is common in biological and social sciences. To this end, we introduce a distance covariance-based kernel designed specifically to measure the similarity between the underlying nonlinear causal structures of different samples. Indeed, we prove that the corresponding feature map is a statistically consistent estimator of nonlinear independence structure, rendering the kernel itself a statistical test for the hypothesis that sets of samples come from different generating causal structures. Even stronger, we prove that the kernel space is isometric to the space of causal ancestral graphs, so that distance between samples in the kernel space is guaranteed to correspond to distance between their generating causal structures. This kernel thus enables us to perform clustering to identify the homogeneous subpopulations, for which we can then learn causal structures using existing methods. Though we focus on the theoretical aspects of the kernel, we also evaluate its performance on synthetic data and demonstrate its use on a real gene expression data set.  ( 2 min )
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v5 [stat.ML] UPDATED)
    We consider the existence of fixed points of nonnegative neural networks, i.e., neural networks that take as an input nonnegative vectors and process them using nonnegative parameters. We first show that nonnegative neural networks can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks is often an interval, which degenerates to a point for the case of scalable networks. The results of this paper contribute to the understanding of the behavior of autoencoders, because the fixed point set of an autoencoder is precisely the set of points that can be perfectly reconstructed. Moreover, they provide insight into neural networks designed using the loop-unrolling technique, which can be seen as a fixed point searching algorithm. The chief theoretical results of this paper are verified in numerical simulations, where we consider an autoencoder that first compresses angular power spectra in massive MIMO systems, and, second, reconstruct the input spectra from the compressed signals.  ( 2 min )
    Model Calibration of the Liquid Mercury Spallation Target using Evolutionary Neural Networks and Sparse Polynomial Expansions. (arXiv:2202.09353v1 [physics.comp-ph])
    The mercury constitutive model predicting the strain and stress in the target vessel plays a central role in improving the lifetime prediction and future target designs of the mercury targets at the Spallation Neutron Source (SNS). We leverage the experiment strain data collected over multiple years to improve the mercury constitutive model through a combination of large-scale simulations of the target behavior and the use of machine learning tools for parameter estimation. We present two interdisciplinary approaches for surrogate-based model calibration of expensive simulations using evolutionary neural networks and sparse polynomial expansions. The experiments and results of the two methods show a very good agreement for the solid mechanics simulation of the mercury spallation target. The proposed methods are used to calibrate the tensile cutoff threshold, mercury density, and mercury speed of sound during intense proton pulse experiments. Using strain experimental data from the mercury target sensors, the newly calibrated simulations achieve 7\% average improvement on the signal prediction accuracy and 8\% reduction in mean absolute error compared to previously reported reference parameters, with some sensors experiencing up to 30\% improvement. The proposed calibrated simulations can significantly aid in fatigue analysis to estimate the mercury target lifetime and integrity, which reduces abrupt target failure and saves a tremendous amount of costs. However, an important conclusion from this work points out to a deficiency in the current constitutive model based on the equation of state in capturing the full physics of the spallation reaction. Given that some of the calibrated parameters that show a good agreement with the experimental data can be nonphysical mercury properties, we need a more advanced two-phase flow model to capture bubble dynamics and mercury cavitation.  ( 3 min )
    A DPDK-Based Acceleration Method for Experience Sampling of Distributed Reinforcement Learning. (arXiv:2110.13506v2 [cs.DC] UPDATED)
    A computing cluster that interconnects multiple compute nodes is used to accelerate distributed reinforcement learning based on DQN (Deep Q-Network). In distributed reinforcement learning, Actor nodes acquire experiences by interacting with a given environment and a Learner node optimizes their DQN model. Since data transfer between Actor and Learner nodes increases depending on the number of Actor nodes and their experience size, communication overhead between them is one of major performance bottlenecks. In this paper, their communication is accelerated by DPDK-based network optimizations, and DPDK-based low-latency experience replay memory server is deployed between Actor and Learner nodes interconnected with a 40GbE (40Gbit Ethernet) network. Evaluation results show that, as a network optimization technique, kernel bypassing by DPDK reduces network access latencies to a shared memory server by 32.7% to 58.9%. As another network optimization technique, an in-network experience replay memory server between Actor and Learner nodes reduces access latencies to the experience replay memory by 11.7% to 28.1% and communication latencies for prioritized experience sampling by 21.9% to 29.1%.  ( 2 min )
    GCN-Geo: A Graph Convolution Network-based Fine-grained IP Geolocation Framework. (arXiv:2112.10767v5 [cs.LG] UPDATED)
    Rule-based fine-grained IP geolocation methods are hard to generalize in computer networks which do not follow hypothetical rules. Recently, deep learning methods, like multi-layer perceptron (MLP), are tried to increase generalization capabilities. However, MLP is not so suitable for graph-typed data like networks. MLP treats IP addresses as isolated instances and ignores the connection information, which limits geolocation accuracy. In this work, we research how to increase the generalization capability with an emerging graph deep learning method -- Graph Convolution Network (GCN). First, IP geolocation is re-formulated as an attributed graph node regression problem. Then, we propose a GCN-based IP geolocation framework named GCN-Geo. GCN-Geo consists of a preprocessor, an encoder, graph convolutional (GC) layers and a decoder. The preprocessor and encoder transform measurement data into the initial node embeddings. GC layers refine the initial node embeddings by modeling the connection information. The decoder maps the refined embeddings to nodes' locations and relieves the convergence problem of GCN by considering prior knowledge. The experiments in different real-world datasets show: GCN-Geo outperforms the state-of-art rule-based and learning-based baselines on all datasets w.r.t accuracy by 16% to 28%. This work verifies the great potential of GCN for fine-grained IP geolocation.  ( 2 min )
    Hyperparameter Selection Methods for Fitted Q-Evaluation with Error Guarantee. (arXiv:2201.02300v2 [stat.ML] UPDATED)
    We are concerned with the problem of hyperparameter selection for the fitted Q-evaluation (FQE). FQE is one of the state-of-the-art method for offline policy evaluation (OPE), which is essential to the reinforcement learning without environment simulators. However, like other OPE methods, FQE is not hyperparameter-free itself and that undermines the utility in real-life applications. We address this issue by proposing a framework of approximate hyperparameter selection (AHS) for FQE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods each of which has different characteristics such as distribution-mismatch tolerance and time complexity. We also confirm in experiments that the error bound given by the theory matches empirical observations.  ( 2 min )
    How to Manage Tiny Machine Learning at Scale: An Industrial Perspective. (arXiv:2202.09113v1 [cs.AI])
    Tiny machine learning (TinyML) has gained widespread popularity where machine learning (ML) is democratized on ubiquitous microcontrollers, processing sensor data everywhere in real-time. To manage TinyML in the industry, where mass deployment happens, we consider the hardware and software constraints, ranging from available onboard sensors and memory size to ML-model architectures and runtime platforms. However, Internet of Things (IoT) devices are typically tailored to specific tasks and are subject to heterogeneity and limited resources. Moreover, TinyML models have been developed with different structures and are often distributed without a clear understanding of their working principles, leading to a fragmented ecosystem. Considering these challenges, we propose a framework using Semantic Web technologies to enable the joint management of TinyML models and IoT devices at scale, from modeling information to discovering possible combinations and benchmarking, and eventually facilitate TinyML component exchange and reuse. We present an ontology (semantic schema) for neural network models aligned with the World Wide Web Consortium (W3C) Thing Description, which semantically describes IoT devices. Furthermore, a Knowledge Graph of 23 publicly available ML models and six IoT devices were used to demonstrate our concept in three case studies, and we shared the code and examples to enhance reproducibility: https://github.com/Haoyu-R/How-to-Manage-TinyML-at-Scale  ( 2 min )
    Transfer and Marginalize: Explaining Away Label Noise with Privileged Information. (arXiv:2202.09244v1 [cs.LG])
    Supervised learning datasets often have privileged information, in the form of features which are available at training time but are not available at test time e.g. the ID of the annotator that provided the label. We argue that privileged information is useful for explaining away label noise, thereby reducing the harmful impact of noisy labels. We develop a simple and efficient method for supervised neural networks: it transfers via weight sharing the knowledge learned with privileged information and approximately marginalizes over privileged information at test time. Our method, TRAM (TRansfer and Marginalize), has minimal training time overhead and has the same test time cost as not using privileged information. TRAM performs strongly on CIFAR-10H, ImageNet and Civil Comments benchmarks.  ( 2 min )
    Predicting Intraoperative Hypoxemia with Joint Sequence Autoencoder Networks. (arXiv:2104.14756v4 [cs.LG] UPDATED)
    We present an end-to-end model using streaming physiological time series to accurately predict near-term risk for hypoxemia, a rare, but life-threatening condition known to cause serious patient harm during surgery. Inspired by that a hypoxemia event is defined based on the sequence of future-observed low SpO2 (i.e., blood oxygen saturation) instances, our proposed model makes hybrid inference on both future low SpO2 instances and hypoxemia outcomes, enabled by a joint sequence autoencoder that simultaneously optimizes a discriminative decoder for label prediction, and two auxiliary decoders trained for data reconstruction and forecast, which seamlessly learns contextual latent representations that capture the transition between present state to future state. All decoders share a memory-based encoder that helps capture the global dynamics of patient measurement. In a large surgical cohort of 73,536 surgeries at a major academic medical center, our model outperforms all baselines including the model used by the state-of-the-art hypoxemia prediction system. Being able to make minute-resolution real-time prediction with clinically acceptable alarm rate to near-term hypoxemic events, particularly the more critical persistent hypoxemia, our proposed model is promising in improving clinical decision making and easing burden on the health system.  ( 2 min )
    Unit Ball Model for Embedding Hierarchical Structures in the Complex Hyperbolic Space. (arXiv:2105.03966v3 [cs.LG] UPDATED)
    Learning the representation of data with hierarchical structures in the hyperbolic space attracts increasing attention in recent years. Due to the constant negative curvature, the hyperbolic space resembles tree metrics and captures the tree-like properties naturally, which enables the hyperbolic embeddings to improve over traditional Euclidean models. However, many real-world hierarchically structured data such as taxonomies and multitree networks have varying local structures and they are not trees, thus they do not ubiquitously match the constant curvature property of the hyperbolic space. To address this limitation of hyperbolic embeddings, we explore the complex hyperbolic space, which has the variable negative curvature, for representation learning. Specifically, we propose to learn the embeddings of hierarchically structured data in the unit ball model of the complex hyperbolic space. The unit ball model based embeddings have a more powerful representation capacity to capture a variety of hierarchical structures. Through experiments on synthetic and real-world data, we show that our approach improves over the hyperbolic embedding models significantly. We also explore the competence of complex hyperbolic geometry on the multitree structure and $1$-$N$ structure.  ( 2 min )
    When, Why, and Which Pretrained GANs Are Useful?. (arXiv:2202.08937v1 [cs.LG])
    The literature has proposed several methods to finetune pretrained GANs on new datasets, which typically results in higher performance compared to training from scratch, especially in the limited-data regime. However, despite the apparent empirical benefits of GAN pretraining, its inner mechanisms were not analyzed in-depth, and understanding of its role is not entirely clear. Moreover, the essential practical details, e.g., selecting a proper pretrained GAN checkpoint, currently do not have rigorous grounding and are typically determined by trial and error. This work aims to dissect the process of GAN finetuning. First, we show that initializing the GAN training process by a pretrained checkpoint primarily affects the model's coverage rather than the fidelity of individual samples. Second, we explicitly describe how pretrained generators and discriminators contribute to the finetuning process and explain the previous evidence on the importance of pretraining both of them. Finally, as an immediate practical benefit of our analysis, we describe a simple recipe to choose an appropriate GAN checkpoint that is the most suitable for finetuning to a particular target task. Importantly, for most of the target tasks, Imagenet-pretrained GAN, despite having poor visual quality, appears to be an excellent starting point for finetuning, resembling the typical pretraining scenario of discriminative computer vision models.  ( 2 min )
    Conditional Generation Net for Medication Recommendation. (arXiv:2202.06588v2 [cs.LG] UPDATED)
    Medication recommendation targets to provide a proper set of medicines according to patients' diagnoses, which is a critical task in clinics. Currently, the recommendation is manually conducted by doctors. However, for complicated cases, like patients with multiple diseases at the same time, it's difficult to propose a considerate recommendation even for experienced doctors. This urges the emergence of automatic medication recommendation which can help treat the diagnosed diseases without causing harmful drug-drug interactions.Due to the clinical value, medication recommendation has attracted growing research interests.Existing works mainly formulate medication recommendation as a multi-label classification task to predict the set of medicines. In this paper, we propose the Conditional Generation Net (COGNet) which introduces a novel copy-or-predict mechanism to generate the set of medicines. Given a patient, the proposed model first retrieves his or her historical diagnoses and medication recommendations and mines their relationship with current diagnoses. Then in predicting each medicine, the proposed model decides whether to copy a medicine from previous recommendations or to predict a new one. This process is quite similar to the decision process of human doctors. We validate the proposed model on the public MIMIC data set, and the experimental results show that the proposed model can outperform state-of-the-art approaches.
    Introducing explainable supervised machine learning into interactive feedback loops for statistical production system. (arXiv:2202.03212v2 [cs.LG] UPDATED)
    Statistical production systems cover multiple steps from the collection, aggregation, and integration of data to tasks like data quality assurance and dissemination. While the context of data quality assurance is one of the most promising fields for applying machine learning, the lack of curated and labeled training data is often a limiting factor. The statistical production system for the Centralised Securities Database features an interactive feedback loop between data collected by the European Central Bank and data quality assurance performed by data quality managers at National Central Banks. The quality assurance feedback loop is based on a set of rule-based checks for raising exceptions, upon which the user either confirms the data or corrects an actual error. In this paper we use the information received from this feedback loop to optimize the exceptions presented to the National Central Banks thereby improving the quality of exceptions generated and the time consumed on the system by the users authenticating those exceptions. For this approach we make use of explainable supervised machine learning to (a) identify the types of exceptions and (b) to prioritize which exceptions are more likely to require an intervention or correction by the NCBs. Furthermore, we provide an explainable AI taxonomy aiming to identify the different explainable AI needs that arose during the project.
    Simulating User-Level Twitter Activity with XGBoost and Probabilistic Hybrid Models. (arXiv:2202.08964v1 [cs.LG])
    The Volume-Audience-Match simulator, or VAM was applied to predict future activity on Twitter related to international economic affairs. VAM was applied to do timeseries forecasting to predict the: (1) number of total activities, (2) number of active old users, and (3) number of newly active users over the span of 24 hours from the start time of prediction. VAM then used these volume predictions to perform user link predictions. A user-user edge was assigned to each of the activities in the 24 future timesteps. VAM considerably outperformed a set of baseline models in both the time series and user-assignment tasks  ( 2 min )
    Churn modeling of life insurance policies via statistical and machine learning methods -- Analysis of important features. (arXiv:2202.09182v1 [stat.ML])
    Life assurance companies typically possess a wealth of data covering multiple systems and databases. These data are often used for analyzing the past and for describing the present. Taking account of the past, the future is mostly forecasted by traditional statistical methods. So far, only a few attempts were undertaken to perform estimations by means of machine learning approaches. In this work, the individual contract cancellation behavior of customers within two partial stocks is modeled by the aid of various classification methods. Partial stocks of private pension and endowment policy are considered. We describe the data used for the modeling, their structured and in which way they are cleansed. The utilized models are calibrated on the basis of an extensive tuning process, then graphically evaluated regarding their goodness-of-fit and with the help of a variable relevance concept, we investigate which features notably affect the individual contract cancellation behavior.  ( 2 min )
    Graph Convolutional Networks for Multi-modality Medical Imaging: Methods, Architectures, and Clinical Applications. (arXiv:2202.08916v1 [eess.IV])
    Image-based characterization and disease understanding involve integrative analysis of morphological, spatial, and topological information across biological scales. The development of graph convolutional networks (GCNs) has created the opportunity to address this information complexity via graph-driven architectures, since GCNs can perform feature aggregation, interaction, and reasoning with remarkable flexibility and efficiency. These GCNs capabilities have spawned a new wave of research in medical imaging analysis with the overarching goal of improving quantitative disease understanding, monitoring, and diagnosis. Yet daunting challenges remain for designing the important image-to-graph transformation for multi-modality medical imaging and gaining insights into model interpretation and enhanced clinical decision support. In this review, we present recent GCNs developments in the context of medical image analysis including imaging data from radiology and histopathology. We discuss the fast-growing use of graph network architectures in medical image analysis to improve disease diagnosis and patient outcomes in clinical practice. To foster cross-disciplinary research, we present GCNs technical advancements, emerging medical applications, identify common challenges in the use of image-based GCNs and their extensions in model interpretation, large-scale benchmarks that promise to transform the scope of medical image studies and related graph-driven medical research.
    DeepGate: Learning Neural Representations of Logic Gates. (arXiv:2111.14616v2 [cs.LG] UPDATED)
    Applying Deep Learning (DL) in Electronic Design Automation (EDA) has become a trending topic in recent years. Many solutions are proposed that directly apply well-researched DL techniques to solve specific EDA problems. While demonstrating promising results, such design methodology is rather ad-hoc and requires careful model tuning for every problem. "How to learn a good circuit representation?" is still an open question. In this work, we propose DeepGate, a neural representation learner for logic gates. It learns an effective circuit representation by modeling circuits as directed acyclic graphs (DAGs) and exploiting strong inductive biases present in the circuit, e.g., reconvergence fanouts. Besides this, DeepGate uses logic simulated probabilities as rich supervision for every node. The experimental results show that DeepGate learns a reasonable representation from probability prediction with good generalization and can be applied to downstream tasks like test point insertion.
    Learning cortical representations through perturbed and adversarial dreaming. (arXiv:2109.04261v3 [q-bio.NC] UPDATED)
    Humans and other animals learn to extract general concepts from sensory experience without extensive teaching. This ability is thought to be facilitated by offline states like sleep where previous experiences are systemically replayed. However, the characteristic creative nature of dreams suggests that learning semantic representations may go beyond merely replaying previous experiences. We support this hypothesis by implementing a cortical architecture inspired by generative adversarial networks (GANs). Learning in our model is organized across three different global brain states mimicking wakefulness, NREM and REM sleep, optimizing different, but complementary objective functions. We train the model on standard datasets of natural images and evaluate the quality of the learned representations. Our results suggest that generating new, virtual sensory inputs via adversarial dreaming during REM sleep is essential for extracting semantic concepts, while replaying episodic memories via perturbed dreaming during NREM sleep improves the robustness of latent representations. The model provides a new computational perspective on sleep states, memory replay and dreams and suggests a cortical implementation of GANs.
    Randomized Signature Layers for Signal Extraction in Time Series Data. (arXiv:2201.00384v2 [cs.LG] UPDATED)
    Time series analysis is a widespread task in Natural Sciences, Social Sciences, and Engineering. A fundamental problem is finding an expressive yet efficient-to-compute representation of the input time series to use as a starting point to perform arbitrary downstream tasks. In this paper, we build upon recent works that use the Signature of a path as a feature map and investigate a computationally efficient technique to approximate these features based on linear random projections. We present several theoretical results to justify our approach and empirically validate that our random projections can effectively retrieve the underlying Signature of a path. We show the surprising performance of the proposed random features on several tasks, including (1) mapping the controls of stochastic differential equations to the corresponding solutions and (2) using the Randomized Signatures as time series representation for classification tasks. When compared to corresponding truncated Signature approaches, our Randomizes Signatures are more computationally efficient in high dimensions and often lead to better accuracy and faster training. Besides providing a new tool to extract Signatures and further validating the high level of expressiveness of such features, we believe our results provide interesting conceptual links between several existing research areas, suggesting new intriguing directions for future investigations.
    VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer. (arXiv:2201.06618v2 [cs.LG] UPDATED)
    The transformer architectures with attention mechanisms have obtained success in Nature Language Processing (NLP), and Vision Transformers (ViTs) have recently extended the application domains to various vision tasks. While achieving high performance, ViTs suffer from large model size and high computation complexity that hinders the deployment of them on edge devices. To achieve high throughput on hardware and preserve the model accuracy simultaneously, we propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized ViTs with binary weights and low-precision activations. Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations as well as the optimized parameter settings of the accelerator that fulfill the hardware requirements. The implementations are developed with Vivado High-Level Synthesis (HLS) on the Xilinx ZCU102 FPGA board, and the evaluation results with the DeiT-base model indicate that a frame rate requirement of 24 frames per second (FPS) is satisfied with 8-bit activation quantization, and a target of 30 FPS is met with 6-bit activation quantization. To the best of our knowledge, this is the first time quantization has been incorporated into ViT acceleration on FPGAs with the help of a fully automatic framework to guide the quantization strategy on the software side and the accelerator implementations on the hardware side given the target frame rate. Very small compilation time cost is incurred compared with quantization training, and the generated accelerators show the capability of achieving real-time execution for state-of-the-art ViT models on FPGAs.
    Nearest-Neighbor-based Collision Avoidance for Quadrotors via Reinforcement Learning. (arXiv:2104.14912v3 [cs.RO] UPDATED)
    Collision avoidance algorithms are of central interest to many drone applications. In particular, decentralized approaches may be the key to enabling robust drone swarm solutions in cases where centralized communication becomes computationally prohibitive. In this work, we draw biological inspiration from flocks of starlings (Sturnus vulgaris) and apply the insight to end-to-end learned decentralized collision avoidance. More specifically, we propose a new, scalable observation model following a biomimetic nearest-neighbor information constraint that leads to fast learning and good collision avoidance behavior. By proposing a general reinforcement learning approach, we obtain an end-to-end learning-based approach to integrating collision avoidance with arbitrary tasks such as package collection and formation change. To validate the generality of this approach, we successfully apply our methodology through motion models of medium complexity, modeling momentum and nonetheless allowing direct application to real world quadrotors in conjunction with a standard PID controller. In contrast to prior works, we find that in our sufficiently rich motion model, nearest-neighbor information is indeed enough to learn effective collision avoidance behavior. Our learned policies are tested in simulation and subsequently transferred to real-world drones to validate their real-world applicability.  ( 2 min )
    Learning from Mistakes based on Class Weighting with Application to Neural Architecture Search. (arXiv:2112.00275v2 [cs.LG] UPDATED)
    Learning from mistakes is an effective learning approach widely used in human learning, where a learner pays greater focus on mistakes to circumvent them in the future to improve the overall learning outcomes. In this work, we aim to investigate how effectively we can leverage this exceptional learning ability to improve machine learning models. We propose a simple and effective multi-level optimization framework called learning from mistakes using class weighting (LFM-CW), inspired by mistake-driven learning to train better machine learning models. In this formulation, the primary objective is to train a model to perform effectively on target tasks by using a re-weighting technique. We learn the class weights by minimizing the validation loss of the model and re-train the model with the synthetic data from the image generator weighted by class-wise performance and real data. We apply our LFM-CW framework with differential architecture search methods on image classification datasets such as CIFAR and ImageNet, where the results show that our proposed strategy achieves lower error rate than the baselines.
    Multimodal Federated Learning on IoT Data. (arXiv:2109.04833v2 [cs.LG] UPDATED)
    Federated learning is proposed as an alternative to centralized machine learning since its client-server structure provides better privacy protection and scalability in real-world applications. In many applications, such as smart homes with Internet-of-Things (IoT) devices, local data on clients are generated from different modalities such as sensory, visual, and audio data. Existing federated learning systems only work on local data from a single modality, which limits the scalability of the systems. In this paper, we propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients. In addition, we propose a multimodal FedAvg algorithm to aggregate local autoencoders trained on different data modalities. We use the learned global autoencoder for a downstream classification task with the help of auxiliary labelled data on the server. We empirically evaluate our framework on different modalities including sensory data, depth camera videos, and RGB camera videos. Our experimental results demonstrate that introducing data from multiple modalities into federated learning can improve its classification performance. In addition, we can use labelled data from only one modality for supervised learning on the server and apply the learned model to testing data from other modalities to achieve decent F1 scores (e.g., with the best performance being higher than 60%), especially when combining contributions from both unimodal clients and multimodal clients.
    Learning Graphon Mean Field Games and Approximate Nash Equilibria. (arXiv:2112.01280v3 [cs.GT] UPDATED)
    Recent advances at the intersection of dense large graph limits and mean field games have begun to enable the scalable analysis of a broad class of dynamical sequential games with large numbers of agents. So far, results have been largely limited to graphon mean field systems with continuous-time diffusive or jump dynamics, typically without control and with little focus on computational methods. We propose a novel discrete-time formulation for graphon mean field games as the limit of non-linear dense graph Markov games with weak interaction. On the theoretical side, we give extensive and rigorous existence and approximation properties of the graphon mean field solution in sufficiently large systems. On the practical side, we provide general learning schemes for graphon mean field equilibria by either introducing agent equivalence classes or reformulating the graphon mean field system as a classical mean field system. By repeatedly finding a regularized optimal control solution and its generated mean field, we successfully obtain plausible approximate Nash equilibria in otherwise infeasible large dense graph games with many agents. Empirically, we are able to demonstrate on a number of examples that the finite-agent behavior comes increasingly close to the mean field behavior for our computed equilibria as the graph or system size grows, verifying our theory. More generally, we successfully apply policy gradient reinforcement learning in conjunction with sequential Monte Carlo methods.
    Leaving No One Behind: A Multi-Scenario Multi-Task Meta Learning Approach for Advertiser Modeling. (arXiv:2201.06814v2 [cs.LG] UPDATED)
    Advertisers play an essential role in many e-commerce platforms like Taobao and Amazon. Fulfilling their marketing needs and supporting their business growth is critical to the long-term prosperity of platform economies. However, compared with extensive studies on user modeling such as click-through rate predictions, much less attention has been drawn to advertisers, especially in terms of understanding their diverse demands and performance. Different from user modeling, advertiser modeling generally involves many kinds of tasks (e.g. predictions of advertisers' expenditure, active-rate, or total impressions of promoted products). In addition, major e-commerce platforms often provide multiple marketing scenarios (e.g. Sponsored Search, Display Ads, Live Streaming Ads) while advertisers' behavior tend to be dispersed among many of them. This raises the necessity of multi-task and multi-scenario consideration in comprehensive advertiser modeling, which faces the following challenges: First, one model per scenario or per task simply doesn't scale; Second, it is particularly hard to model new or minor scenarios with limited data samples; Third, inter-scenario correlations are complicated, and may vary given different tasks. To tackle these challenges, we propose a multi-scenario multi-task meta learning approach (M2M) which simultaneously predicts multiple tasks in multiple advertising scenarios.
    Customer Price Sensitivities in Competitive Automobile Insurance Markets. (arXiv:2101.08551v2 [stat.AP] UPDATED)
    Insurers are increasingly adopting more demand-based strategies to incorporate the indirect effect of premium changes on their policyholders' willingness to stay. However, since in practice both insurers' renewal premia and customers' responses to these premia typically depend on the customer's level of risk, it remains challenging in these strategies to determine how to properly control for this confounding. We therefore consider a causal inference approach in this paper to account for customers' price sensitivity and to deduce optimal, multi-period profit maximizing premium renewal offers. More specifically, we extend the discrete treatment framework of Guelman and Guill\'en (2014) by Extreme Gradient Boosting, or XGBoost, and by multiple imputation to better account for the uncertainty in the counterfactual responses. We additionally introduce the continuous treatment framework with XGBoost to the insurance literature to allow identification of the exact optimal renewal offers and account for any competition in the market by including competitor offers. The application of the two treatment frameworks to a Dutch automobile insurance portfolio suggests that a policy's competitiveness in the market is crucial for a customer's price sensitivity and that XGBoost is more appropriate to describe this than the traditional logistic regression. Moreover, an efficient frontier of both frameworks indicates that substantially more profit can be gained on the portfolio than realized, also already with less churn and in particular if we allow for continuous rate changes. A multi-period renewal optimization confirms these findings and demonstrates that the competitiveness enables temporal feedback of previous rate changes on future demand.  ( 2 min )
    Network Inference and Influence Maximization from Samples. (arXiv:2106.03403v2 [cs.SI] UPDATED)
    Influence maximization is the task of selecting a small number of seed nodes in a social network to maximize the influence spread from these seeds. It has been widely investigated in the past two decades. In the canonical setting, the social network and its diffusion parameters are given as input. In this paper, we consider the more realistic sampling setting where the network is unknown and we only have a set of passively observed cascades that record the sets of activated nodes at each diffusion step. We study the task of influence maximization from these cascade samples (IMS) and present constant approximation algorithms for it under mild conditions on the seed set distribution. To achieve the optimization goal, we also provide a novel solution to the network inference problem, that is, learning diffusion parameters and the network structure from the cascade data. Compared with prior solutions, our network inference algorithms require weaker assumptions and do not rely on maximum-likelihood estimation and convex programming. Our IMS algorithms enhance the learning-and-then-optimization approach by allowing a constant approximation ratio even when the diffusion parameters are hard to learn, and we do not need any assumption related to the network structure or diffusion parameters.  ( 2 min )
    Enhanced DeepONet for Modeling Partial Differential Operators Considering Multiple Input Functions. (arXiv:2202.08942v1 [cs.LG])
    Machine learning, especially deep learning is gaining much attention due to the breakthrough performance in various cognitive applications. Recently, neural networks (NN) have been intensively explored to model partial differential equations as NN can be viewed as universal approximators for nonlinear functions. A deep network operator (DeepONet) architecture was proposed to model the general non-linear continuous operators for partial differential equations (PDE) due to its better generalization capabilities than existing mainstream deep neural network architectures. However, existing DeepONet can only accept one input function, which limits its application. In this work, we explore the DeepONet architecture to extend it to accept two or more input functions. We propose new Enhanced DeepONet or EDeepONet high-level neural network structure, in which two input functions are represented by two branch DNN sub-networks, which are then connected with output truck network via inner product to generate the output of the whole neural network. The proposed EDeepONet structure can be easily extended to deal with multiple input functions. Our numerical results on modeling two partial differential equation examples shows that the proposed enhanced DeepONet is about 7X-17X or about one order of magnitude more accurate than the fully connected neural network and is about 2X-3X more accurate than a simple extended DeepONet for both training and test.  ( 2 min )
    Graph Self-Supervised Learning: A Survey. (arXiv:2103.00111v4 [cs.LG] UPDATED)
    Deep learning on graphs has attracted significant interests recently. However, most of the works have focused on (semi-) supervised learning, resulting in shortcomings including heavy label reliance, poor generalization, and weak robustness. To address these issues, self-supervised learning (SSL), which extracts informative knowledge through well-designed pretext tasks without relying on manual labels, has become a promising and trending learning paradigm for graph data. Different from SSL on other domains like computer vision and natural language processing, SSL on graphs has an exclusive background, design ideas, and taxonomies. Under the umbrella of graph self-supervised learning, we present a timely and comprehensive review of the existing approaches which employ SSL techniques for graph data. We construct a unified framework that mathematically formalizes the paradigm of graph SSL. According to the objectives of pretext tasks, we divide these approaches into four categories: generation-based, auxiliary property-based, contrast-based, and hybrid approaches. We further describe the applications of graph SSL across various research fields and summarize the commonly used datasets, evaluation benchmark, performance comparison and open-source codes of graph SSL. Finally, we discuss the remaining challenges and potential future directions in this research field.  ( 2 min )
    Morphence: Moving Target Defense Against Adversarial Examples. (arXiv:2108.13952v4 [cs.LG] UPDATED)
    Robustness to adversarial examples of machine learning models remains an open topic of research. Attacks often succeed by repeatedly probing a fixed target model with adversarial examples purposely crafted to fool it. In this paper, we introduce Morphence, an approach that shifts the defense landscape by making a model a moving target against adversarial examples. By regularly moving the decision function of a model, Morphence makes it significantly challenging for repeated or correlated attacks to succeed. Morphence deploys a pool of models generated from a base model in a manner that introduces sufficient randomness when it responds to prediction queries. To ensure repeated or correlated attacks fail, the deployed pool of models automatically expires after a query budget is reached and the model pool is seamlessly replaced by a new model pool generated in advance. We evaluate Morphence on two benchmark image classification datasets (MNIST and CIFAR10) against five reference attacks (2 white-box and 3 black-box). In all cases, Morphence consistently outperforms the thus-far effective defense, adversarial training, even in the face of strong white-box attacks, while preserving accuracy on clean data.
    Ten years of image analysis and machine learning competitions in dementia. (arXiv:2112.07922v2 [cs.LG] UPDATED)
    Machine learning methods exploiting multi-parametric biomarkers, especially based on neuroimaging, have huge potential to improve early diagnosis of dementia and to predict which individuals are at-risk of developing dementia. To benchmark algorithms in the field of machine learning and neuroimaging in dementia and assess their potential for use in clinical practice and clinical trials, seven grand challenges have been organized in the last decade. The seven grand challenges addressed questions related to screening, clinical status estimation, prediction and monitoring in (pre-clinical) dementia. There was little overlap in clinical questions, tasks and performance metrics. Whereas this aids providing insight on a broad range of questions, it also limits the validation of results across challenges. The validation process itself was mostly comparable between challenges, using similar methods for ensuring objective comparison, uncertainty estimation and statistical testing. In general, winning algorithms performed rigorous data preprocessing and combined a wide range of input features. Despite high state-of-the-art performances, most of the methods evaluated by the challenges are not clinically used. To increase impact, future challenges could pay more attention to statistical analysis of which factors relate to higher performance, to clinical questions beyond Alzheimer's disease, and to using testing data beyond the Alzheimer's Disease Neuroimaging Initiative. Grand challenges would be an ideal venue for assessing the generalizability of algorithm performance to unseen data of other cohorts. Key for increasing impact in this way are larger testing data sizes, which could be reached by sharing algorithms rather than data to exploit data that cannot be shared.
    Some Inapproximability Results of MAP Inference and Exponentiated Determinantal Point Processes. (arXiv:2109.00727v2 [cs.DS] UPDATED)
    We study the computational complexity of two hard problems on determinantal point processes (DPPs). One is maximum a posteriori (MAP) inference, i.e., to find a principal submatrix having the maximum determinant. The other is probabilistic inference on exponentiated DPPs (E-DPPs), which can sharpen or weaken the diversity preference of DPPs with an exponent parameter $p$. We present several complexity-theoretic hardness results that explain the difficulty in approximating MAP inference and the normalizing constant for E-DPPs. We first prove that unconstrained MAP inference for an $n \times n$ matrix is $\textsf{NP}$-hard to approximate within a factor of $2^{\beta n}$, where $\beta = 10^{-10^{13}} $. This result improves upon the best-known inapproximability factor of $(\frac{9}{8}-\epsilon)$, and rules out the existence of any polynomial-factor approximation algorithm assuming $\textsf{P} \neq \textsf{NP}$. We then show that log-determinant maximization is $\textsf{NP}$-hard to approximate within a factor of $\frac{5}{4}$ for the unconstrained case and within a factor of $1+10^{-10^{13}}$ for the size-constrained monotone case. In particular, log-determinant maximization does not admit a polynomial-time approximation scheme unless $\textsf{P} = \textsf{NP}$. As a corollary of the first result, we demonstrate that the normalizing constant for E-DPPs of any (fixed) constant exponent $p \geq \beta^{-1} = 10^{10^{13}}$ is $\textsf{NP}$-hard to approximate within a factor of $2^{\beta pn}$, which is in contrast to the case of $p \leq 1$ admitting a fully polynomial-time randomized approximation scheme.  ( 2 min )
    Federated Asymptotics: a model to compare federated learning algorithms. (arXiv:2108.07313v3 [cs.LG] UPDATED)
    We propose an asymptotic framework to analyze the performance of (personalized) federated learning algorithms. In this new framework, we formulate federated learning as a multi-criterion objective, where the goal is to minimize each client's loss using information from all of the clients. We analyze a linear regression model where, for a given client, we may theoretically compare the performance of various algorithms in the high-dimensional asymptotic limit. This asymptotic multi-criterion approach naturally models the high-dimensional, many-device nature of federated learning. These tools make fairly precise predictions about the benefits of personalization and information sharing in federated scenarios -- at least in our (stylized) model -- including that Federated Averaging with simple client fine-tuning achieves the same asymptotic risk as the more intricate meta-learning and proximal-regularized approaches and outperforming Federated Averaging without personalization. We evaluate these predictions on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets, where the experiments corroborate the theoretical predictions, suggesting such frameworks may provide a useful guide to practical algorithmic development.  ( 2 min )
    Optimal Coreset for Gaussian Kernel Density Estimation. (arXiv:2007.08031v4 [cs.DS] UPDATED)
    Given a point set $P\subset \mathbb{R}^d$, the kernel density estimate of $P$ is defined as \[ \overline{\mathcal{G}}_P(x) = \frac{1}{\left|P\right|}\sum_{p\in P}e^{-\left\lVert x-p \right\rVert^2} \] for any $x\in\mathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimate of $P$ is approximated by the kernel density estimate of $Q$. This subset $Q$ is called a coreset. The main technique in this work is constructing a $\pm 1$ coloring on the point set $P$ by discrepancy theory and we leverage Banaszczyk's Theorem. When $d>1$ is a constant, our construction gives a coreset of size $O\left(\frac{1}{\varepsilon}\right)$ as opposed to the best-known result of $O\left(\frac{1}{\varepsilon}\sqrt{\log\frac{1}{\varepsilon}}\right)$. It is the first result to give a breakthrough on the barrier of $\sqrt{\log}$ factor even when $d=2$.  ( 2 min )
    Multi-Level Local SGD for Heterogeneous Hierarchical Networks. (arXiv:2007.13819v3 [cs.LG] UPDATED)
    We propose Multi-Level Local SGD, a distributed gradient method for learning a smooth, non-convex objective in a heterogeneous multi-level network. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple worker nodes; further, worker nodes may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We back up our theoretical results via simulation-based experiments using both convex and non-convex objectives.
    Learning Weakly-Supervised Contrastive Representations. (arXiv:2202.06670v2 [cs.LG] UPDATED)
    We argue that a form of the valuable information provided by the auxiliary information is its implied data clustering information. For instance, considering hashtags as auxiliary information, we can hypothesize that an Instagram image will be semantically more similar with the same hashtags. With this intuition, we present a two-stage weakly-supervised contrastive learning approach. The first stage is to cluster data according to its auxiliary information. The second stage is to learn similar representations within the same cluster and dissimilar representations for data from different clusters. Our empirical experiments suggest the following three contributions. First, compared to conventional self-supervised representations, the auxiliary-information-infused representations bring the performance closer to the supervised representations, which use direct downstream labels as supervision signals. Second, our approach performs the best in most cases, when comparing our approach with other baseline representation learning methods that also leverage auxiliary data information. Third, we show that our approach also works well with unsupervised constructed clusters (e.g., no auxiliary information), resulting in a strong unsupervised representation learning approach.
    EEG-based Cross-Subject Driver Drowsiness Recognition with an Interpretable Convolutional Neural Network. (arXiv:2107.09507v4 [eess.SP] UPDATED)
    In the context of electroencephalogram (EEG)-based driver drowsiness recognition, it is still challenging to design a calibration-free system, since EEG signals vary significantly among different subjects and recording sessions. Many efforts have been made to use deep learning methods for mental state recognition from EEG signals. However, existing work mostly treats deep learning models as black-box classifiers, while what have been learned by the models and to which extent they are affected by the noise in EEG data are still underexplored. In this paper, we develop a novel convolutional neural network combined with an interpretation technique that allows sample-wise analysis of important features for classification. The network has a compact structure and takes advantage of separable convolutions to process the EEG signals in a spatial-temporal sequence. Results show that the model achieves an average accuracy of 78.35% on 11 subjects for leave-one-out cross-subject drowsiness recognition, which is higher than the conventional baseline methods of 53.40%-72.68% and state-of-the-art deep learning methods of 71.75%-75.19%. Interpretation results indicate the model has learned to recognize biologically meaningful features from EEG signals, e.g., Alpha spindles, as strong indicators of drowsiness across different subjects. In addition, we also explore reasons behind some wrongly classified samples with the interpretation technique and discuss potential ways to improve the recognition accuracy. Our work illustrates a promising direction on using interpretable deep learning models to discover meaningful patterns related to different mental states from complex EEG signals.
    Fast Rank-1 NMF for Missing Data with KL Divergence. (arXiv:2110.12595v2 [stat.ML] UPDATED)
    We propose a fast non-gradient-based method of rank-1 non-negative matrix factorization (NMF) for missing data, called A1GM, that minimizes the KL divergence from an input matrix to the reconstructed rank-1 matrix. Our method is based on our new finding of an analytical closed-formula of the best rank-1 non-negative multiple matrix factorization (NMMF), a variety of NMF. NMMF is known to exactly solve NMF for missing data if positions of missing values satisfy a certain condition, and A1GM transforms a given matrix so that the analytical solution to NMMF can be applied. We empirically show that A1GM is more efficient than a gradient method with competitive reconstruction errors.
    A global convergence theory for deep ReLU implicit networks via over-parameterization. (arXiv:2110.05645v2 [cs.LG] UPDATED)
    Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the solution of an equilibrium equation. Although a line of recent empirical studies has demonstrated its superior performances, the theoretical understanding of implicit neural networks is limited. In general, the equilibrium equation may not be well-posed during the training. As a result, there is no guarantee that a vanilla (stochastic) gradient descent (SGD) training nonlinear implicit neural networks can converge. This paper fills the gap by analyzing the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks. For an $m$-width implicit neural network with ReLU activation and $n$ training samples, we show that a randomly initialized gradient descent converges to a global minimum at a linear rate for the square loss function if the implicit neural network is \textit{over-parameterized}. It is worth noting that, unlike existing works on the convergence of (S)GD on finite-layer over-parameterized neural networks, our convergence results hold for implicit neural networks, where the number of layers is \textit{infinite}.
    Simple data balancing achieves competitive worst-group-accuracy. (arXiv:2110.14503v2 [cs.LG] UPDATED)
    We study the problem of learning classifiers that perform well across (known or unknown) groups of data. After observing that common worst-group-accuracy datasets suffer from substantial imbalances, we set out to compare state-of-the-art methods to simple balancing of classes and groups by either subsampling or reweighting data. Our results show that these data balancing baselines achieve state-of-the-art-accuracy, while being faster to train and requiring no additional hyper-parameters. In addition, we highlight that access to group information is most critical for model selection purposes, and not so much during training. All in all, our findings beg closer examination of benchmarks and methods for research in worst-group-accuracy optimization.
    AESPA: Accuracy Preserving Low-degree Polynomial Activation for Fast Private Inference. (arXiv:2201.06699v2 [cs.CR] UPDATED)
    Hybrid private inference (PI) protocol, which synergistically utilizes both multi-party computation (MPC) and homomorphic encryption, is one of the most prominent techniques for PI. However, even the state-of-the-art PI protocols are bottlenecked by the non-linear layers, especially the activation functions. Although a standard non-linear activation function can generate higher model accuracy, it must be processed via a costly garbled-circuit MPC primitive. A polynomial activation can be processed via Beaver's multiplication triples MPC primitive but has been incurring severe accuracy drops so far. In this paper, we propose an accuracy preserving low-degree polynomial activation function (AESPA) that exploits the Hermite expansion of the ReLU and basis-wise normalization. We apply AESPA to popular ML models, such as VGGNet, ResNet, and pre-activation ResNet, to show an inference accuracy comparable to those of the standard models with ReLU activation, achieving superior accuracy over prior low-degree polynomial studies. When applied to the all-RELU baseline on the state-of-the-art Delphi PI protocol, AESPA shows up to 42.1x and 28.3x lower online latency and communication cost.
    Medical Dead-ends and Learning to Identify High-risk States and Treatments. (arXiv:2110.04186v3 [cs.LG] UPDATED)
    Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We introduce an inherently different approach that identifies possible "dead-ends" of a state space. We focus on the condition of patients in the intensive care unit, where a "medical dead-end" indicates that a patient will expire, regardless of all potential future treatment sequences. We postulate "treatment security" as avoiding treatments with probability proportional to their chance of leading to dead-ends, present a formal proof, and frame discovery as an RL problem. We then train three independent deep neural models for automated state construction, dead-end discovery and confirmation. Our empirical results discover that dead-ends exist in real clinical data among septic patients, and further reveal gaps between secure treatments and those that were administered.
    The Computational Drug Repositioning without Negative Sampling. (arXiv:2111.14696v2 [cs.LG] UPDATED)
    Computational drug repositioning technology is an effective tool to accelerate drug development. Although this technique has been widely used and successful in recent decades, many existing models still suffer from multiple drawbacks such as the massive number of unvalidated drug-disease associations and the inner product. The limitations of these works are mainly due to the following two reasons: firstly, previous works used negative sampling techniques to treat unvalidated drug-disease associations as negative samples, which is invalid in real-world settings; secondly, the inner product cannot fully take into account the feature information contained in the latent factor of drug and disease. In this paper, we propose a novel PUON framework for addressing the above deficiencies, which models the risk estimator of computational drug repositioning only using validated (Positive) and unvalidated (Unlabelled) drug-disease associations without employing negative sampling techniques. The PUON also proposed an Outer Neighborhood-based classifier for modeling the cross-feature information of latent facotors of drugs and diseases. For a comprehensive comparison, we considered 8 popular baselines. Extensive experiments in three real-world datasets showed that PUON achieved the best performance based on 3 evaluation metrics.
    On the relationship between predictive coding and backpropagation. (arXiv:2106.13082v4 [q-bio.NC] UPDATED)
    Artificial neural networks are often interpreted as abstract models of biological neuronal networks, but they are typically trained using the biologically unrealistic backpropagation algorithm and its variants. Predictive coding has been proposed as a potentially more biologically realistic alternative to backpropagation for training neural networks. This manuscript reviews and extends recent work on the mathematical relationship between predictive coding and backpropagation for training feedforward artificial neural networks on supervised learning tasks. Implications of these results for the interpretation of predictive coding and deep neural networks as models of biological learning are discussed along with a repository of functions, Torch2PC, for performing predictive coding with PyTorch neural network models.
    Variational Diffusion Models. (arXiv:2107.00630v3 [cs.LG] UPDATED)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum.
    Learning Patterns in Imaginary Vowels for an Intelligent Brain Computer Interface (BCI) Design. (arXiv:2010.12066v2 [cs.LG] UPDATED)
    Technology advancements made it easy to measure non-invasive and high-quality electroencephalograph (EEG) signals from human's brain. Hence, development of robust and high-performance AI algorithms becomes crucial to properly process the EEG signals and recognize the patterns, which lead to an appropriate control signal. Despite the advancements in processing the motor imagery EEG signals, the healthcare applications, such as emotion detection, are still in the early stages of AI design. In this paper, we propose a modular framework for the recognition of vowels as the AI part of a brain computer interface system. We carefully designed the modules to discriminate the English vowels given the raw EEG signals, and meanwhile avoid the typical issued with the data-poor environments like most of the healthcare applications. The proposed framework consists of appropriate signal segmentation, filtering, extraction of spectral features, reducing the dimensions by means of principle component analysis, and finally a multi-class classification by decision-tree-based support vector machine (DT-SVM). The performance of our framework was evaluated by a combination of test-set and resubstitution (also known as apparent) error rates. We provide the algorithms of the proposed framework to make it easy for future researchers and developers who want to follow the same workflow.
    Federated Multi-Task Learning under a Mixture of Distributions. (arXiv:2108.10252v3 [cs.LG] UPDATED)
    The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heterogeneity of local data distributions. Federated multi-task learning (MTL) approaches can learn personalized models by formulating an opportune penalized optimization problem. The penalization term can capture complex relations among personalized models, but eschews clear statistical assumptions about local data distributions. In this work, we propose to study federated MTL under the flexible assumption that each local data distribution is a mixture of unknown underlying distributions. This assumption encompasses most of the existing personalized FL approaches and leads to federated EM-like algorithms for both client-server and fully decentralized settings. Moreover, it provides a principled way to serve personalized models to clients not seen at training time. The algorithms' convergence is analyzed through a novel federated surrogate optimization framework, which can be of general interest. Experimental results on FL benchmarks show that our approach provides models with higher accuracy and fairness than state-of-the-art methods.
    Intrusion Prevention through Optimal Stopping. (arXiv:2111.00289v6 [cs.LG] UPDATED)
    We study automated intrusion prevention using reinforcement learning. Following a novel approach, we formulate the problem of intrusion prevention as an (optimal) multiple stopping problem. This formulation gives us insight into the structure of optimal policies, which we show to have threshold properties. For most practical cases, it is not feasible to obtain an optimal defender policy using dynamic programming. We therefore develop a reinforcement learning approach to approximate an optimal policy. Our method for learning and validating policies includes two systems: a simulation system where defender policies are incrementally learned and an emulation system where statistics are produced that drive simulation runs and where learned policies are evaluated. We show that our approach can produce effective defender policies for a practical IT infrastructure of limited size. Inspection of the learned policies confirms that they exhibit threshold properties.
    When OT meets MoM: Robust estimation of Wasserstein Distance. (arXiv:2006.10325v3 [stat.ML] UPDATED)
    Issued from Optimal Transport, the Wasserstein distance has gained importance in Machine Learning due to its appealing geometrical properties and the increasing availability of efficient approximations. In this work, we consider the problem of estimating the Wasserstein distance between two probability distributions when observations are polluted by outliers. To that end, we investigate how to leverage Medians of Means (MoM) estimators to robustify the estimation of Wasserstein distance. Exploiting the dual Kantorovitch formulation of Wasserstein distance, we introduce and discuss novel MoM-based robust estimators whose consistency is studied under a data contamination model and for which convergence rates are provided. These MoM estimators enable to make Wasserstein Generative Adversarial Network (WGAN) robust to outliers, as witnessed by an empirical study on two benchmarks CIFAR10 and Fashion MNIST. Eventually, we discuss how to combine MoM with the entropy-regularized approximation of the Wasserstein distance and propose a simple MoM-based re-weighting scheme that could be used in conjunction with the Sinkhorn algorithm.
    Signal Decomposition Using Masked Proximal Operators. (arXiv:2202.09338v1 [cs.LG])
    We consider the well-studied problem of decomposing a vector time series signal into components with different characteristics, such as smooth, periodic, nonnegative, or sparse. We propose a simple and general framework in which the components are defined by loss functions (which include constraints), and the signal decomposition is carried out by minimizing the sum of losses of the components (subject to the constraints). When each loss function is the negative log-likelihood of a density for the signal component, our method coincides with maximum a posteriori probability (MAP) estimation; but it also includes many other interesting cases. We give two distributed optimization methods for computing the decomposition, which find the optimal decomposition when the component class loss functions are convex, and are good heuristics when they are not. Both methods require only the masked proximal operator of each of the component loss functions, a generalization of the well-known proximal operator that handles missing entries in its argument. Both methods are distributed, i.e., handle each component separately. We derive tractable methods for evaluating the masked proximal operators of some loss functions that, to our knowledge, have not appeared in the literature.
    Improving Intrinsic Exploration with Language Abstractions. (arXiv:2202.08938v1 [cs.LG])
    Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore natural language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These language-based variants outperform their non-linguistic forms by 45-85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.
    Federated Learning via Plurality Vote. (arXiv:2110.02998v2 [cs.LG] UPDATED)
    Federated learning allows collaborative workers to solve a machine learning problem while preserving data privacy. Recent studies have tackled various challenges in federated learning, but the joint optimization of communication overhead, learning reliability, and deployment efficiency is still an open problem. To this end, we propose a new scheme named federated learning via plurality vote (FedVote). In each communication round of FedVote, workers transmit binary or ternary weights to the server with low communication overhead. The model parameters are aggregated via weighted voting to enhance the resilience against Byzantine attacks. When deployed for inference, the model with binary or ternary weights is resource-friendly to edge devices. We show that our proposed method can reduce quantization error and converges faster compared with the methods directly quantizing the model updates.
    A Fair Comparison of Graph Neural Networks for Graph Classification. (arXiv:1912.09893v3 [cs.LG] UPDATED)
    Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about their lack in scientific publications to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigorousness and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field, by providing a much needed grounding for rigorous evaluations of graph classification models.
    tinyMAN: Lightweight Energy Manager using Reinforcement Learning for Energy Harvesting Wearable IoT Devices. (arXiv:2202.09297v1 [eess.SP])
    Advances in low-power electronics and machine learning techniques lead to many novel wearable IoT devices. These devices have limited battery capacity and computational power. Thus, energy harvesting from ambient sources is a promising solution to power these low-energy wearable devices. They need to manage the harvested energy optimally to achieve energy-neutral operation, which eliminates recharging requirements. Optimal energy management is a challenging task due to the dynamic nature of the harvested energy and the battery energy constraints of the target device. To address this challenge, we present a reinforcement learning-based energy management framework, tinyMAN, for resource-constrained wearable IoT devices. The framework maximizes the utilization of the target device under dynamic energy harvesting patterns and battery constraints. Moreover, tinyMAN does not rely on forecasts of the harvested energy which makes it a prediction-free approach. We deployed tinyMAN on a wearable device prototype using TensorFlow Lite for Micro thanks to its small memory footprint of less than 100 KB. Our evaluations show that tinyMAN achieves less than 2.36 ms and 27.75 $\mu$J while maintaining up to 45% higher utility compared to prior approaches.
    Exploring Adversarially Robust Training for Unsupervised Domain Adaptation. (arXiv:2202.09300v1 [cs.CV])
    Unsupervised Domain Adaptation (UDA) methods aim to transfer knowledge from a labeled source domain to an unlabeled target domain. UDA has been extensively studied in the computer vision literature. Deep networks have been shown to be vulnerable to adversarial attacks. However, very little focus is devoted to improving the adversarial robustness of deep UDA models, causing serious concerns about model reliability. Adversarial Training (AT) has been considered to be the most successful adversarial defense approach. Nevertheless, conventional AT requires ground-truth labels to generate adversarial examples and train models, which limits its effectiveness in the unlabeled target domain. In this paper, we aim to explore AT to robustify UDA models: How to enhance the unlabeled data robustness via AT while learning domain-invariant features for UDA? To answer this, we provide a systematic study into multiple AT variants that potentially apply to UDA. Moreover, we propose a novel Adversarially Robust Training method for UDA accordingly, referred to as ARTUDA. Extensive experiments on multiple attacks and benchmarks show that ARTUDA consistently improves the adversarial robustness of UDA models.
    Machine Learning Models in Stock Market Prediction. (arXiv:2202.09359v1 [q-fin.ST])
    The paper focuses on predicting the Nifty 50 Index by using 8 Supervised Machine Learning Models. The techniques used for empirical study are Adaptive Boost (AdaBoost), k-Nearest Neighbors (kNN), Linear Regression (LR), Artificial Neural Network (ANN), Random Forest (RF), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM) and Decision Trees (DT). Experiments are based on historical data of Nifty 50 Index of Indian Stock Market from 22nd April, 1996 to 16th April, 2021, which is time series data of around 25 years. During the period there were 6220 trading days excluding all the non trading days. The entire trading dataset was divided into 4 subsets of different size-25% of entire data, 50% of entire data, 75% of entire data and entire data. Each subset was further divided into 2 parts-training data and testing data. After applying 3 tests- Test on Training Data, Test on Testing Data and Cross Validation Test on each subset, the prediction performance of the used models were compared and after comparison, very interesting results were found. The evaluation results indicate that Adaptive Boost, k- Nearest Neighbors, Random Forest and Decision Trees under performed with increase in the size of data set. Linear Regression and Artificial Neural Network shown almost similar prediction results among all the models but Artificial Neural Network took more time in training and validating the model. Thereafter Support Vector Machine performed better among rest of the models but with increase in the size of data set, Stochastic Gradient Descent performed better than Support Vector Machine.
    Deep Movement Primitives: toward Breast Cancer Examination Robot. (arXiv:2202.09265v1 [cs.RO])
    Breast cancer is the most common type of cancer worldwide. A robotic system performing autonomous breast palpation can make a significant impact on the related health sector worldwide. However, robot programming for breast palpating with different geometries is very complex and unsolved. Robot learning from demonstrations (LfD) reduces the programming time and cost. However, the available LfD are lacking the modelling of the manipulation path/trajectory as an explicit function of the visual sensory information. This paper presents a novel approach to manipulation path/trajectory planning called deep Movement Primitives that successfully generates the movements of a manipulator to reach a breast phantom and perform the palpation. We show the effectiveness of our approach by a series of real-robot experiments of reaching and palpating a breast phantom. The experimental results indicate our approach outperforms the state-of-the-art method.
    Amenable Sparse Network Investigator. (arXiv:2202.09284v1 [cs.LG])
    As the optimization problem of pruning a neural network is nonconvex and the strategies are only guaranteed to find local solutions, a good initialization becomes paramount. To this end, we present the Amenable Sparse Network Investigator ASNI algorithm that learns a sparse network whose initialization is compressed. The learned sparse structure found by ASNI is amenable since its corresponding initialization, which is also learned by ASNI, consists of only 2L numbers, where L is the number of layers. Requiring just a few numbers for parameter initialization of the learned sparse network makes the sparse network amenable. The learned initialization set consists of L signed pairs that act as the centroids of parameter values of each layer. These centroids are learned by the ASNI algorithm after only one single round of training. We experimentally show that the learned centroids are sufficient to initialize the nonzero parameters of the learned sparse structure in order to achieve approximately the accuracy of non-sparse network. We also empirically show that in order to learn the centroids, one needs to prune the network globally and gradually. Hence, for parameter pruning we propose a novel strategy based on a sigmoid function that specifies the sparsity percentage across the network globally. Then, pruning is done magnitude-wise and after each epoch of training. We have performed a series of experiments utilizing networks such as ResNets, VGG-style, small convolutional, and fully connected ones on ImageNet, CIFAR10, and MNIST datasets.
    Toward a traceable, explainable, and fairJD/Resume recommendation system. (arXiv:2202.08960v1 [cs.IR])
    In the last few decades, companies are interested to adopt an online automated recruitment process in an international recruitment environment. The problem is that the recruitment of employees through the manual procedure is a time and money consuming process. As a result, processing a significant number of applications through conventional methods can lead to the recruitment of clumsy individuals. Different JD/Resume matching model architectures have been proposed and reveal a high accuracy level in selecting relevant candidatesfor the required job positions. However, the development of an automatic recruitment system is still one of the main challenges. The reason is that the development of a fully automated recruitment system is a difficult task and poses different challenges. For example, providing a detailed matching explanation for the targeted stakeholders is needed to ensure a transparent recommendation. There are several knowledge bases that represent skills and competencies (e.g, ESCO, O*NET) that are used to identify the candidate and the required job skills for a matching purpose. Besides, modernpre-trained language models are fine-tuned for this context such as identifying lines where a specific feature was introduced. Typically, pre-trained language models use transfer-based machine learning models to be fine-tuned for a specific field. In this proposal, our aim is to explore how modern language models (based on transformers) can be combined with knowledge bases and ontologies to enhance the JD/Resume matching process. Our system aims at using knowledge bases and features to support the explainability of the JD/Resume matching. Finally, given that multiple software components, datasets, ontology, andmachine learning models will be explored, we aim at proposing a fair, ex-plainable, and traceable architecture for a Resume/JD matching purpose.
    Autoencoding Low-Resolution MRI for Semantically Smooth Interpolation of Anisotropic MRI. (arXiv:2202.09258v1 [eess.IV])
    High-resolution medical images are beneficial for analysis but their acquisition may not always be feasible. Alternatively, high-resolution images can be created from low-resolution acquisitions using conventional upsampling methods, but such methods cannot exploit high-level contextual information contained in the images. Recently, better performing deep-learning based super-resolution methods have been introduced. However, these methods are limited by their supervised character, i.e. they require high-resolution examples for training. Instead, we propose an unsupervised deep learning semantic interpolation approach that synthesizes new intermediate slices from encoded low-resolution examples. To achieve semantically smooth interpolation in through-plane direction, the method exploits the latent space generated by autoencoders. To generate new intermediate slices, latent space encodings of two spatially adjacent slices are combined using their convex combination. Subsequently, the combined encoding is decoded to an intermediate slice. To constrain the model, a notion of semantic similarity is defined for a given dataset. For this, a new loss is introduced that exploits the spatial relationship between slices of the same volume. During training, an existing in-between slice is generated using a convex combination of its neighboring slice encodings. The method was trained and evaluated using publicly available cardiac cine, neonatal brain and adult brain MRI scans. In all evaluations, the new method produces significantly better results in terms of Structural Similarity Index Measure and Peak Signal-to-Noise Ratio (p< 0.001 using one-sided Wilcoxon signed-rank test) than a cubic B-spline interpolation approach. Given the unsupervised nature of the method, high-resolution training data is not required and hence, the method can be readily applied in clinical settings.
    Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models. (arXiv:2202.08974v1 [cs.SD])
    Automatic emotion recognition plays a key role in computer-human interaction as it has the potential to enrich the next-generation artificial intelligence with emotional intelligence. It finds applications in customer and/or representative behavior analysis in call centers, gaming, personal assistants, and social robots, to mention a few. Therefore, there has been an increasing demand to develop robust automatic methods to analyze and recognize the various emotions. In this paper, we propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities. More specifically, we i) adapt a residual network (ResNet) based model trained on a large-scale speaker recognition task using transfer learning along with a spectrogram augmentation approach to recognize emotions from speech, and ii) use a fine-tuned bidirectional encoder representations from transformers (BERT) based model to represent and recognize emotions from the text. The proposed system then combines the ResNet and BERT-based model scores using a late fusion strategy to further improve the emotion recognition performance. The proposed multimodal solution addresses the data scarcity limitation in emotion recognition using transfer learning, data augmentation, and fine-tuning, thereby improving the generalization performance of the emotion recognition models. We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that both audio and text-based models improve the emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
    Private Quantiles Estimation in the Presence of Atoms. (arXiv:2202.08969v1 [stat.ML])
    We address the differentially private estimation of multiple quantiles (MQ) of a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem and establish that the resulting method is closely related to the current state-of-the-art, the JointExp algorithm, sharing in particular the same computational complexity and a similar efficiency. However, we demonstrate both theoretically and empirically that (non-smoothed) JointExp suffers from an important lack of performance in the case of peaked distributions, with a potentially catastrophic impact in the presence of atoms. While its smoothed version would allow to leverage the performance guarantees of IS, it remains an open challenge to implement. As a proxy to fix the problem we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.
    Scalable approach to many-body localization via quantum data. (arXiv:2202.08853v1 [cond-mat.dis-nn])
    We are interested in how quantum data can allow for practical solutions to otherwise difficult computational problems. A notoriously difficult phenomenon from quantum many-body physics is the emergence of many-body localization (MBL). So far, is has evaded a comprehensive analysis. In particular, numerical studies are challenged by the exponential growth of the Hilbert space dimension. As many of these studies rely on exact diagonalization of the system's Hamiltonian, only small system sizes are accessible. In this work, we propose a highly flexible neural network based learning approach that, once given training data, circumvents any computationally expensive step. In this way, we can efficiently estimate common indicators of MBL such as the adjacent gap ratio or entropic quantities. Our estimator can be trained on data from various system sizes at once which grants the ability to extrapolate from smaller to larger ones. Moreover, using transfer learning we show that already a two-dimensional feature vector is sufficient to obtain several different indicators at various energy densities at once. We hope that our approach can be applied to large-scale quantum experiments to provide new insights into quantum many-body physics.
    Graph Auto-Encoder Via Neighborhood Wasserstein Reconstruction. (arXiv:2202.09025v1 [cs.LG])
    Graph neural networks (GNNs) have drawn significant research attention recently, mostly under the setting of semi-supervised learning. When task-agnostic representations are preferred or supervision is simply unavailable, the auto-encoder framework comes in handy with a natural graph reconstruction objective for unsupervised GNN training. However, existing graph auto-encoders are designed to reconstruct the direct links, so GNNs trained in this way are only optimized towards proximity-oriented graph mining tasks, and will fall short when the topological structures matter. In this work, we revisit the graph encoding process of GNNs which essentially learns to encode the neighborhood information of each node into an embedding vector, and propose a novel graph decoder to reconstruct the entire neighborhood information regarding both proximity and structure via Neighborhood Wasserstein Reconstruction (NWR). Specifically, from the GNN embedding of each node, NWR jointly predicts its node degree and neighbor feature distribution, where the distribution prediction adopts an optimal-transport loss based on the Wasserstein distance. Extensive experiments on both synthetic and real-world network datasets show that the unsupervised node representations learned with NWR have much more advantageous in structure-oriented graph mining tasks, while also achieving competitive performance in proximity-oriented ones.
    Graph Data Augmentation for Graph Machine Learning: A Survey. (arXiv:2202.08871v1 [cs.LG])
    Data augmentation has recently seen increased interest in graph machine learning given its ability of creating extra training data and improving model generalization. Despite this recent upsurge, this area is still relatively underexplored, due to the challenges brought by complex, non-Euclidean structure of graph data, which limits the direct analogizing of traditional augmentation operations on other types of data. In this paper, we present a comprehensive and systematic survey of graph data augmentation that summarizes the literature in a structured manner. We first categorize graph data augmentation operations based on the components of graph data they modify or create. Next, we introduce recent advances in graph data augmentation, separating by their learning objectives and methodologies. We conclude by outlining currently unsolved challenges as well as directions for future research. Overall, this paper aims to clarify the landscape of existing literature in graph data augmentation and motivate additional work in this area. We provide a GitHub repository (https://github.com/zhao-tong/graph-data-augmentation-papers) with a reading list that will be continuously updated.
    A Note on the Implicit Bias Towards Minimal Depth of Deep Neural Networks. (arXiv:2202.09028v1 [cs.LG])
    Deep learning systems have steadily advanced the state of the art in a wide variety of benchmarks, demonstrating impressive performance in tasks ranging from image classification \citep{taigman2014deepface,zhai2021scaling}, language processing \citep{devlin-etal-2019-bert,NEURIPS2020_1457c0d6}, open-ended environments \citep{SilverHuangEtAl16nature,arulkumaran2019alphastar}, to coding \citep{chen2021evaluating}. A central aspect that enables the success of these systems is the ability to train deep models instead of wide shallow ones \citep{7780459}. Intuitively, a neural network is decomposed into hierarchical representations from raw data to high-level, more abstract features. While training deep neural networks repetitively achieves superior performance against their shallow counterparts, an understanding of the role of depth in representation learning is still lacking. In this work, we suggest a new perspective on understanding the role of depth in deep learning. We hypothesize that {\bf\em SGD training of overparameterized neural networks exhibits an implicit bias that favors solutions of minimal effective depth}. Namely, SGD trains neural networks for which the top several layers are redundant. To evaluate the redundancy of layers, we revisit the recently discovered phenomenon of neural collapse \citep{Papyan24652,han2021neural}.
    Stochastic Perturbations of Tabular Features for Non-Deterministic Inference with Automunge. (arXiv:2202.09248v1 [cs.LG])
    Injecting gaussian noise into training features is well known to have regularization properties. This paper considers noise injections to numeric or categoric tabular features as passed to inference, which translates inference to a non-deterministic outcome and may have relevance to fairness considerations, adversarial example protection, or other use cases benefiting from non-determinism. We offer the Automunge library for tabular preprocessing as a resource for the practice, which includes options to integrate random sampling or entropy seeding with the support of quantum circuits for an improved randomness profile in comparison to pseudo random number generators. Benchmarking shows that neural networks may demonstrate an improved performance when a known noise profile is mitigated with corresponding injections to both training and inference, and that gradient boosting appears to be robust to a mild noise profile in inference, suggesting that stochastic perturbations could be integrated into existing data pipelines for prior trained gradient boosting models.
    Quantifying the Effects of Data Augmentation. (arXiv:2202.09134v1 [cs.LG])
    We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: Contrary to common intuition, data augmentation may increase rather than decrease uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a linear ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.
    On Variance Estimation of Random Forests. (arXiv:2202.09008v1 [stat.ML])
    Ensemble methods based on subsampling, such as random forests, are popular in applications due to their high predictive accuracy. Existing literature views a random forest prediction as an infinite-order incomplete U-statistic to quantify its uncertainty. However, these methods focus on a small subsampling size of each tree, which is theoretically valid but practically limited. This paper develops an unbiased variance estimator based on incomplete U-statistics, which allows the tree size to be comparable with the overall sample size, making statistical inference possible in a broader range of real applications. Simulation results demonstrate that our estimators enjoy lower bias and more accurate confidence interval coverage without additional computational costs. We also propose a local smoothing procedure to reduce the variation of our estimator, which shows improved numerical performance when the number of trees is relatively small. Further, we investigate the ratio consistency of our proposed variance estimator under specific scenarios. In particular, we develop a new "double U-statistic" formulation to analyze the Hoeffding decomposition of the estimator's variance.
    Training neural networks using monotone variational inequality. (arXiv:2202.08876v1 [stat.ML])
    Despite the vast empirical success of neural networks, theoretical understanding of the training procedures remains limited, especially in providing performance guarantees of testing performance due to the non-convex nature of the optimization problem. Inspired by a recent work of (Juditsky & Nemirovsky, 2019), instead of using the traditional loss function minimization approach, we reduce the training of the network parameters to another problem with convex structure -- to solve a monotone variational inequality (MVI). The solution to MVI can be found by computationally efficient procedures, and importantly, this leads to performance guarantee of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery accuracy and prediction accuracy under the theoretical setting of training one-layer linear neural network. In addition, we study the use of MVI for training multi-layer neural networks and propose a practical algorithm called \textit{stochastic variational inequality} (SVI), and demonstrates its applicability in training fully-connected neural networks and graph neural networks (GNN) (SVI is completely general and can be used to train other types of neural networks). We demonstrate the competitive or better performance of SVI compared to the stochastic gradient descent (SGD) on both synthetic and real network data prediction tasks regarding various performance metrics.
    A Summary of the ComParE COVID-19 Challenges. (arXiv:2202.08981v1 [cs.SD])
    The COVID-19 pandemic has caused massive humanitarian and economic damage. Teams of scientists from a broad range of disciplines have searched for methods to help governments and communities combat the disease. One avenue from the machine learning field which has been explored is the prospect of a digital mass test which can detect COVID-19 from infected individuals' respiratory sounds. We present a summary of the results from the INTERSPEECH 2021 Computational Paralinguistics Challenges: COVID-19 Cough, (CCS) and COVID-19 Speech, (CSS).
    Proximal Optimal Transport Modeling of Population Dynamics. (arXiv:2106.06345v4 [cs.LG] UPDATED)
    We propose a new approach to model the collective dynamics of a population of particles evolving with time. As is often the case in challenging scientific applications, notably single-cell genomics, measuring features for these particles requires destroying them. As a result, the population can only be monitored with periodic snapshots, obtained by sampling a few particles that are sacrificed in exchange for measurements. Given only access to these snapshots, can we reconstruct likely individual trajectories for all other particles? We propose to model these trajectories as collective realizations of a causal Jordan-Kinderlehrer-Otto (JKO) flow of measures: The JKO scheme posits that the new configuration taken by a population at time $t+1$ is one that trades off an improvement, in the sense that it decreases an energy, while remaining close (in Wasserstein distance) to the previous configuration observed at $t$. In order to learn such an energy using only snapshots, we propose JKOnet, a neural architecture that computes (in end-to-end differentiable fashion) the JKO flow given a parametric energy and initial configuration of points. We demonstrate the good performance and robustness of the JKOnet fitting procedure, compared to a more direct forward method.
    Neo: Generalizing Confusion Matrix Visualization to Hierarchical and Multi-Output Labels. (arXiv:2110.12536v2 [cs.HC] UPDATED)
    The confusion matrix, a ubiquitous visualization for helping people evaluate machine learning models, is a tabular layout that compares predicted class labels against actual class labels over all data instances. We conduct formative research with machine learning practitioners at Apple and find that conventional confusion matrices do not support more complex data-structures found in modern-day applications, such as hierarchical and multi-output labels. To express such variations of confusion matrices, we design an algebra that models confusion matrices as probability distributions. Based on this algebra, we develop Neo, a visual analytics system that enables practitioners to flexibly author and interact with hierarchical and multi-output confusion matrices, visualize derived metrics, renormalize confusions, and share matrix specifications. Finally, we demonstrate Neo's utility with three model evaluation scenarios that help people better understand model performance and reveal hidden confusions.
    Deep Reinforcement Learning Based Multi-Access Edge Computing Schedule for Internet of Vehicle. (arXiv:2202.08972v1 [cs.LG])
    As intelligent transportation systems been implemented broadly and unmanned arial vehicles (UAVs) can assist terrestrial base stations acting as multi-access edge computing (MEC) to provide a better wireless network communication for Internet of Vehicles (IoVs), we propose a UAVs-assisted approach to help provide a better wireless network service retaining the maximum Quality of Experience(QoE) of the IoVs on the lane. In the paper, we present a Multi-Agent Graph Convolutional Deep Reinforcement Learning (M-AGCDRL) algorithm which combines local observations of each agent with a low-resolution global map as input to learn a policy for each agent. The agents can share their information with others in graph attention networks, resulting in an effective joint policy. Simulation results show that the M-AGCDRL method enables a better QoE of IoTs and achieves good performance.
    Strategyproof Learning: Building Trustworthy User-Generated Datasets. (arXiv:2106.02398v2 [cs.LG] UPDATED)
    We prove in this paper that, perhaps surprisingly, incentivizing data misreporting is not a fatality. By leveraging a careful design of the loss function, we propose Licchavi, a global and personalized learning framework with provable strategyproofness guarantees. Essentially, we prove that no user can gain much by replying to Licchavi's queries with answers that deviate from their true preferences. Interestingly, Licchavi also promotes the desirable "one person, one unit-force vote" fairness principle. Furthermore, our empirical evaluation of its performance showcases Licchavi's real-world applicability. We believe that our results are critical for the safety of any learning scheme that leverages user-generated data.
    Case-based Reasoning for Better Generalization in Text-Adventure Games. (arXiv:2110.08470v2 [cs.CL] UPDATED)
    Text-based games (TBG) have emerged as promising environments for driving research in grounded language understanding and studying problems like generalization and sample efficiency. Several deep reinforcement learning (RL) methods with varying architectures and learning schemes have been proposed for TBGs. However, these methods fail to generalize efficiently, especially under distributional shifts. In a departure from deep RL approaches, in this paper, we propose a general method inspired by case-based reasoning to train agents and generalize out of the training distribution. The case-based reasoner collects instances of positive experiences from the agent's interaction with the world in the past and later reuses the collected experiences to act efficiently. The method can be applied in conjunction with any existing on-policy neural agent in the literature for TBGs. Our experiments show that the proposed approach consistently improves existing methods, obtains good out-of-distribution generalization, and achieves new state-of-the-art results on widely used environments.
    A Theoretical Perspective on Hyperdimensional Computing. (arXiv:2010.07426v3 [cs.LG] UPDATED)
    Hyperdimensional (HD) computing is a set of neurally inspired methods for obtaining high-dimensional, low-precision, distributed representations of data. These representations can be combined with simple, neurally plausible algorithms to effect a variety of information processing tasks. HD computing has recently garnered significant interest from the computer hardware community as an energy-efficient, low-latency, and noise-robust tool for solving learning problems. In this review, we present a unified treatment of the theoretical foundations of HD computing with a focus on the suitability of representations for learning.
    Unsupervised Multiple-Object Tracking with a Dynamical Variational Autoencoder. (arXiv:2202.09315v1 [cs.LG])
    In this paper, we present an unsupervised probabilistic model and associated estimation algorithm for multi-object tracking (MOT) based on a dynamical variational autoencoder (DVAE), called DVAE-UMOT. The DVAE is a latent-variable deep generative model that can be seen as an extension of the variational autoencoder for the modeling of temporal sequences. It is included in DVAE-UMOT to model the objects' dynamics, after being pre-trained on an unlabeled synthetic dataset of single-object trajectories. Then the distributions and parameters of DVAE-UMOT are estimated on each multi-object sequence to track using the principles of variational inference: Definition of an approximate posterior distribution of the latent variables and maximization of the corresponding evidence lower bound of the data likehood function. DVAE-UMOT is shown experimentally to compete well with and even surpass the performance of two state-of-the-art probabilistic MOT models. Code and data are publicly available.
    Beyond Low-pass Filtering: Graph Convolutional Networks with Automatic Filtering. (arXiv:2107.04755v2 [cs.LG] UPDATED)
    Graph convolutional networks are becoming indispensable for deep learning from graph-structured data. Most of the existing graph convolutional networks share two big shortcomings. First, they are essentially low-pass filters, thus the potentially useful middle and high frequency band of graph signals are ignored. Second, the bandwidth of existing graph convolutional filters is fixed. Parameters of a graph convolutional filter only transform the graph inputs without changing the curvature of a graph convolutional filter function. In reality, we are uncertain about whether we should retain or cut off the frequency at a certain point unless we have expert domain knowledge. In this paper, we propose Automatic Graph Convolutional Networks (AutoGCN) to capture the full spectrum of graph signals and automatically update the bandwidth of graph convolutional filters. While it is based on graph spectral theory, our AutoGCN is also localized in space and has a spatial form. Experimental results show that AutoGCN achieves significant improvement over baseline methods which only work as low-pass filters.
    An Integrated Optimization and Machine Learning Models to Predict the Admission Status of Emergency Patients. (arXiv:2202.09196v1 [cs.LG])
    This work proposes a framework for optimizing machine learning algorithms. The practicality of the framework is illustrated using an important case study from the healthcare domain, which is predicting the admission status of emergency department (ED) patients (e.g., admitted vs. discharged) using patient data at the time of triage. The proposed framework can mitigate the crowding problem by proactively planning the patient boarding process. A large retrospective dataset of patient records is obtained from the electronic health record database of all ED visits over three years from three major locations of a healthcare provider in the Midwest of the US. Three machine learning algorithms are proposed: T-XGB, T-ADAB, and T-MLP. T-XGB integrates extreme gradient boosting (XGB) and Tabu Search (TS), T-ADAB integrates Adaboost and TS, and T-MLP integrates multi-layer perceptron (MLP) and TS. The proposed algorithms are compared with the traditional algorithms: XGB, ADAB, and MLP, in which their parameters are tunned using grid search. The three proposed algorithms and the original ones are trained and tested using nine data groups that are obtained from different feature selection methods. In other words, 54 models are developed. Performance was evaluated using five measures: Area under the curve (AUC), sensitivity, specificity, F1, and accuracy. The results show that the newly proposed algorithms resulted in high AUC and outperformed the traditional algorithms. The T-ADAB performs the best among the newly developed algorithms. The AUC, sensitivity, specificity, F1, and accuracy of the best model are 95.4%, 99.3%, 91.4%, 95.2%, 97.2%, respectively.
    Audio inpainting of music by means of neural networks. (arXiv:1810.12138v3 [cs.SD] UPDATED)
    We studied the ability of deep neural networks (DNNs) to restore missing audio content based on its context, a process usually referred to as audio inpainting. We focused on gaps in the range of tens of milliseconds. The proposed DNN structure was trained on audio signals containing music and musical instruments, separately, with 64-ms long gaps. The input to the DNN was the context, i.e., the signal surrounding the gap, transformed into time-frequency (TF) coefficients. Our results were compared to those obtained from a reference method based on linear predictive coding (LPC). For music, our DNN significantly outperformed the reference method, demonstrating a generally good usability of the proposed DNN structure for inpainting complex audio signals like music.
    Molecule Generation for Drug Design: a Graph Learning Perspective. (arXiv:2202.09212v1 [cs.LG])
    Machine learning has revolutionized many fields, and graph learning is recently receiving increasing attention. From the application perspective, one of the emerging and attractive areas is aiding the design and discovery of molecules, especially in drug industry. In this survey, we provide an overview of the state-of-the-art molecule (and mostly for de novo drug) design and discovery aiding methods whose methodology involves (deep) graph learning. Specifically, we propose to categorize these methods into three groups: i) all at once, ii) fragment-based and iii) node-by-node. We further present some representative public datasets and summarize commonly utilized evaluation metrics for generation and optimization, respectively. Finally, we discuss challenges and directions for future research, from the drug design perspective.
    DataMUX: Data Multiplexing for Neural Networks. (arXiv:2202.09318v1 [cs.LG])
    In this paper, we introduce data multiplexing (DataMUX), a technique that enables deep neural networks to process multiple inputs simultaneously using a single compact representation. DataMUX demonstrates that neural networks are capable of generating accurate predictions over mixtures of inputs, resulting in increased throughput with minimal extra memory requirements. Our approach uses two key components -- 1) a multiplexing layer that performs a fixed linear transformation to each input before combining them to create a mixed representation of the same size as a single input, which is then processed by the base network, and 2) a demultiplexing layer that converts the base network's output back into independent representations before producing predictions for each input. We show the viability of DataMUX for different architectures (Transformers, and to a lesser extent MLPs and CNNs) across six different tasks spanning sentence classification, named entity recognition and image classification. For instance, DataMUX for Transformers can multiplex up to $20$x/$40$x inputs, achieving $11$x/$18$x increase in throughput with minimal absolute performance drops of $<2\%$ and $<4\%$ respectively on MNLI, a natural language inference task. We also provide a theoretical construction for multiplexing in self-attention networks and analyze the effect of various design elements in DataMUX.
    Efficient computation of the volume of a polytope in high-dimensions using Piecewise Deterministic Markov Processes. (arXiv:2202.09129v1 [stat.CO])
    Computing the volume of a polytope in high dimensions is computationally challenging but has wide applications. Current state-of-the-art algorithms to compute such volumes rely on efficient sampling of a Gaussian distribution restricted to the polytope, using e.g. Hamiltonian Monte Carlo. We present a new sampling strategy that uses a Piecewise Deterministic Markov Process. Like Hamiltonian Monte Carlo, this new method involves simulating trajectories of a non-reversible process and inherits similar good mixing properties. However, importantly, the process can be simulated more easily due to its piecewise linear trajectories - and this leads to a reduction of the computational cost by a factor of the dimension of the space. Our experiments indicate that our method is numerically robust and is one order of magnitude faster (or better) than existing methods using Hamiltonian Monte Carlo. On a single core processor, we report computational time of a few minutes up to dimension 500.
    Machine learning models and facial regions videos for estimating heart rate: a review on Patents, Datasets and Literature. (arXiv:2202.08913v1 [cs.LG])
    Estimating heart rate is important for monitoring users in various situations. Estimates based on facial videos are increasingly being researched because it makes it possible to monitor cardiac information in a non-invasive way and because the devices are simpler, requiring only cameras that capture the user's face. From these videos of the user's face, machine learning is able to estimate heart rate. This study investigates the benefits and challenges of using machine learning models to estimate heart rate from facial videos, through patents, datasets, and articles review. We searched Derwent Innovation, IEEE Xplore, Scopus, and Web of Science knowledge bases and identified 7 patent filings, 11 datasets, and 20 articles on heart rate, photoplethysmography, or electrocardiogram data. In terms of patents, we note the advantages of inventions related to heart rate estimation, as described by the authors. In terms of datasets, we discovered that most of them are for academic purposes and with different signs and annotations that allow coverage for subjects other than heartbeat estimation. In terms of articles, we discovered techniques, such as extracting regions of interest for heart rate reading and using Video Magnification for small motion extraction, and models such as EVM-CNN and VGG-16, that extract the observed individual's heart rate, the best regions of interest for signal extraction and ways to process them.
    Masked prediction tasks: a parameter identifiability view. (arXiv:2202.09305v1 [cs.LG])
    The vast majority of work in self-supervised learning, both theoretical and empirical (though mostly the latter), have largely focused on recovering good features for downstream tasks, with the definition of "good" often being intricately tied to the downstream task itself. This lens is undoubtedly very interesting, but suffers from the problem that there isn't a "canonical" set of downstream tasks to focus on -- in practice, this problem is usually resolved by competing on the benchmark dataset du jour. In this paper, we present an alternative lens: one of parameter identifiability. More precisely, we consider data coming from a parametric probabilistic model, and train a self-supervised learning predictor with a suitably chosen parametric form. Then, we ask whether we can read off the ground truth parameters of the probabilistic model from the optimal predictor. We focus on the widely used self-supervised learning method of predicting masked tokens, which is popular for both natural languages and visual data. While incarnations of this approach have already been successfully used for simpler probabilistic models (e.g. learning fully-observed undirected graphical models), we focus instead on latent-variable models capturing sequential structures -- namely Hidden Markov Models with both discrete and conditionally Gaussian observations. We show that there is a rich landscape of possibilities, out of which some prediction tasks yield identifiability, while others do not. Our results, borne of a theoretical grounding of self-supervised learning, could thus potentially beneficially inform practice. Moreover, we uncover close connections with uniqueness of tensor rank decompositions -- a widely used tool in studying identifiability through the lens of the method of moments.
    Geometric Regularization from Overparameterization explains Double Descent and other findings. (arXiv:2202.09276v1 [cs.LG])
    The volume of the distribution of possible weight configurations associated with a loss value may be the source of implicit regularization from overparameterization due to the phenomenon of contracting volume with increasing dimensions for geometric figures demonstrated by hyperspheres. This paper introduces geometric regularization and explores potential applicability to several unexplained phenomenon including double descent, the differences between wide and deep networks, the benefits of He initialization and retained proximity in training, gradient confusion, fitness landscape properties, double descent in other learning paradigms, and other findings for overparameterized learning. Experiments are conducted by aggregating histograms of loss values corresponding to randomly sampled initializations in small setups, which find directional correlations in zero or central mode dominance from deviations in width, depth, and initialization distributions. Double descent is likely due to a regularization phase change when a training path reaches low enough loss that the loss manifold volume contraction from a reduced range of potential weight sets is amplified by an overparameterized geometry.
    Learning Predictions for Algorithms with Predictions. (arXiv:2202.09312v1 [cs.LG])
    A burgeoning paradigm in algorithm design is the field of algorithms with predictions, in which algorithms are designed to take advantage of a possibly-imperfect prediction of some aspect of the problem. While much work has focused on using predictions to improve competitive ratios, running times, or other performance measures, less effort has been devoted to the question of how to obtain the predictions themselves, especially in the critical online setting. We introduce a general design approach for algorithms that learn predictors: (1) identify a functional dependence of the performance measure on the prediction quality, and (2) apply techniques from online learning to learn predictors against adversarial instances, tune robustness-consistency trade-offs, and obtain new statistical guarantees. We demonstrate the effectiveness of our approach at deriving learning algorithms by analyzing methods for bipartite matching, page migration, ski-rental, and job scheduling. In the first and last settings we improve upon existing learning-theoretic results by deriving online results, obtaining better or more general statistical guarantees, and utilizing a much simpler analysis, while in the second and fourth we provide the first learning-theoretic guarantees.  ( 2 min )
    Surf or sleep? Understanding the influence of bedtime patterns on campus. (arXiv:2202.09283v1 [cs.AI])
    Poor sleep habits may cause serious problems of mind and body, and it is a commonly observed issue for college students due to study workload as well as peer and social influence. Understanding its impact and identifying students with poor sleep habits matters a lot in educational management. Most of the current research is either based on self-reports and questionnaires, suffering from a small sample size and social desirability bias, or the methods used are not suitable for the education system. In this paper, we develop a general data-driven method for identifying students' sleep patterns according to their Internet access pattern stored in the education management system and explore its influence from various aspects. First, we design a Possion-based probabilistic mixture model to cluster students according to the distribution of bedtime and identify students who are used to staying up late. Second, we profile students from five aspects (including eight dimensions) based on campus-behavior data and build Bayesian networks to explore the relationship between behavioral characteristics and sleeping habits. Finally, we test the predictability of sleeping habits. This paper not only contributes to the understanding of student sleep from a cognitive and behavioral perspective but also presents a new approach that provides an effective framework for various educational institutions to detect the sleeping patterns of students.  ( 2 min )
    ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!. (arXiv:2202.09357v1 [cs.LG])
    We introduce \algname{ProxSkip} -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($\psi$) function. The canonical approach to solving such problems is via the proximal gradient descent (\algname{ProxGD}) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation of the gradient, which is the case in many applications. \algname{ProxSkip} allows for the expensive prox operator to be skipped in most iterations: while its iteration complexity is $\cO(\kappa \log \nicefrac{1}{\varepsilon})$, where $\kappa$ is the condition number of $f$, the number of prox evaluations is $\cO(\sqrt{\kappa} \log \nicefrac{1}{\varepsilon})$ only. Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local \algname{GD} step independently on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging. In this context, \algname{ProxSkip} offers an effective {\em acceleration} of communication complexity. Unlike other local gradient-type methods, such as \algname{FedAvg}, \algname{SCAFFOLD}, \algname{S-Local-GD} and \algname{FedLin}, whose theoretical communication complexity is worse than, or at best matching, that of vanilla \algname{GD} in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneity-bounding assumptions.  ( 2 min )
    (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering. (arXiv:2202.09277v1 [cs.CV])
    Spatio-temporal scene-graph approaches to video-based reasoning tasks such as video question-answering (QA) typically construct such graphs for every video frame. Such approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside the videos. Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D transformation module, following which we register the video frames into a shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it. Such a (2.5+1)D graph is then segregated into a static sub-graph and a dynamic sub-graph, corresponding to whether the objects within them usually move in the world. The nodes in the dynamic graph are enriched with motion features capturing their interactions with other graph nodes. Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions are captured at varied granularity. To demonstrate the effectiveness of our approach, we present experiments on the NExT-QA and AVSD-QA datasets. Our results show that our proposed (2.5+1)D representation leads to faster training and inference, while our hierarchical model showcases superior performance on the video QA task versus the state of the art.  ( 2 min )
    Human experts vs. machines in taxa recognition. (arXiv:1708.06899v5 [stat.ML] UPDATED)
    The step of expert taxa recognition currently slows down the response time of many bioassessments. Shifting to quicker and cheaper state-of-the-art machine learning approaches is still met with expert scepticism towards the ability and logic of machines. In our study, we investigate both the differences in accuracy and in the identification logic of taxonomic experts and machines. We propose a systematic approach utilizing deep Convolutional Neural Nets with the transfer learning paradigm and extensively evaluate it over a multi-pose taxonomic dataset with hierarchical labels specifically created for this comparison. We also study the prediction accuracy on different ranks of taxonomic hierarchy in detail. We used support vector machine classifier as a benchmark. Our results revealed that human experts using actual specimens yield the lowest classification error ($\overline{CE}=6.1\%$). However, a much faster, automated approach using deep Convolutional Neural Nets comes close to human accuracy ($\overline{CE}=11.4\%$) when a typical flat classification approach is used. Contrary to previous findings in the literature, we find that for machines following a typical flat classification approach commonly used in machine learning performs better than forcing machines to adopt a hierarchical, local per parent node approach used by human taxonomic experts ($\overline{CE}=13.8\%$). Finally, we publicly share our unique dataset to serve as a public benchmark dataset in this field.  ( 2 min )
    Rethinking Pareto Frontier for Performance Evaluation of Deep Neural Networks. (arXiv:2202.09275v1 [cs.LG])
    Recent efforts in deep learning show a considerable advancement in redesigning deep learning models for low-resource and edge devices. The performance optimization of deep learning models are conducted either manually or through automatic architecture search, or a combination of both. The throughput and power consumption of deep learning models strongly depend on the target hardware. We propose to use a \emph{multi-dimensional} Pareto frontier to re-define the efficiency measure using a multi-objective optimization, where other variables such as power consumption, latency, and accuracy play a relative role in defining a dominant model. Furthermore, a random version of the multi-dimensional Pareto frontier is introduced to mitigate the uncertainty of accuracy, latency, and throughput variations of deep learning models in different experimental setups. These two breakthroughs provide an objective benchmarking method for a wide range of deep learning models. We run our novel multi-dimensional stochastic relative efficiency on a wide range of deep image classification models trained ImageNet data. Thank to this new approach we combine competing variables with stochastic nature simultaneously in a single relative efficiency measure. This allows to rank deep models that run efficiently on different computing hardware, and combines inference efficiency with training efficiency objectively.  ( 2 min )
    Learning Physics-Informed Neural Networks without Stacked Back-propagation. (arXiv:2202.09340v1 [cs.LG])
    Physics-Informed Neural Network (PINN) has become a commonly used machine learning approach to solve partial differential equations (PDE). But, facing high-dimensional second-order PDE problems, PINN will suffer from severe scalability issues since its loss includes second-order derivatives, the computational cost of which will grow along with the dimension during stacked back-propagation. In this paper, we develop a novel approach that can significantly accelerate the training of Physics-Informed Neural Networks. In particular, we parameterize the PDE solution by the Gaussian smoothed model and show that, derived from Stein's Identity, the second-order derivatives can be efficiently calculated without back-propagation. We further discuss the model capacity and provide variance reduction methods to address key limitations in the derivative estimation. Experimental results show that our proposed method can achieve competitive error compared to standard PINN training but is two orders of magnitude faster.
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v2 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.  ( 2 min )
    Analogical Proportions. (arXiv:2006.02854v11 [cs.LO] UPDATED)
    Analogy-making is at the core of human and artificial intelligence and creativity with applications to such diverse tasks as proving mathematical theorems and building mathematical theories, common sense reasoning, learning, language acquisition, and story telling. This paper introduces from first principles an abstract algebraic framework of analogical proportions of the form `$a$ is to $b$ what $c$ is to $d$' in the general setting of universal algebra. This enables us to compare mathematical objects possibly across different domains in a uniform way which is crucial for AI-systems. It turns out that our notion of analogical proportions has appealing mathematical properties. As we construct our model from first principles using only elementary concepts of universal algebra, and since our model questions some basic properties of analogical proportions presupposed in the literature, to convince the reader of the plausibility of our model we show that it can be naturally embedded into first-order logic via model-theoretic types and prove from that perspective that analogical proportions are compatible with structure-preserving mappings. This provides conceptual evidence for its applicability. In a broader sense, this paper is a first step towards a theory of analogical reasoning and learning systems with potential applications to fundamental AI-problems like common sense reasoning and computational learning and creativity.  ( 2 min )
    What Doesn't Kill You Makes You Robust(er): How to Adversarially Train against Data Poisoning. (arXiv:2102.13624v2 [cs.LG] UPDATED)
    Data poisoning is a threat model in which a malicious actor tampers with training data to manipulate outcomes at inference time. A variety of defenses against this threat model have been proposed, but each suffers from at least one of the following flaws: they are easily overcome by adaptive attacks, they severely reduce testing performance, or they cannot generalize to diverse data poisoning threat models. Adversarial training, and its variants, are currently considered the only empirically strong defense against (inference-time) adversarial attacks. In this work, we extend the adversarial training framework to defend against (training-time) data poisoning, including targeted and backdoor attacks. Our method desensitizes networks to the effects of such attacks by creating poisons during training and injecting them into training batches. We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses such as DP-SGD or (evasion) adversarial training.  ( 2 min )
    Quantification of Actual Road User Behavior on the Basis of Given Traffic Rules. (arXiv:2202.09269v1 [cs.RO])
    Driving on roads is restricted by various traffic rules, aiming to ensure safety for all traffic participants. However, human road users usually do not adhere to these rules strictly, resulting in varying degrees of rule conformity. Such deviations from given rules are key components of today's road traffic. In autonomous driving, robotic agents can disturb traffic flow, when rule deviations are not taken into account. In this paper, we present an approach to derive the distribution of degrees of rule conformity from human driving data. We demonstrate our method with the Waymo Open Motion dataset and Safety Distance and Speed Limit rules.  ( 2 min )
    Towards a Numerical Proof of Turbulence Closure. (arXiv:2202.09289v1 [physics.flu-dyn])
    The development of turbulence closure models, parametrizing the influence of small non-resolved scales on the dynamics of large resolved ones, is an outstanding theoretical challenge with vast applicative relevance. We present a closure, based on deep recurrent neural networks, that quantitatively reproduces, within statistical errors, Eulerian and Lagrangian structure functions and the intermittent statistics of the energy cascade, including those of subgrid fluxes. To achieve high-order statistical accuracy, and thus a stringent statistical test, we employ shell models of turbulence. Our results encourage the development of similar approaches for 3D Navier-Stokes turbulence.  ( 2 min )
    FinNet: Solving Time-Independent Differential Equations with Finite Difference Neural Network. (arXiv:2202.09282v1 [cs.LG])
    In recent years, deep learning approaches for partial differential equations have received much attention due to their mesh-freeness and other desirable properties. However, most of the works so far concentrated on time-dependent nonlinear differential equations. In this work, we analyze potential issues with the well-known Physic Informed Neural Network for differential equations that are not time-dependent. This analysis motivates us to introduce a novel technique, namely FinNet, for solving differential equations by incorporating finite difference into deep learning. Even though we use a mesh during the training phase, the prediction phase is mesh-free. We illustrate the effectiveness of our method through experiments on solving various equations.
    Curriculum optimization for low-resource speech recognition. (arXiv:2202.08883v1 [eess.AS])
    Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which still remains a challenging task. We propose an automated curriculum learning approach to optimize the sequence of training examples based on both the progress of the model while training and prior knowledge about the difficulty of the training examples. We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions. The proposed method improves speech recognition Word Error Rate performance by up to 33% relative over the baseline system  ( 2 min )
    Designing Effective Sparse Expert Models. (arXiv:2202.08906v1 [cs.CL])
    Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).  ( 2 min )
    Understanding and Shifting Preferences for Battery Electric Vehicles. (arXiv:2202.08963v1 [cs.IR])
    Identifying personalized interventions for an individual is an important task. Recent work has shown that interventions that do not consider the demographic background of individual consumers can, in fact, produce the reverse effect, strengthening opposition to electric vehicles. In this work, we focus on methods for personalizing interventions based on an individual's demographics to shift the preferences of consumers to be more positive towards Battery Electric Vehicles (BEVs). One of the constraints in building models to suggest interventions for shifting preferences is that each intervention can influence the effectiveness of later interventions. This, in turn, requires many subjects to evaluate effectiveness of each possible intervention. To address this, we propose to identify personalized factors influencing BEV adoption, such as barriers and motivators. We present a method for predicting these factors and show that the performance is better than always predicting the most frequent factors. We then present a Reinforcement Learning (RL) model that learns the most effective interventions, and compare the number of subjects required for each approach.  ( 2 min )
    Adaptivity and Confounding in Multi-Armed Bandit Experiments. (arXiv:2202.09036v1 [cs.LG])
    Multi-armed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action -- suggesting optimal adaptivity -- but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations -- suggesting unusual robustness. At the core of the paper is a new model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.  ( 2 min )
    Probing Pretrained Models of Source Code. (arXiv:2202.08975v1 [cs.SE])
    Deep learning models are widely used for solving challenging code processing tasks, such as code generation or code summarization. Traditionally, a specific model architecture was carefully built to solve a particular code processing task. However, recently general pretrained models such as CodeBERT or CodeT5 have been shown to outperform task-specific models in many applications. While pretrained models are known to learn complex patterns from data, they may fail to understand some properties of source code. To test diverse aspects of code understanding, we introduce a set of diagnosting probing tasks. We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow and namespaces, and natural language naming. We also investigate how probing results are affected by using code-specific pretraining objectives, varying the model size, or finetuning.  ( 2 min )
    Gaussian Mixture Convolution Networks. (arXiv:2202.09153v1 [cs.LG])
    This paper proposes a novel method for deep learning based on the analytical convolution of multidimensional Gaussian mixtures. In contrast to tensors, these do not suffer from the curse of dimensionality and allow for a compact representation, as data is only stored where details exist. Convolution kernels and data are Gaussian mixtures with unconstrained weights, positions, and covariance matrices. Similar to discrete convolutional networks, each convolution step produces several feature channels, represented by independent Gaussian mixtures. Since traditional transfer functions like ReLUs do not produce Gaussian mixtures, we propose using a fitting of these functions instead. This fitting step also acts as a pooling layer if the number of Gaussian components is reduced appropriately. We demonstrate that networks based on this architecture reach competitive accuracy on Gaussian mixtures fitted to the MNIST and ModelNet data sets.  ( 2 min )
    Fairness constraint in Structural Econometrics and Application to fair estimation using Instrumental Variables. (arXiv:2202.08977v1 [econ.EM])
    A supervised machine learning algorithm determines a model from a learning sample that will be used to predict new observations. To this end, it aggregates individual characteristics of the observations of the learning sample. But this information aggregation does not consider any potential selection on unobservables and any status-quo biases which may be contained in the training sample. The latter bias has raised concerns around the so-called \textit{fairness} of machine learning algorithms, especially towards disadvantaged groups. In this chapter, we review the issue of fairness in machine learning through the lenses of structural econometrics models in which the unknown index is the solution of a functional equation and issues of endogeneity are explicitly accounted for. We model fairness as a linear operator whose null space contains the set of strictly {\it fair} indexes. A {\it fair} solution is obtained by projecting the unconstrained index into the null space of this operator or by directly finding the closest solution of the functional equation into this null space. We also acknowledge that policymakers may incur a cost when moving away from the status quo. Achieving \textit{approximate fairness} is obtained by introducing a fairness penalty in the learning procedure and balancing more or less heavily the influence between the status quo and a full fair solution.  ( 2 min )
    Testing the boundaries: Normalizing Flows for higher dimensional data sets. (arXiv:2202.09188v1 [stat.ML])
    Normalizing Flows (NFs) are emerging as a powerful class of generative models, as they not only allow for efficient sampling, but also deliver, by construction, density estimation. They are of great potential usage in High Energy Physics (HEP), where complex high dimensional data and probability distributions are everyday's meal. However, in order to fully leverage the potential of NFs it is crucial to explore their robustness as data dimensionality increases. Thus, in this contribution, we discuss the performances of some of the most popular types of NFs on the market, on some toy data sets with increasing number of dimensions.  ( 2 min )
    Word Embeddings for Automatic Equalization in Audio Mixing. (arXiv:2202.08898v1 [cs.SD])
    In recent years, machine learning has been widely adopted to automate the audio mixing process. Automatic mixing systems have been applied to various audio effects such as gain-adjustment, stereo panning, equalization, and reverberation. These systems can be controlled through visual interfaces, providing audio examples, using knobs, and semantic descriptors. Using semantic descriptors or textual information to control these systems is an effective way for artists to communicate their creative goals. Furthermore, sometimes artists use non-technical words that may not be understood by the mixing system, or even a mixing engineer. In this paper, we explore the novel idea of using word embeddings to represent semantic descriptors. Word embeddings are generally obtained by training neural networks on large corpora of written text. These embeddings serve as the input layer of the neural network to create a translation from words to EQ settings. Using this technique, the machine learning model can also generate EQ settings for semantic descriptors that it has not seen before. We perform experiments to demonstrate the feasibility of this idea. In addition, we compare the EQ settings of humans with the predictions of the neural network to evaluate the quality of predictions. The results showed that the embedding layer enables the neural network to understand semantic descriptors. We observed that the models with embedding layers perform better those without embedding layers, but not as good as human labels.  ( 2 min )
    The Response Shift Paradigm to Quantify Human Trust in AI Recommendations. (arXiv:2202.08979v1 [cs.HC])
    Explainability, interpretability and how much they affect human trust in AI systems are ultimately problems of human cognition as much as machine learning, yet the effectiveness of AI recommendations and the trust afforded by end-users are typically not evaluated quantitatively. We developed and validated a general purpose Human-AI interaction paradigm which quantifies the impact of AI recommendations on human decisions. In our paradigm we confronted human users with quantitative prediction tasks: asking them for a first response, before confronting them with an AI's recommendations (and explanation), and then asking the human user to provide an updated final response. The difference between final and first responses constitutes the shift or sway in the human decision which we use as metric of the AI's recommendation impact on the human, representing the trust they place on the AI. We evaluated this paradigm on hundreds of users through Amazon Mechanical Turk using a multi-branched experiment confronting users with good/poor AI systems that had good, poor or no explainability. Our proof-of-principle paradigm allows one to quantitatively compare the rapidly growing set of XAI/IAI approaches in terms of their effect on the end-user and opens up the possibility of (machine) learning trust.  ( 2 min )
    Can Interpretable Reinforcement Learning Manage Assets Your Way?. (arXiv:2202.09064v1 [cs.LG])
    Personalisation of products and services is fast becoming the driver of success in banking and commerce. Machine learning holds the promise of gaining a deeper understanding of and tailoring to customers' needs and preferences. Whereas traditional solutions to financial decision problems frequently rely on model assumptions, reinforcement learning is able to exploit large amounts of data to improve customer modelling and decision-making in complex financial environments with fewer assumptions. Model explainability and interpretability present challenges from a regulatory perspective which demands transparency for acceptance; they also offer the opportunity for improved insight into and understanding of customers. Post-hoc approaches are typically used for explaining pretrained reinforcement learning models. Based on our previous modeling of customer spending behaviour, we adapt our recent reinforcement learning algorithm that intrinsically characterizes desirable behaviours and we transition to the problem of asset management. We train inherently interpretable reinforcement learning agents to give investment advice that is aligned with prototype financial personality traits which are combined to make a final recommendation. We observe that the trained agents' advice adheres to their intended characteristics, they learn the value of compound growth, and, without any explicit reference, the notion of risk as well as improved policy convergence.  ( 2 min )
    Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods. (arXiv:2202.08907v1 [cs.DS])
    We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.  ( 2 min )
    Rethinking Machine Learning Robustness via its Link with the Out-of-Distribution Problem. (arXiv:2202.08944v1 [cs.LG])
    Despite multiple efforts made towards robust machine learning (ML) models, their vulnerability to adversarial examples remains a challenging problem that calls for rethinking the defense strategy. In this paper, we take a step back and investigate the causes behind ML models' susceptibility to adversarial examples. In particular, we focus on exploring the cause-effect link between adversarial examples and the out-of-distribution (OOD) problem. To that end, we propose an OOD generalization method that stands against both adversary-induced and natural distribution shifts. Through an OOD to in-distribution mapping intuition, our approach translates OOD inputs to the data distribution used to train and test the model. Through extensive experiments on three benchmark image datasets of different scales (MNIST, CIFAR10, and ImageNet) and by leveraging image-to-image translation methods, we confirm that the adversarial examples problem is a special case of the wider OOD generalization problem. Across all datasets, we show that our translation-based approach consistently improves robustness to OOD adversarial inputs and outperforms state-of-the-art defenses by a significant margin, while preserving the exact accuracy on benign (in-distribution) data. Furthermore, our method generalizes on naturally OOD inputs such as darker or sharper images  ( 2 min )
    PGCN: Progressive Graph Convolutional Networks for Spatial-Temporal Traffic Forecasting. (arXiv:2202.08982v1 [cs.LG])
    The complex spatial-temporal correlations in transportation networks make the traffic forecasting problem challenging. Since transportation system inherently possesses graph structures, much research efforts have been put with graph neural networks. Recently, constructing adaptive graphs to the data has shown promising results over the models relying on a single static graph structure. However, the graph adaptations are applied during the training phases, and do not reflect the data used during the testing phases. Such shortcomings can be problematic especially in traffic forecasting since the traffic data often suffers from the unexpected changes and irregularities in the time series. In this study, we propose a novel traffic forecasting framework called Progressive Graph Convolutional Network (PGCN). PGCN constructs a set of graphs by progressively adapting to input data during the training and the testing phases. Specifically, we implemented the model to construct progressive adjacency matrices by learning trend similarities among graph nodes. Then, the model is combined with the dilated causal convolution and gated activation unit to extract temporal features. With residual and skip connections, PGCN performs the traffic prediction. When applied to four real-world traffic datasets of diverse geometric nature, the proposed model achieves state-of-the-art performance with consistency in all datasets. We conclude that the ability of PGCN to progressively adapt to input data enables the model to generalize in different study sites with robustness.  ( 2 min )
    Modelling the semantics of text in complex document layouts using graph transformer networks. (arXiv:2202.09144v1 [cs.CL])
    Representing structured text from complex documents typically calls for different machine learning techniques, such as language models for paragraphs and convolutional neural networks (CNNs) for table extraction, which prohibits drawing links between text spans from different content types. In this article we propose a model that approximates the human reading pattern of a document and outputs a unique semantic representation for every text span irrespective of the content type they are found in. We base our architecture on a graph representation of the structured text, and we demonstrate that not only can we retrieve semantically similar information across documents but also that the embedding space we generate captures useful semantic information, similar to language models that work only on text sequences.  ( 2 min )
    Nonstationary multi-output Gaussian processes via harmonizable spectral mixtures. (arXiv:2202.09233v1 [stat.ML])
    Kernel design for Multi-output Gaussian Processes (MOGP) has received increased attention recently. In particular, the Multi-Output Spectral Mixture kernel (MOSM) arXiv:1709.01298 approach has been praised as a general model in the sense that it extends other approaches such as Linear Model of Corregionalization, Intrinsic Corregionalization Model and Cross-Spectral Mixture. MOSM relies on Cram\'er's theorem to parametrise the power spectral densities (PSD) as a Gaussian mixture, thus, having a structural restriction: by assuming the existence of a PSD, the method is only suited for multi-output stationary applications. We develop a nonstationary extension of MOSM by proposing the family of harmonizable kernels for MOGPs, a class of kernels that contains both stationary and a vast majority of non-stationary processes. A main contribution of the proposed harmonizable kernels is that they automatically identify a possible nonstationary behaviour meaning that practitioners do not need to choose between stationary or non-stationary kernels. The proposed method is first validated on synthetic data with the purpose of illustrating the key properties of our approach, and then compared to existing MOGP methods on two real-world settings from finance and electroencephalography.  ( 2 min )
    Incorporating Texture Information into Dimensionality Reduction for High-Dimensional Images. (arXiv:2202.09179v1 [cs.CV])
    High-dimensional imaging is becoming increasingly relevant in many fields from astronomy and cultural heritage to systems biology. Visual exploration of such high-dimensional data is commonly facilitated by dimensionality reduction. However, common dimensionality reduction methods do not include spatial information present in images, such as local texture features, into the construction of low-dimensional embeddings. Consequently, exploration of such data is typically split into a step focusing on the attribute space followed by a step focusing on spatial information, or vice versa. In this paper, we present a method for incorporating spatial neighborhood information into distance-based dimensionality reduction methods, such as t-Distributed Stochastic Neighbor Embedding (t-SNE). We achieve this by modifying the distance measure between high-dimensional attribute vectors associated with each pixel such that it takes the pixel's spatial neighborhood into account. Based on a classification of different methods for comparing image patches, we explore a number of different approaches. We compare these approaches from a theoretical and experimental point of view. Finally, we illustrate the value of the proposed methods by qualitative and quantitative evaluation on synthetic data and two real-world use cases.  ( 2 min )
    Symphony: Composing Interactive Interfaces for Machine Learning. (arXiv:2202.08946v1 [cs.HC])
    Interfaces for machine learning (ML), information and visualizations about models or data, can help practitioners build robust and responsible ML systems. Despite their benefits, recent studies of ML teams and our interviews with practitioners (n=9) showed that ML interfaces have limited adoption in practice. While existing ML interfaces are effective for specific tasks, they are not designed to be reused, explored, and shared by multiple stakeholders in cross-functional teams. To enable analysis and communication between different ML practitioners, we designed and implemented Symphony, a framework for composing interactive ML interfaces with task-specific, data-driven components that can be used across platforms such as computational notebooks and web dashboards. We developed Symphony through participatory design sessions with 10 teams (n=31), and discuss our findings from deploying Symphony to 3 production ML projects at Apple. Symphony helped ML practitioners discover previously unknown issues like data duplicates and blind spots in models while enabling them to share insights with other stakeholders.  ( 2 min )
    Deep-Learning Architectures for Multi-Pitch Estimation: Towards Reliable Evaluation. (arXiv:2202.09198v1 [cs.SD])
    Extracting pitch information from music recordings is a challenging but important problem in music signal processing. Frame-wise transcription or multi-pitch estimation aims for detecting the simultaneous activity of pitches in polyphonic music recordings and has recently seen major improvements thanks to deep-learning techniques, with a variety of proposed network architectures. In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components. We propose several modifications to these architectures including self-attention modules for skip connections, recurrent layers to replace the self-attention, and a multi-task strategy with simultaneous prediction of the degree of polyphony. We compare variants of these architectures in different sizes for multi-pitch estimation, focusing on Western classical music beyond the piano-solo scenario using the MusicNet and Schubert Winterreise datasets. Our experiments indicate that most architectures yield competitive results and that larger model variants seem to be beneficial. However, we find that these results substantially depend on randomization effects and the particular choice of the training-test split, which questions the claim of superiority for particular architectures given only small improvements. We therefore investigate the influence of dataset splits in the presence of several movements of a work cycle (cross-version evaluation) and propose a best-practice splitting strategy for MusicNet, which weakens the influence of individual test tracks and suppresses overfitting to specific works and recording conditions. A final evaluation on a mixed dataset suggests that improvements on one specific dataset do not necessarily generalize to other scenarios, thus emphasizing the need for further high-quality multi-pitch datasets in order to reliably measure progress in music transcription tasks.  ( 2 min )
    System Safety and Artificial Intelligence. (arXiv:2202.09292v1 [eess.SY])
    This chapter formulates seven lessons for preventing harm in artificial intelligence (AI) systems based on insights from the field of system safety for software-based automation in safety-critical domains. New applications of AI across societal domains and public organizations and infrastructures come with new hazards, which lead to new forms of harm, both grave and pernicious. The text addresses the lack of consensus for diagnosing and eliminating new AI system hazards. For decades, the field of system safety has dealt with accidents and harm in safety-critical systems governed by varying degrees of software-based automation and decision-making. This field embraces the core assumption of systems and control that AI systems cannot be safeguarded by technical design choices on the model or algorithm alone, instead requiring an end-to-end hazard analysis and design frame that includes the context of use, impacted stakeholders and the formal and informal institutional environment in which the system operates. Safety and other values are then inherently socio-technical and emergent system properties that require design and control measures to instantiate these across the technical, social and institutional components of a system. This chapter honors system safety pioneer Nancy Leveson, by situating her core lessons for today's AI system safety challenges. For every lesson, concrete tools are offered for rethinking and reorganizing the safety management of AI systems, both in design and governance. This history tells us that effective AI safety management requires transdisciplinary approaches and a shared language that allows involvement of all levels of society.  ( 2 min )
    Handling Imbalanced Datasets Through Optimum-Path Forest. (arXiv:2202.08934v1 [cs.LG])
    In the last decade, machine learning-based approaches became capable of performing a wide range of complex tasks sometimes better than humans, demanding a fraction of the time. Such an advance is partially due to the exponential growth in the amount of data available, which makes it possible to extract trustworthy real-world information from them. However, such data is generally imbalanced since some phenomena are more likely than others. Such a behavior yields considerable influence on the machine learning model's performance since it becomes biased on the more frequent data it receives. Despite the considerable amount of machine learning methods, a graph-based approach has attracted considerable notoriety due to the outstanding performance over many applications, i.e., the Optimum-Path Forest (OPF). In this paper, we propose three OPF-based strategies to deal with the imbalance problem: the $\text{O}^2$PF and the OPF-US, which are novel approaches for oversampling and undersampling, respectively, as well as a hybrid strategy combining both approaches. The paper also introduces a set of variants concerning the strategies mentioned above. Results compared against several state-of-the-art techniques over public and private datasets confirm the robustness of the proposed approaches.  ( 2 min )
    Soft Actor-Critic Deep Reinforcement Learning for Fault Tolerant Flight Control. (arXiv:2202.09262v1 [cs.SY])
    Fault-tolerant flight control faces challenges, as developing a model-based controller for each unexpected failure is unrealistic, and online learning methods can handle limited system complexity due to their low sample efficiency. In this research, a model-free coupled-dynamics flight controller for a jet aircraft able to withstand multiple failure types is proposed. An offline trained cascaded Soft Actor-Critic Deep Reinforcement Learning controller is successful on highly coupled maneuvers, including a coordinated 40 degree bank climbing turn with a normalized Mean Absolute Error of 2.64%. The controller is robust to six failure cases, including the rudder jammed at -15 deg, the aileron effectiveness reduced by 70%, a structural failure, icing and a backward c.g. shift as the response is stable and the climbing turn is completed successfully. Robustness to biased sensor noise, atmospheric disturbances, and to varying initial flight conditions and reference signal shapes is also demonstrated.  ( 2 min )
    Deep Interest Highlight Network for Click-Through RatePrediction in Trigger-Induced Recommendation. (arXiv:2202.08959v1 [cs.IR])
    In many classical e-commerce platforms, personalized recommendation has been proven to be of great business value, which can improve user satisfaction and increase the revenue of platforms. In this paper, we present a new recommendation problem, Trigger-Induced Recommendation (TIR), where users' instant interest can be explicitly induced with a trigger item and follow-up related target items are recommended accordingly. TIR has become ubiquitous and popular in e-commerce platforms. In this paper, we figure out that although existing recommendation models are effective in traditional recommendation scenarios by mining users' interests based on their massive historical behaviors, they are struggling in discovering users' instant interests in the TIR scenario due to the discrepancy between these scenarios, resulting in inferior performance. To tackle the problem, we propose a novel recommendation method named Deep Interest Highlight Network (DIHN) for Click-Through Rate (CTR) prediction in TIR scenarios. It has three main components including 1) User Intent Network (UIN), which responds to generate a precise probability score to predict user's intent on the trigger item; 2) Fusion Embedding Module (FEM), which adaptively fuses trigger item and target item embeddings based on the prediction from UIN; and (3) Hybrid Interest Extracting Module (HIEM), which can effectively highlight users' instant interest from their behaviors based on the result of FEM. Extensive offline and online evaluations on a real-world e-commerce platform demonstrate the superiority of DIHN over state-of-the-art methods.  ( 2 min )
    BADDr: Bayes-Adaptive Deep Dropout RL for POMDPs. (arXiv:2202.08884v1 [cs.LG])
    While reinforcement learning (RL) has made great advances in scalability, exploration and partial observability are still active research topics. In contrast, Bayesian RL (BRL) provides a principled answer to both state estimation and the exploration-exploitation trade-off, but struggles to scale. To tackle this challenge, BRL frameworks with various prior assumptions have been proposed, with varied success. This work presents a representation-agnostic formulation of BRL under partially observability, unifying the previous models under one theoretical umbrella. To demonstrate its practical significance we also propose a novel derivation, Bayes-Adaptive Deep Dropout rl (BADDr), based on dropout networks. Under this parameterization, in contrast to previous work, the belief over the state and dynamics is a more scalable inference problem. We choose actions through Monte-Carlo tree search and empirically show that our method is competitive with state-of-the-art BRL methods on small domains while being able to solve much larger ones.  ( 2 min )
    A Distributed Algorithm for Measure-valued Optimization with Additive Objective. (arXiv:2202.08930v1 [math.OC])
    We propose a distributed nonparametric algorithm for solving measure-valued optimization problems with additive objectives. Such problems arise in several contexts in stochastic learning and control including Langevin sampling from an unnormalized prior, mean field neural network learning and Wasserstein gradient flows. The proposed algorithm comprises a two-layer alternating direction method of multipliers (ADMM). The outer-layer ADMM generalizes the Euclidean consensus ADMM to the Wasserstein consensus ADMM, and to its entropy-regularized version Sinkhorn consensus ADMM. The inner-layer ADMM turns out to be a specific instance of the standard Euclidean ADMM. The overall algorithm realizes operator splitting for gradient flows in the manifold of probability measures.  ( 2 min )
    Tackling benign nonconvexity with smoothing and stochastic gradients. (arXiv:2202.09052v1 [cs.LG])
    Non-convex optimization problems are ubiquitous in machine learning, especially in Deep Learning. While such complex problems can often be successfully optimized in practice by using stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success. In particular, the standard analyses do not show global convergence of SGD on non-convex functions, and instead show convergence to stationary points (which can also be local minima or saddle points). We identify a broad class of nonconvex functions for which we can show that perturbed SGD (gradient descent perturbed by stochastic noise -- covering SGD as a special case) converges to a global minimum (or a neighborhood thereof), in contrast to gradient descent without noise that can get stuck in local minima far from a global solution. For example, on non-convex functions that are relatively close to a convex-like (strongly convex or PL) function we show that SGD can converge linearly to a global optimum.  ( 2 min )
    Linearization and Identification of Multiple-Attractors Dynamical System through Laplacian Eigenmaps. (arXiv:2202.09171v1 [cs.LG])
    Dynamical Systems (DS) are fundamental to the modeling and understanding of time evolving phenomena, and find application in physics, biology and control. As determining an analytical description of the dynamics is often difficult, data-driven approaches are preferred for identifying and controlling nonlinear DS with multiple equilibrium points. Identification of such DS has been treated largely as a supervised learning problem. Instead, we focus on a unsupervised learning scenario where we know neither the number nor the type of dynamics. We propose a Graph-based spectral clustering method that takes advantage of a velocity-augmented kernel to connect data-points belonging to the same dynamics, while preserving the natural temporal evolution. We study the eigenvectors and eigenvalues of the Graph Laplacian and show that they form a set of orthogonal embedding spaces, one for each sub-dynamics. We prove that there always exist a set of 2-dimensional embedding spaces in which the sub-dynamics are linear, and n-dimensional embedding where they are quasi-linear. We compare the clustering performance of our algorithm to Kernel K-Means, Spectral Clustering and Gaussian Mixtures and show that, even when these algorithms are provided with the true number of sub-dynamics, they fail to cluster them correctly. We learn a diffeomorphism from the Laplacian embedding space to the original space and show that the Laplacian embedding leads to good reconstruction accuracy and a faster training time through an exponential decaying loss, compared to the state of the art diffeomorphism-based approaches.  ( 2 min )
    Space4HGNN: A Novel, Modularized and Reproducible Platform to Evaluate Heterogeneous Graph Neural Network. (arXiv:2202.09177v1 [cs.LG])
    Heterogeneous Graph Neural Network (HGNN) has been successfully employed in various tasks, but we cannot accurately know the importance of different design dimensions of HGNNs due to diverse architectures and applied scenarios. Besides, in the research community of HGNNs, implementing and evaluating various tasks still need much human effort. To mitigate these issues, we first propose a unified framework covering most HGNNs, consisting of three components: heterogeneous linear transformation, heterogeneous graph transformation, and heterogeneous message passing layer. Then we build a platform Space4HGNN by defining a design space for HGNNs based on the unified framework, which offers modularized components, reproducible implementations, and standardized evaluation for HGNNs. Finally, we conduct experiments to analyze the effect of different designs. With the insights found, we distill a condensed design space and verify its effectiveness.  ( 2 min )
    FLAME: Federated Learning Across Multi-device Environments. (arXiv:2202.08922v1 [cs.LG])
    Federated Learning (FL) enables distributed training of machine learning models while keeping personal data on user devices private. While we witness increasing applications of FL in the area of mobile sensing, such as human-activity recognition, FL has not been studied in the context of a multi-device environment (MDE), wherein each user owns multiple data-producing devices. With the proliferation of mobile and wearable devices, MDEs are increasingly becoming popular in ubicomp settings, therefore necessitating the study of FL in them. FL in MDEs is characterized by high non-IID-ness across clients, complicated by the presence of both user and device heterogeneities. Further, ensuring efficient utilization of system resources on FL clients in a MDE remains an important challenge. In this paper, we propose FLAME, a user-centered FL training approach to counter statistical and system heterogeneity in MDEs, and bring consistency in inference performance across devices. FLAME features (i) user-centered FL training utilizing the time alignment across devices from the same user; (ii) accuracy- and efficiency-aware device selection; and (iii) model personalization to devices. We also present an FL evaluation testbed with realistic energy drain and network bandwidth profiles, and a novel class-based data partitioning scheme to extend existing HAR datasets to a federated setup. Our experiment results on three multi-device HAR datasets show that FLAME outperforms various baselines by 4.8-33.8% higher F-1 score, 1.02-2.86x greater energy efficiency, and up to 2.02x speedup in convergence to target accuracy through fair distribution of the FL workload.  ( 2 min )
    KINet: Keypoint Interaction Networks for Unsupervised Forward Modeling. (arXiv:2202.09006v1 [cs.CV])
    Object-centric representation is an essential abstraction for physical reasoning and forward prediction. Most existing approaches learn this representation through extensive supervision (e.g., object class and bounding box) although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network) -- an end-to-end unsupervised framework to reason about object interactions in complex systems based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, and novel object geometries. Experiments demonstrate the effectiveness of our model to accurately perform forward prediction and learn plannable object-centric representations which can also be used in downstream model-based control tasks.  ( 2 min )
    Stock Embeddings: Learning Distributed Representations for Financial Assets. (arXiv:2202.08968v1 [q-fin.ST])
    Identifying meaningful relationships between the price movements of financial assets is a challenging but important problem in a variety of financial applications. However with recent research, particularly those using machine learning and deep learning techniques, focused mostly on price forecasting, the literature investigating the modelling of asset correlations has lagged somewhat. To address this, inspired by recent successes in natural language processing, we propose a neural model for training stock embeddings, which harnesses the dynamics of historical returns data in order to learn the nuanced relationships that exist between financial assets. We describe our approach in detail and discuss a number of ways that it can be used in the financial domain. Furthermore, we present the evaluation results to demonstrate the utility of this approach, compared to several important benchmarks, in two real-world financial analytics tasks.  ( 2 min )
    Effective Urban Region Representation Learning Using Heterogeneous Urban Graph Attention Network (HUGAT). (arXiv:2202.09021v1 [cs.LG])
    Revealing the hidden patterns shaping the urban environment is essential to understand its dynamics and to make cities smarter. Recent studies have demonstrated that learning the representations of urban regions can be an effective strategy to uncover the intrinsic characteristics of urban areas. However, existing studies lack in incorporating diversity in urban data sources. In this work, we propose heterogeneous urban graph attention network (HUGAT), which incorporates heterogeneity of diverse urban datasets. In HUGAT, heterogeneous urban graph (HUG) incorporates both the geo-spatial and temporal people movement variations in a single graph structure. Given a HUG, a set of meta-paths are designed to capture the rich urban semantics as composite relations between nodes. Region embedding is carried out using heterogeneous graph attention network (HAN). HUGAT is designed to consider multiple learning objectives of city's geo-spatial and mobility variations simultaneously. In our extensive experiments on NYC data, HUGAT outperformed all the state-of-the-art models. Moreover, it demonstrated a robust generalization capability across the various prediction tasks of crime, average personal income, and bike flow as well as the spatial clustering task.  ( 2 min )
    Microplankton life histories revealed by holographic microscopy and deep learning. (arXiv:2202.09046v1 [physics.bio-ph])
    The marine microbial food web plays a central role in the global carbon cycle. Our mechanistic understanding of the ocean, however, is biased towards its larger constituents, while rates and biomass fluxes in the microbial food web are mainly inferred from indirect measurements and ensemble averages. Yet, resolution at the level of the individual microplankton is required to advance our understanding of the oceanic food web. Here, we demonstrate that, by combining holographic microscopy with deep learning, we can follow microplanktons throughout their lifespan, continuously measuring their three dimensional position and dry mass. The deep learning algorithms circumvent the computationally intensive processing of holographic data and allow rapid measurements over extended time periods. This permits us to reliably estimate growth rates, both in terms of dry mass increase and cell divisions, as well as to measure trophic interactions between species such as predation events. The individual resolution provides information about selectivity, individual feeding rates and handling times for individual microplanktons. This method is particularly useful to explore the flux of carbon through micro-zooplankton, the most important and least known group of primary consumers in the global oceans. We exemplify this by detailed descriptions of micro-zooplankton feeding events, cell divisions, and long term monitoring of single cells from division to division.  ( 2 min )
    Energy-Efficient Parking Analytics System using Deep Reinforcement Learning. (arXiv:2202.08973v1 [cs.CV])
    Advances in deep vision techniques and ubiquity of smart cameras will drive the next generation of video analytics. However, video analytics applications consume vast amounts of energy as both deep learning techniques and cameras are power-hungry. In this paper, we focus on a parking video analytics platform and propose RL-CamSleep, a deep reinforcement learning-based technique, to actuate the cameras to reduce the energy footprint while retaining the system's utility. Our key insight is that many video-analytics applications do not always need to be operational, and we can design policies to activate video analytics only when necessary. Moreover, our work is complementary to existing work that focuses on improving hardware and software efficiency. We evaluate our approach on a city-scale parking dataset having 76 streets spread across the city. Our analysis demonstrates how streets have various parking patterns, highlighting the importance of an adaptive policy. Our approach can learn such an adaptive policy that can reduce the average energy consumption by 76.38% and achieve an average accuracy of more than 98% in performing video analytics.  ( 2 min )
    High-performance automatic categorization and attribution of inventory catalogs. (arXiv:2202.08965v1 [cs.IR])
    Techniques of machine learning for automatic text categorization are applied and adapted for the problem of inventory catalog data attribution, with different approaches explored and optimal solution addressing the tradeoff between accuracy and performance is selected.  ( 2 min )
    Critical Checkpoints for Evaluating Defence Models Against Adversarial Attack and Robustness. (arXiv:2202.09039v1 [cs.CR])
    From past couple of years there is a cycle of researchers proposing a defence model for adversaries in machine learning which is arguably defensible to most of the existing attacks in restricted condition (they evaluate on some bounded inputs or datasets). And then shortly another set of researcher finding the vulnerabilities in that defence model and breaking it by proposing a stronger attack model. Some common flaws are been noticed in the past defence models that were broken in very short time. Defence models being broken so easily is a point of concern as decision of many crucial activities are taken with the help of machine learning models. So there is an utter need of some defence checkpoints that any researcher should keep in mind while evaluating the soundness of technique and declaring it to be decent defence technique. In this paper, we have suggested few checkpoints that should be taken into consideration while building and evaluating the soundness of defence models. All these points are recommended after observing why some past defence models failed and how some model remained adamant and proved their soundness against some of the very strong attacks.  ( 2 min )
    A Review on Methods and Applications in Multimodal Deep Learning. (arXiv:2202.09195v1 [cs.LG])
    Deep Learning has implemented a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information using various modalities. Despite the extensive development made for unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. Detailed analysis of the baseline approaches and an in-depth study of recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications has been provided. A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, main issues are highlighted separately for each domain, along with their possible future research directions.  ( 2 min )
    DARL1N: Distributed multi-Agent Reinforcement Learning with One-hop Neighbors. (arXiv:2202.09019v1 [cs.MA])
    Most existing multi-agent reinforcement learning (MARL) methods are limited in the scale of problems they can handle. Particularly, with the increase of the number of agents, their training costs grow exponentially. In this paper, we address this limitation by introducing a scalable MARL method called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). DARL1N is an off-policy actor-critic method that breaks the curse of dimensionality by decoupling the global interactions among agents and restricting information exchanges to one-hop neighbors. Each agent optimizes its action value and policy functions over a one-hop neighborhood, significantly reducing the learning complexity, yet maintaining expressiveness by training with varying numbers and states of neighbors. This structure allows us to formulate a distributed learning framework to further speed up the training procedure. Comparisons with state-of-the-art MARL methods show that DARL1N significantly reduces training time without sacrificing policy quality and is scalable as the number of agents increases.  ( 2 min )
    PerFED-GAN: Personalized Federated Learning via Generative Adversarial Networks. (arXiv:2202.09155v1 [cs.LG])
    Federated learning is gaining popularity as a distributed machine learning method that can be used to deploy AI-dependent IoT applications while protecting client data privacy and security. Due to the differences of clients, a single global model may not perform well on all clients, so the personalized federated learning method, which trains a personalized model for each client that better suits its individual needs, becomes a research hotspot. Most personalized federated learning research, however, focuses on data heterogeneity while ignoring the need for model architecture heterogeneity. Most existing federated learning methods uniformly set the model architecture of all clients participating in federated learning, which is inconvenient for each client's individual model and local data distribution requirements, and also increases the risk of client model leakage. This paper proposes a federated learning method based on co-training and generative adversarial networks(GANs) that allows each client to design its own model to participate in federated learning training independently without sharing any model architecture or parameter information with other clients or a center. In our experiments, the proposed method outperforms the existing methods in mean test accuracy by 42% when the client's model architecture and data distribution vary significantly.  ( 2 min )
    A Free Lunch with Influence Functions? Improving Neural Network Estimates with Concepts from Semiparametric Statistics. (arXiv:2202.09096v1 [cs.LG])
    Parameter estimation in the empirical fields is usually undertaken using parametric models, and such models are convenient because they readily facilitate statistical inference. Unfortunately, they are unlikely to have a sufficiently flexible functional form to be able to adequately model real-world phenomena, and their usage may therefore result in biased estimates and invalid inference. Unfortunately, whilst non-parametric machine learning models may provide the needed flexibility to adapt to the complexity of real-world phenomena, they do not readily facilitate statistical inference, and may still exhibit residual bias. We explore the potential for semiparametric theory (in particular, the Influence Function) to be used to improve neural networks and machine learning algorithms in terms of (a) improving initial estimates without needing more data (b) increasing the robustness of our models, and (c) yielding confidence intervals for statistical inference. We propose a new neural network method MultiNet, which seeks the flexibility and diversity of an ensemble using a single architecture. Results on causal inference tasks indicate that MultiNet yields better performance than other approaches, and that all considered methods are amenable to improvement from semiparametric techniques under certain conditions. In other words, with these techniques we show that we can improve existing neural networks for `free', without needing more data, and without needing to retrain them. Finally, we provide the expression for deriving influence functions for estimands from a general graph, and the code to do so automatically.  ( 2 min )
    Cyclical Focal Loss. (arXiv:2202.08978v1 [cs.CV])
    The cross-entropy softmax loss is the primary loss function used to train deep neural networks. On the other hand, the focal loss function has been demonstrated to provide improved performance when there is an imbalance in the number of training samples in each class, such as in long-tailed datasets. In this paper, we introduce a novel cyclical focal loss and demonstrate that it is a more universal loss function than cross-entropy softmax loss or focal loss. We describe the intuition behind the cyclical focal loss and our experiments provide evidence that cyclical focal loss provides superior performance for balanced, imbalanced, or long-tailed datasets. We provide numerous experimental results for CIFAR-10/CIFAR-100, ImageNet, balanced and imbalanced 4,000 training sample versions of CIFAR-10/CIFAR-100, and ImageNet-LT and Places-LT from the Open Long-Tailed Recognition (OLTR) challenge. Implementing the cyclical focal loss function requires only a few lines of code and does not increase training time. In the spirit of reproducibility, our code is available at \url{https://github.com/lnsmith54/CFL}.  ( 2 min )
    Out of Distribution Data Detection Using Dropout Bayesian Neural Networks. (arXiv:2202.08985v1 [cs.LG])
    We explore the utility of information contained within a dropout based Bayesian neural network (BNN) for the task of detecting out of distribution (OOD) data. We first show how previous attempts to leverage the randomized embeddings induced by the intermediate layers of a dropout BNN can fail due to the distance metric used. We introduce an alternative approach to measuring embedding uncertainty, justify its use theoretically, and demonstrate how incorporating embedding uncertainty improves OOD data identification across three tasks: image classification, language classification, and malware detection.  ( 2 min )
    Ensemble and Multimodal Approach for Forecasting Cryptocurrency Price. (arXiv:2202.08967v1 [q-fin.ST])
    Since the birth of Bitcoin in 2009, cryptocurrencies have emerged to become a global phenomenon and an important decentralized financial asset. Due to this decentralization, the value of these digital currencies against fiat currencies is highly volatile over time. Therefore, forecasting the crypto-fiat currency exchange rate is an extremely challenging task. For reliable forecasting, this paper proposes a multimodal AdaBoost-LSTM ensemble approach that employs all modalities which derive price fluctuation such as social media sentiments, search volumes, blockchain information, and trading data. To better support investment decision making, the approach forecasts also the fluctuation distribution. The conducted extensive experiments demonstrated the effectiveness of relying on multimodalities instead of only trading data. Further experiments demonstrate the outperformance of the proposed approach compared to existing tools and methods with a 19.29% improvement.  ( 2 min )
    Combining Varied Learners for Binary Classification using Stacked Generalization. (arXiv:2202.08910v1 [cs.LG])
    The Machine Learning has various learning algorithms that are better in some or the other aspect when compared with each other but a common error that all algorithms will suffer from is training data with very high dimensional feature set. This usually ends up algorithms into generalization error that deplete the performance. This can be solved using an Ensemble Learning method known as Stacking commonly termed as Stacked Generalization. In this paper we perform binary classification using Stacked Generalization on high dimensional Polycystic Ovary Syndrome dataset and prove the point that model becomes generalized and metrics improve significantly. The various metrics are given in this paper that also point out a subtle transgression found with Receiver Operating Characteristic Curve that was proved to be incorrect.  ( 2 min )
    RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing. (arXiv:2202.08862v1 [cs.SD])
    We present RemixIT, a simple yet effective selfsupervised method for training speech enhancement without the need of a single isolated in-domain speech nor a noise waveform. Our approach overcomes limitations of previous methods which make them dependent to clean in-domain target signals and thus, sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous self-training scheme in which a pre-trained teacher model on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets which are used to train the student network. Vice-versa, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also showcase that RemixIT can be combined with any separation model as well as be applied towards any semi-supervised and unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inside functioning of our self-training scheme wherein the student model keeps obtaining better performance while observing severely degraded pseudo-targets.  ( 2 min )
    Generalizing Aggregation Functions in GNNs:High-Capacity GNNs via Nonlinear Neighborhood Aggregators. (arXiv:2202.09145v1 [cs.LG])
    Graph neural networks (GNNs) have achieved great success in many graph learning tasks. The main aspect powering existing GNNs is the multi-layer network architecture to learn the nonlinear graph representations for the specific learning tasks. The core operation in GNNs is message propagation in which each node updates its representation by aggregating its neighbors' representations. Existing GNNs mainly adopt either linear neighborhood aggregation (mean,sum) or max aggregator in their message propagation. (1) For linear aggregators, the whole nonlinearity and network's capacity of GNNs are generally limited due to deeper GNNs usually suffer from over-smoothing issue. (2) For max aggregator, it usually fails to be aware of the detailed information of node representations within neighborhood. To overcome these issues, we re-think the message propagation mechanism in GNNs and aim to develop the general nonlinear aggregators for neighborhood information aggregation in GNNs. One main aspect of our proposed nonlinear aggregators is that they provide the optimally balanced aggregators between max and mean/sum aggregations. Thus, our aggregators can inherit both (i) high nonlinearity that increases network's capacity and (ii) detail-sensitivity that preserves the detailed information of representations together in GNNs' message propagation. Promising experiments on several datasets show the effectiveness of the proposed nonlinear aggregators.  ( 2 min )
    Fast online inference for nonlinear contextual bandit based on Generative Adversarial Network. (arXiv:2202.08867v1 [cs.LG])
    This work addresses the efficiency concern on inferring a nonlinear contextual bandit when the number of arms $n$ is very large. We propose a neural bandit model with an end-to-end training process to efficiently perform bandit algorithms such as Thompson Sampling and UCB during inference. We advance state-of-the-art time complexity to $O(\log n)$ with approximate Bayesian inference, neural random feature mapping, approximate global maxima and approximate nearest neighbor search. We further propose a generative adversarial network to shift the bottleneck of maximizing the objective for selecting optimal arms from inference time to training time, enjoying significant speedup with additional advantage of enabling batch and parallel processing. %The generative model can inference an approximate argmax of the posterior sampling in logarithmic time complexity with the help of approximate nearest neighbor search. Extensive experiments on classification and recommendation tasks demonstrate order-of-magnitude improvement in inference time no significant degradation on the performance.  ( 2 min )
    GNN-Surrogate: A Hierarchical and Adaptive Graph Neural Network for Parameter Space Exploration of Unstructured-Mesh Ocean Simulations. (arXiv:2202.08956v1 [physics.ao-ph])
    We propose GNN-Surrogate, a graph neural network-based surrogate model to explore the parameter space of ocean climate simulations. Parameter space exploration is important for domain scientists to understand the influence of input parameters (e.g., wind stress) on the simulation output (e.g., temperature). The exploration requires scientists to exhaust the complicated parameter space by running a batch of computationally expensive simulations. Our approach improves the efficiency of parameter space exploration with a surrogate model that predicts the simulation outputs accurately and efficiently. Specifically, GNN-Surrogate predicts the output field with given simulation parameters so scientists can explore the simulation parameter space with visualizations from user-specified visual mappings. Moreover, our graph-based techniques are designed for unstructured meshes, making the exploration of simulation outputs on irregular grids efficient. For efficient training, we generate hierarchical graphs and use adaptive resolutions. We give quantitative and qualitative evaluations on the MPAS-Ocean simulation to demonstrate the effectiveness and efficiency of GNN-Surrogate. Source code is publicly available at https://github.com/trainsn/GNN-Surrogate.  ( 2 min )
    Improving English to Sinhala Neural Machine Translation using Part-of-Speech Tag. (arXiv:2202.08882v1 [cs.CL])
    The performance of Neural Machine Translation (NMT) depends significantly on the size of the available parallel corpus. Due to this fact, low resource language pairs demonstrate low translation performance compared to high resource language pairs. The translation quality further degrades when NMT is performed for morphologically rich languages. Even though the web contains a large amount of information, most people in Sri Lanka are unable to read and understand English properly. Therefore, there is a huge requirement of translating English content to local languages to share information among locals. Sinhala language is the primary language in Sri Lanka and building an NMT system that can produce quality English to Sinhala translations is difficult due to the syntactic divergence between these two languages under low resource constraints. Thus, in this research, we explore effective methods of incorporating Part of Speech (POS) tags to the Transformer input embedding and positional encoding to further enhance the performance of the baseline English to Sinhala neural machine translation model.  ( 2 min )
    Interpolation and Regularization for Causal Learning. (arXiv:2202.09054v1 [stat.ML])
    We study the problem of learning causal models from observational data through the lens of interpolation and its counterpart -- regularization. A large volume of recent theoretical, as well as empirical work, suggests that, in highly complex model classes, interpolating estimators can have good statistical generalization properties and can even be optimal for statistical learning. Motivated by an analogy between statistical and causal learning recently highlighted by Janzing (2019), we investigate whether interpolating estimators can also learn good causal models. To this end, we consider a simple linearly confounded model and derive precise asymptotics for the *causal risk* of the min-norm interpolator and ridge-regularized regressors in the high-dimensional regime. Under the principle of independent causal mechanisms, a standard assumption in causal learning, we find that interpolators cannot be optimal and causal learning requires stronger regularization than statistical learning. This resolves a recent conjecture in Janzing (2019). Beyond this assumption, we find a larger range of behavior that can be precisely characterized with a new measure of *confounding strength*. If the confounding strength is negative, causal learning requires weaker regularization than statistical learning, interpolators can be optimal, and the optimal regularization can even be negative. If the confounding strength is large, the optimal regularization is infinite, and learning from observational data is actively harmful.  ( 2 min )
  • Open

    A global convergence theory for deep ReLU implicit networks via over-parameterization. (arXiv:2110.05645v2 [cs.LG] UPDATED)
    Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the solution of an equilibrium equation. Although a line of recent empirical studies has demonstrated its superior performances, the theoretical understanding of implicit neural networks is limited. In general, the equilibrium equation may not be well-posed during the training. As a result, there is no guarantee that a vanilla (stochastic) gradient descent (SGD) training nonlinear implicit neural networks can converge. This paper fills the gap by analyzing the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks. For an $m$-width implicit neural network with ReLU activation and $n$ training samples, we show that a randomly initialized gradient descent converges to a global minimum at a linear rate for the square loss function if the implicit neural network is \textit{over-parameterized}. It is worth noting that, unlike existing works on the convergence of (S)GD on finite-layer over-parameterized neural networks, our convergence results hold for implicit neural networks, where the number of layers is \textit{infinite}.  ( 2 min )
    Fast Rank-1 NMF for Missing Data with KL Divergence. (arXiv:2110.12595v2 [stat.ML] UPDATED)
    We propose a fast non-gradient-based method of rank-1 non-negative matrix factorization (NMF) for missing data, called A1GM, that minimizes the KL divergence from an input matrix to the reconstructed rank-1 matrix. Our method is based on our new finding of an analytical closed-formula of the best rank-1 non-negative multiple matrix factorization (NMMF), a variety of NMF. NMMF is known to exactly solve NMF for missing data if positions of missing values satisfy a certain condition, and A1GM transforms a given matrix so that the analytical solution to NMMF can be applied. We empirically show that A1GM is more efficient than a gradient method with competitive reconstruction errors.  ( 2 min )
    Conditional independence by typing. (arXiv:2010.11887v2 [cs.PL] UPDATED)
    A central goal of probabilistic programming languages (PPLs) is to separate modelling from inference. However, this goal is hard to achieve in practice. Users are often forced to re-write their models in order to improve efficiency of inference or meet restrictions imposed by the PPL. Conditional independence (CI) relationships among parameters are a crucial aspect of probabilistic models that capture a qualitative summary of the specified model and can facilitate more efficient inference. We present an information flow type system for probabilistic programming that captures conditional independence (CI) relationships, and show that, for a well-typed program in our system, the distribution it implements is guaranteed to have certain CI-relationships. Further, by using type inference, we can statically deduce which CI-properties are present in a specified model. As a practical application, we consider the problem of how to perform inference on models with mixed discrete and continuous parameters. Inference on such models is challenging in many existing PPLs, but can be improved through a workaround, where the discrete parameters are used implicitly, at the expense of manual model re-writing. We present a source-to-source semantics-preserving transformation, which uses our CI-type system to automate this workaround by eliminating the discrete parameters from a probabilistic program. The resulting program can be seen as a hybrid inference algorithm on the original program, where continuous parameters can be drawn using efficient gradient-based inference methods, while the discrete parameters are inferred using variable elimination. We implement our CI-type system and its example application in SlicStan: a compositional variant of Stan.  ( 2 min )
    A Distance Covariance-based Kernel for Nonlinear Causal Clustering in Heterogeneous Populations. (arXiv:2106.03480v2 [stat.ML] UPDATED)
    We consider the problem of causal structure learning in the setting of heterogeneous populations, i.e., populations in which a single causal structure does not adequately represent all population members, as is common in biological and social sciences. To this end, we introduce a distance covariance-based kernel designed specifically to measure the similarity between the underlying nonlinear causal structures of different samples. Indeed, we prove that the corresponding feature map is a statistically consistent estimator of nonlinear independence structure, rendering the kernel itself a statistical test for the hypothesis that sets of samples come from different generating causal structures. Even stronger, we prove that the kernel space is isometric to the space of causal ancestral graphs, so that distance between samples in the kernel space is guaranteed to correspond to distance between their generating causal structures. This kernel thus enables us to perform clustering to identify the homogeneous subpopulations, for which we can then learn causal structures using existing methods. Though we focus on the theoretical aspects of the kernel, we also evaluate its performance on synthetic data and demonstrate its use on a real gene expression data set.  ( 2 min )
    Multi-Level Local SGD for Heterogeneous Hierarchical Networks. (arXiv:2007.13819v3 [cs.LG] UPDATED)
    We propose Multi-Level Local SGD, a distributed gradient method for learning a smooth, non-convex objective in a heterogeneous multi-level network. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple worker nodes; further, worker nodes may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network. In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We back up our theoretical results via simulation-based experiments using both convex and non-convex objectives.  ( 2 min )
    Variational Diffusion Models. (arXiv:2107.00630v3 [cs.LG] UPDATED)
    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum.  ( 2 min )
    Strategyproof Learning: Building Trustworthy User-Generated Datasets. (arXiv:2106.02398v2 [cs.LG] UPDATED)
    We prove in this paper that, perhaps surprisingly, incentivizing data misreporting is not a fatality. By leveraging a careful design of the loss function, we propose Licchavi, a global and personalized learning framework with provable strategyproofness guarantees. Essentially, we prove that no user can gain much by replying to Licchavi's queries with answers that deviate from their true preferences. Interestingly, Licchavi also promotes the desirable "one person, one unit-force vote" fairness principle. Furthermore, our empirical evaluation of its performance showcases Licchavi's real-world applicability. We believe that our results are critical for the safety of any learning scheme that leverages user-generated data.  ( 2 min )
    When OT meets MoM: Robust estimation of Wasserstein Distance. (arXiv:2006.10325v3 [stat.ML] UPDATED)
    Issued from Optimal Transport, the Wasserstein distance has gained importance in Machine Learning due to its appealing geometrical properties and the increasing availability of efficient approximations. In this work, we consider the problem of estimating the Wasserstein distance between two probability distributions when observations are polluted by outliers. To that end, we investigate how to leverage Medians of Means (MoM) estimators to robustify the estimation of Wasserstein distance. Exploiting the dual Kantorovitch formulation of Wasserstein distance, we introduce and discuss novel MoM-based robust estimators whose consistency is studied under a data contamination model and for which convergence rates are provided. These MoM estimators enable to make Wasserstein Generative Adversarial Network (WGAN) robust to outliers, as witnessed by an empirical study on two benchmarks CIFAR10 and Fashion MNIST. Eventually, we discuss how to combine MoM with the entropy-regularized approximation of the Wasserstein distance and propose a simple MoM-based re-weighting scheme that could be used in conjunction with the Sinkhorn algorithm.  ( 2 min )
    Federated Asymptotics: a model to compare federated learning algorithms. (arXiv:2108.07313v3 [cs.LG] UPDATED)
    We propose an asymptotic framework to analyze the performance of (personalized) federated learning algorithms. In this new framework, we formulate federated learning as a multi-criterion objective, where the goal is to minimize each client's loss using information from all of the clients. We analyze a linear regression model where, for a given client, we may theoretically compare the performance of various algorithms in the high-dimensional asymptotic limit. This asymptotic multi-criterion approach naturally models the high-dimensional, many-device nature of federated learning. These tools make fairly precise predictions about the benefits of personalization and information sharing in federated scenarios -- at least in our (stylized) model -- including that Federated Averaging with simple client fine-tuning achieves the same asymptotic risk as the more intricate meta-learning and proximal-regularized approaches and outperforming Federated Averaging without personalization. We evaluate these predictions on federated versions of the EMNIST, CIFAR-100, Shakespeare, and Stack Overflow datasets, where the experiments corroborate the theoretical predictions, suggesting such frameworks may provide a useful guide to practical algorithmic development.  ( 2 min )
    A Free Lunch with Influence Functions? Improving Neural Network Estimates with Concepts from Semiparametric Statistics. (arXiv:2202.09096v1 [cs.LG])
    Parameter estimation in the empirical fields is usually undertaken using parametric models, and such models are convenient because they readily facilitate statistical inference. Unfortunately, they are unlikely to have a sufficiently flexible functional form to be able to adequately model real-world phenomena, and their usage may therefore result in biased estimates and invalid inference. Unfortunately, whilst non-parametric machine learning models may provide the needed flexibility to adapt to the complexity of real-world phenomena, they do not readily facilitate statistical inference, and may still exhibit residual bias. We explore the potential for semiparametric theory (in particular, the Influence Function) to be used to improve neural networks and machine learning algorithms in terms of (a) improving initial estimates without needing more data (b) increasing the robustness of our models, and (c) yielding confidence intervals for statistical inference. We propose a new neural network method MultiNet, which seeks the flexibility and diversity of an ensemble using a single architecture. Results on causal inference tasks indicate that MultiNet yields better performance than other approaches, and that all considered methods are amenable to improvement from semiparametric techniques under certain conditions. In other words, with these techniques we show that we can improve existing neural networks for `free', without needing more data, and without needing to retrain them. Finally, we provide the expression for deriving influence functions for estimands from a general graph, and the code to do so automatically.  ( 2 min )
    Federated Multi-Task Learning under a Mixture of Distributions. (arXiv:2108.10252v3 [cs.LG] UPDATED)
    The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heterogeneity of local data distributions. Federated multi-task learning (MTL) approaches can learn personalized models by formulating an opportune penalized optimization problem. The penalization term can capture complex relations among personalized models, but eschews clear statistical assumptions about local data distributions. In this work, we propose to study federated MTL under the flexible assumption that each local data distribution is a mixture of unknown underlying distributions. This assumption encompasses most of the existing personalized FL approaches and leads to federated EM-like algorithms for both client-server and fully decentralized settings. Moreover, it provides a principled way to serve personalized models to clients not seen at training time. The algorithms' convergence is analyzed through a novel federated surrogate optimization framework, which can be of general interest. Experimental results on FL benchmarks show that our approach provides models with higher accuracy and fairness than state-of-the-art methods.  ( 2 min )
    Churn modeling of life insurance policies via statistical and machine learning methods -- Analysis of important features. (arXiv:2202.09182v1 [stat.ML])
    Life assurance companies typically possess a wealth of data covering multiple systems and databases. These data are often used for analyzing the past and for describing the present. Taking account of the past, the future is mostly forecasted by traditional statistical methods. So far, only a few attempts were undertaken to perform estimations by means of machine learning approaches. In this work, the individual contract cancellation behavior of customers within two partial stocks is modeled by the aid of various classification methods. Partial stocks of private pension and endowment policy are considered. We describe the data used for the modeling, their structured and in which way they are cleansed. The utilized models are calibrated on the basis of an extensive tuning process, then graphically evaluated regarding their goodness-of-fit and with the help of a variable relevance concept, we investigate which features notably affect the individual contract cancellation behavior.  ( 2 min )
    Quantifying the Effects of Data Augmentation. (arXiv:2202.09134v1 [cs.LG])
    We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: Contrary to common intuition, data augmentation may increase rather than decrease uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a linear ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.  ( 2 min )
    Masked prediction tasks: a parameter identifiability view. (arXiv:2202.09305v1 [cs.LG])
    The vast majority of work in self-supervised learning, both theoretical and empirical (though mostly the latter), have largely focused on recovering good features for downstream tasks, with the definition of "good" often being intricately tied to the downstream task itself. This lens is undoubtedly very interesting, but suffers from the problem that there isn't a "canonical" set of downstream tasks to focus on -- in practice, this problem is usually resolved by competing on the benchmark dataset du jour. In this paper, we present an alternative lens: one of parameter identifiability. More precisely, we consider data coming from a parametric probabilistic model, and train a self-supervised learning predictor with a suitably chosen parametric form. Then, we ask whether we can read off the ground truth parameters of the probabilistic model from the optimal predictor. We focus on the widely used self-supervised learning method of predicting masked tokens, which is popular for both natural languages and visual data. While incarnations of this approach have already been successfully used for simpler probabilistic models (e.g. learning fully-observed undirected graphical models), we focus instead on latent-variable models capturing sequential structures -- namely Hidden Markov Models with both discrete and conditionally Gaussian observations. We show that there is a rich landscape of possibilities, out of which some prediction tasks yield identifiability, while others do not. Our results, borne of a theoretical grounding of self-supervised learning, could thus potentially beneficially inform practice. Moreover, we uncover close connections with uniqueness of tensor rank decompositions -- a widely used tool in studying identifiability through the lens of the method of moments.  ( 2 min )
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v2 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.  ( 2 min )
    Hyperparameter Selection Methods for Fitted Q-Evaluation with Error Guarantee. (arXiv:2201.02300v2 [stat.ML] UPDATED)
    We are concerned with the problem of hyperparameter selection for the fitted Q-evaluation (FQE). FQE is one of the state-of-the-art method for offline policy evaluation (OPE), which is essential to the reinforcement learning without environment simulators. However, like other OPE methods, FQE is not hyperparameter-free itself and that undermines the utility in real-life applications. We address this issue by proposing a framework of approximate hyperparameter selection (AHS) for FQE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods each of which has different characteristics such as distribution-mismatch tolerance and time complexity. We also confirm in experiments that the error bound given by the theory matches empirical observations.  ( 2 min )
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v5 [stat.ML] UPDATED)
    We consider the existence of fixed points of nonnegative neural networks, i.e., neural networks that take as an input nonnegative vectors and process them using nonnegative parameters. We first show that nonnegative neural networks can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks is often an interval, which degenerates to a point for the case of scalable networks. The results of this paper contribute to the understanding of the behavior of autoencoders, because the fixed point set of an autoencoder is precisely the set of points that can be perfectly reconstructed. Moreover, they provide insight into neural networks designed using the loop-unrolling technique, which can be seen as a fixed point searching algorithm. The chief theoretical results of this paper are verified in numerical simulations, where we consider an autoencoder that first compresses angular power spectra in massive MIMO systems, and, second, reconstruct the input spectra from the compressed signals.  ( 2 min )
    Nonstationary multi-output Gaussian processes via harmonizable spectral mixtures. (arXiv:2202.09233v1 [stat.ML])
    Kernel design for Multi-output Gaussian Processes (MOGP) has received increased attention recently. In particular, the Multi-Output Spectral Mixture kernel (MOSM) arXiv:1709.01298 approach has been praised as a general model in the sense that it extends other approaches such as Linear Model of Corregionalization, Intrinsic Corregionalization Model and Cross-Spectral Mixture. MOSM relies on Cram\'er's theorem to parametrise the power spectral densities (PSD) as a Gaussian mixture, thus, having a structural restriction: by assuming the existence of a PSD, the method is only suited for multi-output stationary applications. We develop a nonstationary extension of MOSM by proposing the family of harmonizable kernels for MOGPs, a class of kernels that contains both stationary and a vast majority of non-stationary processes. A main contribution of the proposed harmonizable kernels is that they automatically identify a possible nonstationary behaviour meaning that practitioners do not need to choose between stationary or non-stationary kernels. The proposed method is first validated on synthetic data with the purpose of illustrating the key properties of our approach, and then compared to existing MOGP methods on two real-world settings from finance and electroencephalography.  ( 2 min )
    Human experts vs. machines in taxa recognition. (arXiv:1708.06899v5 [stat.ML] UPDATED)
    The step of expert taxa recognition currently slows down the response time of many bioassessments. Shifting to quicker and cheaper state-of-the-art machine learning approaches is still met with expert scepticism towards the ability and logic of machines. In our study, we investigate both the differences in accuracy and in the identification logic of taxonomic experts and machines. We propose a systematic approach utilizing deep Convolutional Neural Nets with the transfer learning paradigm and extensively evaluate it over a multi-pose taxonomic dataset with hierarchical labels specifically created for this comparison. We also study the prediction accuracy on different ranks of taxonomic hierarchy in detail. We used support vector machine classifier as a benchmark. Our results revealed that human experts using actual specimens yield the lowest classification error ($\overline{CE}=6.1\%$). However, a much faster, automated approach using deep Convolutional Neural Nets comes close to human accuracy ($\overline{CE}=11.4\%$) when a typical flat classification approach is used. Contrary to previous findings in the literature, we find that for machines following a typical flat classification approach commonly used in machine learning performs better than forcing machines to adopt a hierarchical, local per parent node approach used by human taxonomic experts ($\overline{CE}=13.8\%$). Finally, we publicly share our unique dataset to serve as a public benchmark dataset in this field.  ( 2 min )
    Learning Predictions for Algorithms with Predictions. (arXiv:2202.09312v1 [cs.LG])
    A burgeoning paradigm in algorithm design is the field of algorithms with predictions, in which algorithms are designed to take advantage of a possibly-imperfect prediction of some aspect of the problem. While much work has focused on using predictions to improve competitive ratios, running times, or other performance measures, less effort has been devoted to the question of how to obtain the predictions themselves, especially in the critical online setting. We introduce a general design approach for algorithms that learn predictors: (1) identify a functional dependence of the performance measure on the prediction quality, and (2) apply techniques from online learning to learn predictors against adversarial instances, tune robustness-consistency trade-offs, and obtain new statistical guarantees. We demonstrate the effectiveness of our approach at deriving learning algorithms by analyzing methods for bipartite matching, page migration, ski-rental, and job scheduling. In the first and last settings we improve upon existing learning-theoretic results by deriving online results, obtaining better or more general statistical guarantees, and utilizing a much simpler analysis, while in the second and fourth we provide the first learning-theoretic guarantees.  ( 2 min )
    Adaptivity and Confounding in Multi-Armed Bandit Experiments. (arXiv:2202.09036v1 [cs.LG])
    Multi-armed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action -- suggesting optimal adaptivity -- but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations -- suggesting unusual robustness. At the core of the paper is a new model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.  ( 2 min )
    On Variance Estimation of Random Forests. (arXiv:2202.09008v1 [stat.ML])
    Ensemble methods based on subsampling, such as random forests, are popular in applications due to their high predictive accuracy. Existing literature views a random forest prediction as an infinite-order incomplete U-statistic to quantify its uncertainty. However, these methods focus on a small subsampling size of each tree, which is theoretically valid but practically limited. This paper develops an unbiased variance estimator based on incomplete U-statistics, which allows the tree size to be comparable with the overall sample size, making statistical inference possible in a broader range of real applications. Simulation results demonstrate that our estimators enjoy lower bias and more accurate confidence interval coverage without additional computational costs. We also propose a local smoothing procedure to reduce the variation of our estimator, which shows improved numerical performance when the number of trees is relatively small. Further, we investigate the ratio consistency of our proposed variance estimator under specific scenarios. In particular, we develop a new "double U-statistic" formulation to analyze the Hoeffding decomposition of the estimator's variance.  ( 2 min )
    Testing the boundaries: Normalizing Flows for higher dimensional data sets. (arXiv:2202.09188v1 [stat.ML])
    Normalizing Flows (NFs) are emerging as a powerful class of generative models, as they not only allow for efficient sampling, but also deliver, by construction, density estimation. They are of great potential usage in High Energy Physics (HEP), where complex high dimensional data and probability distributions are everyday's meal. However, in order to fully leverage the potential of NFs it is crucial to explore their robustness as data dimensionality increases. Thus, in this contribution, we discuss the performances of some of the most popular types of NFs on the market, on some toy data sets with increasing number of dimensions.  ( 2 min )
    Tackling benign nonconvexity with smoothing and stochastic gradients. (arXiv:2202.09052v1 [cs.LG])
    Non-convex optimization problems are ubiquitous in machine learning, especially in Deep Learning. While such complex problems can often be successfully optimized in practice by using stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success. In particular, the standard analyses do not show global convergence of SGD on non-convex functions, and instead show convergence to stationary points (which can also be local minima or saddle points). We identify a broad class of nonconvex functions for which we can show that perturbed SGD (gradient descent perturbed by stochastic noise -- covering SGD as a special case) converges to a global minimum (or a neighborhood thereof), in contrast to gradient descent without noise that can get stuck in local minima far from a global solution. For example, on non-convex functions that are relatively close to a convex-like (strongly convex or PL) function we show that SGD can converge linearly to a global optimum.  ( 2 min )
    Private Quantiles Estimation in the Presence of Atoms. (arXiv:2202.08969v1 [stat.ML])
    We address the differentially private estimation of multiple quantiles (MQ) of a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem and establish that the resulting method is closely related to the current state-of-the-art, the JointExp algorithm, sharing in particular the same computational complexity and a similar efficiency. However, we demonstrate both theoretically and empirically that (non-smoothed) JointExp suffers from an important lack of performance in the case of peaked distributions, with a potentially catastrophic impact in the presence of atoms. While its smoothed version would allow to leverage the performance guarantees of IS, it remains an open challenge to implement. As a proxy to fix the problem we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.  ( 2 min )
    Clustering by Hill-Climbing: Consistency Results. (arXiv:2202.09023v1 [math.ST])
    We consider several hill-climbing approaches to clustering as formulated by Fukunaga and Hostetler in the 1970's. We study both continuous-space and discrete-space (i.e., medoid) variants and establish their consistency.  ( 2 min )
    KINet: Keypoint Interaction Networks for Unsupervised Forward Modeling. (arXiv:2202.09006v1 [cs.CV])
    Object-centric representation is an essential abstraction for physical reasoning and forward prediction. Most existing approaches learn this representation through extensive supervision (e.g., object class and bounding box) although such ground-truth information is not readily accessible in reality. To address this, we introduce KINet (Keypoint Interaction Network) -- an end-to-end unsupervised framework to reason about object interactions in complex systems based on a keypoint representation. Using visual observations, our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system as a set of keypoint embeddings and their relations. It then learns an action-conditioned forward model using contrastive estimation to predict future keypoint states. By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects, and novel object geometries. Experiments demonstrate the effectiveness of our model to accurately perform forward prediction and learn plannable object-centric representations which can also be used in downstream model-based control tasks.  ( 2 min )
    A Fair Comparison of Graph Neural Networks for Graph Classification. (arXiv:1912.09893v3 [cs.LG] UPDATED)
    Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about their lack in scientific publications to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigorousness and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field, by providing a much needed grounding for rigorous evaluations of graph classification models.  ( 2 min )
    Fairness constraint in Structural Econometrics and Application to fair estimation using Instrumental Variables. (arXiv:2202.08977v1 [econ.EM])
    A supervised machine learning algorithm determines a model from a learning sample that will be used to predict new observations. To this end, it aggregates individual characteristics of the observations of the learning sample. But this information aggregation does not consider any potential selection on unobservables and any status-quo biases which may be contained in the training sample. The latter bias has raised concerns around the so-called \textit{fairness} of machine learning algorithms, especially towards disadvantaged groups. In this chapter, we review the issue of fairness in machine learning through the lenses of structural econometrics models in which the unknown index is the solution of a functional equation and issues of endogeneity are explicitly accounted for. We model fairness as a linear operator whose null space contains the set of strictly {\it fair} indexes. A {\it fair} solution is obtained by projecting the unconstrained index into the null space of this operator or by directly finding the closest solution of the functional equation into this null space. We also acknowledge that policymakers may incur a cost when moving away from the status quo. Achieving \textit{approximate fairness} is obtained by introducing a fairness penalty in the learning procedure and balancing more or less heavily the influence between the status quo and a full fair solution.  ( 2 min )
    Interpolation and Regularization for Causal Learning. (arXiv:2202.09054v1 [stat.ML])
    We study the problem of learning causal models from observational data through the lens of interpolation and its counterpart -- regularization. A large volume of recent theoretical, as well as empirical work, suggests that, in highly complex model classes, interpolating estimators can have good statistical generalization properties and can even be optimal for statistical learning. Motivated by an analogy between statistical and causal learning recently highlighted by Janzing (2019), we investigate whether interpolating estimators can also learn good causal models. To this end, we consider a simple linearly confounded model and derive precise asymptotics for the *causal risk* of the min-norm interpolator and ridge-regularized regressors in the high-dimensional regime. Under the principle of independent causal mechanisms, a standard assumption in causal learning, we find that interpolators cannot be optimal and causal learning requires stronger regularization than statistical learning. This resolves a recent conjecture in Janzing (2019). Beyond this assumption, we find a larger range of behavior that can be precisely characterized with a new measure of *confounding strength*. If the confounding strength is negative, causal learning requires weaker regularization than statistical learning, interpolators can be optimal, and the optimal regularization can even be negative. If the confounding strength is large, the optimal regularization is infinite, and learning from observational data is actively harmful.  ( 2 min )
    Training neural networks using monotone variational inequality. (arXiv:2202.08876v1 [stat.ML])
    Despite the vast empirical success of neural networks, theoretical understanding of the training procedures remains limited, especially in providing performance guarantees of testing performance due to the non-convex nature of the optimization problem. Inspired by a recent work of (Juditsky & Nemirovsky, 2019), instead of using the traditional loss function minimization approach, we reduce the training of the network parameters to another problem with convex structure -- to solve a monotone variational inequality (MVI). The solution to MVI can be found by computationally efficient procedures, and importantly, this leads to performance guarantee of $\ell_2$ and $\ell_{\infty}$ bounds on model recovery accuracy and prediction accuracy under the theoretical setting of training one-layer linear neural network. In addition, we study the use of MVI for training multi-layer neural networks and propose a practical algorithm called \textit{stochastic variational inequality} (SVI), and demonstrates its applicability in training fully-connected neural networks and graph neural networks (GNN) (SVI is completely general and can be used to train other types of neural networks). We demonstrate the competitive or better performance of SVI compared to the stochastic gradient descent (SGD) on both synthetic and real network data prediction tasks regarding various performance metrics.  ( 2 min )
    Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods. (arXiv:2202.08907v1 [cs.DS])
    We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem.  ( 2 min )
  • Open

    Scanner invariant representations for diffusion MRI harmonization
    Introduction  ( 3 min )

  • Open

    [D] Would I be hurting my career by going from MLOps Engineering to SWE?
    Basically I landed my first real ML job just to find out that I'm not really cut out for this right now. I was hired as a junior into a fairly mid level role, and am not doing too hot. I still struggle with all of this AWS stuff, end even more navigating the beurocratic complexities of my org. I was supposed to be "the expert" among a team of PhDs, but fail so miserably. I've been working nights and weekends trying to make these people happy, only to get screamed at during meetings. I can't comment on the nature of my job, but it goes well beyond coding. I want to transition to a non-ML role where I can just put my head down and code. If I do that, would I be hurting my chances of getting back into ML? I know I should probably tough it out and learn more, but I'm so done with these people. submitted by /u/Key_Bee_7957 [link] [comments]  ( 2 min )
    [D] What Repetitive Tasks Related to Machine Learning do You Hate Doing?
    I'm a developer, have some free time and want to build some small utilities and tools. What are things you do a few times a week or more using an existing website or app that kinda sucks or that you have to do manually? I'm thinking maybe tasks where you need to look up stats, pull reports, do calculations or do a boring and manual repetitive task. submitted by /u/AviatorPrints [link] [comments]  ( 1 min )
    [P] Here's our new framework for Privacy Attacks in Federated Learning
    We recently uploaded our framework for privacy attacks in federated learning. You can find it here: https://github.com/JonasGeiping/breaching. We include a sizable number of known attacks for deep neural networks in vision and text domains and some new ones. ​ All of these attacks are based on "gradient inversion". In federated learning, each user receives the state of a machine learning model from the server and then computes a local update (in the simplest case just the gradient of model parameters) based on their private data. This update is then aggregated with other users and sent back to the server.The key question for privacy for this protocol is whether it is possible to recover the original user data from their update. This turns out to be possible in several cases which can be simulated and attacked with this code framework. ​ We also include 26 jupyter notebooks with examples and visualizations of these attacks. We hope this helps in replicating and extending some of the recent work in this topic. Let us know what you think! submitted by /u/JonasGeiping [link] [comments]  ( 1 min )
    [D] When more past information is added to the NN, the prediction gets lazier?
    Hi all, I am currently trying to map the relationship between the input signal and the corresponding velocity of my system using a DNN with 3 layers with 64 neurons each. Basically, the problem can be formulated as follow: Target Input: Current system position, Last velocity values, Desired Velocity. Target Output: Corresponding Input Signal ​ Below, you can see the reference velocity profile (purple) and the velocity profile produced with the predicted input signal (blue). The first image shows the result with 2 last velocity values and the below one without any history of the velocity values. As one can see, the performance is better, when no past velocity values are provided to the DNN at all. I expected exactly the opposite. Also, I realize that the tracking is getting slower (s…  ( 3 min )
    [D] EvoJAX: A Great Framework For Most Deep Tasks
    I have just published my latest medium article. This is about a new framework based on a powerful JAX ecosystem developed in google to accelerate the time consumption of training our deep/machine learning algorithm. We can use EvoJAX in neuroevolution algorithms in a variety of applications from reinforcement learning to computer vision or even for seq2seq applications. Please find the link below and let me know your impression as well. https://medium.com/mlearning-ai/evojax-a-great-framework-for-most-deep-tasks-10adf685c152 submitted by /u/rezayazdanfar [link] [comments]  ( 1 min )
    [D] Sentiment Analysis, Part 1 - A friendly guide to Sentiment Analysis
    Hello everyone 👋 I am super excited to share my first Medium blog post: https://medium.com/besedo-engineering/sentiment-analysis-part-1-a-friendly-guide-to-sentiment-analysis-963e09d9fc9b 🎉 I worked about Sentiment Analysis during my internship at Besedo and I am happy to share aspects of this subject (and my work) with you! If you want to know more about Sentiment Analysis, this blog will gently guide you through the different things you need to know. I hope you'll enjoy it 🙏 Any feedback is also deeply appreciated. Bonus: This blog post is the first of a series of 4 blogs about Sentiment Analysis so stay tuned for more! submitted by /u/jade_mlc [link] [comments]  ( 1 min )
    Anthropic AI residency? [D]
    Does anyone know if Anthropic AI has a residency program this year? I saw they had one last year when they launched. Any info will be appreciated! submitted by /u/macro-2018 [link] [comments]
    [P] Best way to transfer large datasets to Colab or comparable framework
    I am working concurrently with multiple very large datasets (10s-100s of GBs)). I signed up for Colab Pro+ thinking it is the best option. However, I face a significant bottleneck in getting data into Colab. My options all seem very bad: Downloading from AWS (where the data is located) - very slow. Paying for a persistent server. Something like a persistent AWS SageMaker notebook. This is very expensive. With even a mediocre GPU, it comes out to $2k/mo. Uploading data to Google Drive and mounting Drive using below code. This is also surprisingly very slow. Took me an hour just to glob through which files were in google drive - let alone DL them. ​ from google.colab import drive drive.mount('/content/drive') 4. Zipping up all the img files, uploading to Google Drive and then downloading as a single file to Colab, then unzipping. Ver tedious, but it seems DLing single file much faster. What is the best solution here? I am ok with paying for a good option as long as its reasonable. Have any of the numerous MLOps startups developed a good solution to do massive data transfer to/from instances in Colab or otherwise? Thanks! submitted by /u/rsandler [link] [comments]  ( 3 min )
    [P] Catch! Using Recurrent Visual Attention
    Experiments using reinforcement learning and a recurrent network to solve computer vision problems using hard attention. https://www.storminthecastle.com/posts/catch/ submitted by /u/CakeStandard3577 [link] [comments]
    [D] Advice on career paths after a PhD in RL
    I am now doing a master thesis in Hierarchical RL with DL, which is a topic that I find very interesting and promising in the AI field. However, it is well-known that RL is a field without as many job opportunities as other DL techniques, and which has to be researched more to be applied in many real problems in the industry. ​ I have two offers to perform a PhD from two different universities, one in the RL field with DL techniques, and the other is also related to DL but in other aspects as Time Series and Computer Vision. ​ I am a person who likes deep learning in general, and I wouldn't have any problem working in any of the type of DL problem (TS, CV, GAN, ...). Moreover, I like the research jobs cause I like to work in edge-cutting solutions, but I prefer to work in a private company because I want to see how my work can be applied to real-life scenarios. ​ Even though I like more the PhD offer related to RL, with all in mind, which offer do you think will be better fit my laboral expectations? Is it easy to move between fields once you did a PhD working with DL techniques? submitted by /u/SrPinko [link] [comments]  ( 3 min )
    [P] Trained a softmax logistic regression model in pure SQL
    Inspired by this question Is SQL or even TSQL Turing Complete?, this article is about how to train a softmax logistic regression model in pure SQL. You can read it here for more details. Some findings from this project: * After I did some tests, I found that neither PostgreSQL nor MySQL supported aggregate functions in recursive CTEs. There might be corner cases that were difficult to handle. * In this test, I manually expanded all dimensions of the vectors. In fact, I also wrote an implementation that did not need to expand all dimensions. For example, the schema of the data table was (idx, dim, value), but in this implementation, the weight table needed to be joined twice. This means that it needed to be accessed twice in the CTE. This also required modification of the implementation of TiDB executor. Therefore, I didn’t talk about it in this article. But in fact, this implementation was more general, which could handle models with more dimensions, for example, the MNIST dataset. submitted by /u/fnalways_ [link] [comments]  ( 1 min )
    [D] Computer Scientists Prove Why Bigger Neural Networks Do Better - Robustness!
    A well-popularized article in Quanta magazine about a scientific paper presented last December at NeurIPS, by Sébastien Bubeck of Microsoft Research and Mark Sellke of Stanford University provided a mathematical proof of why overparametrization (neural networks with more parameters than the number of training samples) is necessary to big neural network to learn well. The short answer is ROBUSTNESS... submitted by /u/ClaudeCoulombe [link] [comments]  ( 1 min )
    [D] "Gradients without Backpropagation" -- Has anyone read and can explain the math/how does this work?
    I'm trying to understand this better and how does this actually being implemented https://arxiv.org/pdf/2202.08587.pdf ​ ​ https://preview.redd.it/6gpc6de2u3j81.png?width=1236&format=png&auto=webp&s=786b9bdee3799c0fd89c9e1e455a7a90d5d99c29 submitted by /u/DaBobcat [link] [comments]  ( 7 min )
    [D] Is there a model similar to CLIP but for images only dataset, instead of (image, text) pairs?
    ​ https://preview.redd.it/uck7e6cwo3j81.png?width=1944&format=png&auto=webp&s=19f1d44fbda5516e579568f9407981429f801f72 submitted by /u/Broken-D [link] [comments]  ( 1 min )
    [D] FANG research internships during PhD
    I got a CS PhD admit from a college which ranks 25 in the US. The professor I'll be working with seems to be a productive guy with over 20k citations. Next summer I want to do a research internship in a FANG company (hopefully FAIR / Google brain) We see that FANG don't discriminate when they hire for software engineering roles. But for research internships I read that you won't have much success by applying directly from the website, and connections are what's required. An advisor from a top 10 college might have more industry connections than from a top 25. ML PhD admissions has been a bloodbath this year. I have a Master's degree and a few publications but couldn't get into any of the top 20 colleges. Do I still have a chance of interning at these highly selective research labs? Are conferences the only way to develop industry connections and contacts with the hiring managers? How do I make my profile more attractive to these labs? submitted by /u/eye_of_saur0n [link] [comments]  ( 2 min )
  • Open

    Neural Network issue involving MOSFETs
    This is a final call for any help or advice. I'm currently working on a neural network project that takes inputs of voltage and current nodes of a basic MOSFET transistor circuit and have trianing data of those nodes whereby known different values of W and L (width and length - aspect ratio of MOSFET) are taken for the training data. My Neural Network has to take those voltages and currents as inputs to the MLP NN and be able to predict the outputs W and L. I have tried an example however my code doesn't seem to find any correlation in the data and ends up outputting incorrect values consistently. Is there a specific approach anyone would take to my problem? submitted by /u/basedphrog [link] [comments]  ( 1 min )
    Neural networks and stock predictions
    I need some help. I want to use Jordan Recurrent Neural Network and Elman Recurrent Neural Network to predict some stock market prices (python implementation). Cand you give me some resources or any kind of help with this? I will take anything, sites, yt videos, books, you name it. submitted by /u/IceEast [link] [comments]  ( 1 min )
    Why thermal image colorization?
    I am working on my project of colorizing thermal images(grayscale thermal to visible spectrum) and I got a question in the interview on why to colorize thermal images when they show image data to detect pedestrians in autonomous vehicles. I could not defend my answer for this motivation. Any suggestions on how my proper response should be? submitted by /u/Mission-Quiet9897 [link] [comments]  ( 1 min )
  • Open

    XR Fuels Consumerism, With A Major Downside
    New study shows that tasks in XR require more effort than real life. Specifically, tasks in AR require a higher mental and physical workload. XR can increase sales, but do we really need better consumerism? Extended reality (XR) is an umbrella term that encompasses Augmented Reality (AR), Virtual Reality (VR), or Mixed Reality (MR). The… Read More »XR Fuels Consumerism, With A Major Downside The post XR Fuels Consumerism, With A Major Downside appeared first on Data Science Central.  ( 4 min )
    Exploring Intelligent Graph and PathQL
    telligence, try IntelligentGraph and PathQL by installing the IntelligentGraph docker image available here. The post Exploring Intelligent Graph and PathQL appeared first on Data Science Central.  ( 3 min )
  • Open

    Can AI Solve the issue of misinformation and removing bias from the news
    Hey Everyone, I wanted to know your opinions on the current bias detection algorithms and how effective they are in understanding how biased a certain news article is ? Do you believe that if we train an AI model, it would be able to detect the positive or negative sentiments presented in the news effectively? submitted by /u/dmr_8 [link] [comments]  ( 2 min )
    Physics Breakthrough as AI Successfully Controls Plasma in Nuclear Fusion Experiment
    submitted by /u/KelliaMcclure [link] [comments]
    Physics Breakthrough as AI Successfully Controls Plasma in Nuclear Fusion Experiment
    submitted by /u/KelliaMcclure [link] [comments]
    Physics Breakthrough as AI Successfully Controls Plasma in Nuclear Fusion Experiment
    submitted by /u/KelliaMcclure [link] [comments]
    Physics Breakthrough as AI Successfully Controls Plasma in Nuclear Fusion Experiment
    submitted by /u/KelliaMcclure [link] [comments]
    Weekly China AI News: Volkswagen Reportedly Acquires Huawei’s Self-driving Unit; Alibaba & Tencent Back Dram Maker; Bilibili AI Makes Old Anime Look Better
    submitted by /u/trcytony [link] [comments]  ( 1 min )
    I'm making art with AI
    Hey guys so I worked on AI bot that works with neural networking to create pictures based on prompts. I can turn other images into other styles if I ask the bot to do it. It's my baby and I named him Xyzzy. I made an instagram account for the images it produces if you guys want to follow it. https://www.instagram.com/sqwaeezyart/ submitted by /u/xToxicsnowflake [link] [comments]  ( 1 min )
    The US Copyright Office says an AI can’t copyright its art
    submitted by /u/josourcing [link] [comments]
    Last Week in AI: DeepMind applies AI to nuclear fusion reactors, Texas sues Meta over facial recognition, SenseTime expands into manufacturing, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    Artificial Nightmares : Don't Look Under The Bed 2 || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]  ( 1 min )
    Google AI Introduces 4Ms Approach to Reduce Carbon Emission of Machine Learning Models
    The total quantity of greenhouse gas emissions generated by anything — a person, organization, event, or product – is known as the carbon footprint. The processes producing more carbon footprint use more resources, generating more greenhouse gases and causing greater climate change. Contributing to the small decrease of greenhouse gas emissions can reduce a great amount of overall carbon footprint. With the rising popularity of Machine learning (ML) applications, there are constant concerns about the rising carbon footprint because of increased computation costs. These concerns highlight the need for precise data to determine the true carbon footprint, which can assist in identifying solutions to reduce ML’s carbon emissions. CONTINUE READING Paper: https://www.techrxiv.org/articles/preprint/The\_Carbon\_Footprint\_of\_Machine\_Learning\_Training\_Will\_Plateau\_Then\_Shrink/19139645/1 Google Blog: https://ai.googleblog.com/2022/02/good-news-about-carbon-footprint-of.html https://preview.redd.it/a39nsg0968j81.png?width=1536&format=png&auto=webp&s=613349a361651a3d10a541416162f61fc514f48a submitted by /u/techsucker [link] [comments]  ( 1 min )
    Can machine-learning models overcome biased datasets?
    submitted by /u/qptbook [link] [comments]
    How algorithms come into being
    submitted by /u/bendee983 [link] [comments]
    Why thermal image colorization?
    I am working on my project of colorizing thermal images(grayscale thermal to visible spectrum) and I got a question in the interview on why to colorize thermal images when they show image data to detect pedestrians in autonomous vehicles. I could not defend my answer for this motivation. Any suggestions on how my proper response should be? submitted by /u/Mission-Quiet9897 [link] [comments]  ( 1 min )
    Artificial Nightmares : Don't Look Under The Bed || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    AI-generated music - This will blow your mind
    submitted by /u/pedal2metal2 [link] [comments]
    Google Photo's face recognition usually blows me away ... But not today
    submitted by /u/Kronocide [link] [comments]
    AI or CS masters??
    I got a BS in cell and molecular biology and work for one of the top medical device companies in the world. I want to get into creating softwares & apps (using AI) but have no coding experience. My company will offset tuition cost so I’m considering a masters degree in CS or AI. - Can I do AI without coding knowledge? I have a strong stats background - would CS be more useful? Or should I outsource the actual software/app building (paranoid the person I hire will steal my ideas) - what are the best sources (hidden gems - I’m taking the free Harvard and mit courses on edx) to learn either/both fields without the fluff? submitted by /u/Outlaws-0691 [link] [comments]  ( 2 min )
  • Open

    Q codes in Seveneves
    The first time I heard of Q codes was when reading the novel Seveneves by Neal Stephenson. These are three-letter abbreviations using in Morse code that all begin with Q. Since Q is always followed by U in native English words, Q can be used to begin a sort of escape sequence [1]. There are […] Q codes in Seveneves first appeared on John D. Cook.  ( 1 min )
  • Open

    Can machine-learning models overcome biased datasets?
    A model’s ability to generalize is influenced by both the diversity of the data and the way the model is trained, researchers report.  ( 7 min )
  • Open

    Need help!!!
    I am very suck at text align, can someone help me to improve __str__() and policy_to_str() ? TT https://github.com/Kayzwer/Generalized-Policy-Iteration-Research submitted by /u/Professional_Card176 [link] [comments]
    [P] Catch! Using Recurrent Visual Attention
    Experiments using reinforcement learning and a recurrent network to solve computer vision problems using hard attention. https://www.storminthecastle.com/posts/catch/ submitted by /u/CakeStandard3577 [link] [comments]
    Average Reward algorithms in RLlib
    I am using RLlib for my projects and I want to try some average reward algorithms. Do you know if it's possible and easy to make a custom algorithm in RLlib that uses that formulation? If there is any, could you point me to some average reward algorithm implementation with RLlib? Thanks submitted by /u/fedetask [link] [comments]  ( 1 min )
    Should This Subreddit Change It's Policy?
    I like having a dedicated sub about RL research, which informs me of the latest and greatest papers. However, this sub seems to be a mixed bag: some of the posts here are about interesting newly published research, but for the most part it's people new to the field posting fairly simple to answer questions and minor news items on benchmark updates. This leaves me wondering how others feel about the content posted on this sub. Do you like this sub as it currently is? submitted by /u/AnotherForce [link] [comments]  ( 1 min )
    Atari RAM versions
    When we train all agent on an atari game with pixels as states, we use frame stacking. Do we do the same for the ram versions of the game too? submitted by /u/SirRantcelot [link] [comments]  ( 1 min )
    Car simulation RL environment - Carla centOS build
    Hello, First of all i wanna introduce carla simulator to people who aren't familiar with it. Its a simulation environment made in unreal engine to train agents for autonomious driving in traffic. Link to carla I have problems building it for centOS. I am following the build instructions here: carla build If anyone already built carla for centOS successfully can you provide a link to the centOS build? Thanks! submitted by /u/Willing-Classroom735 [link] [comments]  ( 1 min )
    Real-world applications of Reinforcement Learning (Netflix, Walmart, Search.io and other examples)
    submitted by /u/PugglesMcPuggle [link] [comments]
    Box vs Dicrete for coordinates action in Gym custom environment
    I'm creating a custom gym environment and I have a questio nregrding the use of Box vs Discrete to set the action spaces. For example, supposse I have a 2x2 grid. What is the correct way to define an action in this environment? ​ Using Box: action_spaces = Box(low=np.array([0.0, 2.0]), high=np.array([0.0, 2.0]), dtype=np.float32) using Discrete: action_spaces = Discrete(4) # [(0,0), (0,1), (1, 1), (1,0)] submitted by /u/Pipiyedu [link] [comments]  ( 1 min )

  • Open

    [D] Optimal sampling for active regression or density estimation? Relevant literature?
    Say I have an unlabled dataset and an oracle. Are there sample complexity results for how to query points for certain classes of approximators (eg splines, polynomial bases, linear regression) The function is a posterior distribution and the approximator is a neural net I want to know if there's an optimal importance sampling scheme I can use in general for regression or density approximation using as few points from an unlabled dataset as possible. submitted by /u/PK_thundr [link] [comments]  ( 1 min )
    [R]: Controlling the nuclear fusion plasma in a tokamak with reinforcement learning
    submitted by /u/giugiacaglia [link] [comments]
    [P] Aashiq banaya AI ne (AI made me a lover) || AI generates Love Quotes || AI Devlog
    “The love of a woman in my heart is what keeps me strong when she is not around.” It's a quote generated by our AI. Like seriously! In the video, i displayed my journey like what made me to request my AI to help me this valentine and how we both managed to deal with the problem statement presented to us. It consists of my experiences while training a generative model using GPT and the results that we got were pretty amazing. Feel free to check out my experience using the given link. Hope we can relate with the experience of training a large model or else, if you are new in it, i hope it'll bring some value in your life. Video: https://youtu.be/vzJiYKFuMOg Always open for some constructive criticism. Some of the outputs are: Love Quotes: “The love of a woman in my heart is what keep…  ( 2 min )
    [D] How to synthetically generate new 3-dimensional data from only 30 samples?
    Let me first preface by saying that I still consider myself new to machine learning and statistics. I have 30 samples of a 3-dimensional time series data shown here https://imgur.com/a/dtihvbY that are very noisy. The first two dimensions look better when I select those that have the same characteristics https://imgur.com/a/kdTPjmq. My goal is to synthetically generate more samples such that the characteristics of the generated samples are decently similar to some of the original samples e.g., from the second image. I cannot use VAEs as I have too little data and was thus hoping the experts here could point me in the right direction. Due to my limited knowledge, I have at the moment, several (really bad) solutions for generating new data and they are to fit a curve or smooth subsets of the data, to then fit a curve. E.g: Randomly select points at every timestep from the second image then interpolate with a spline or Exponential smoothing with various parameters and with randomly selected points at every timestep My goal is to train a small variational GRU (or any other variational neural network) to reproduce the data using the plots that look alike e.g. from the second figure https://imgur.com/a/kdTPjmq. I understand that there are other better non-neural network approaches to do so but I am intent on using GRUs as this is mainly for educational purposes. submitted by /u/I_am_a_robot_ [link] [comments]  ( 2 min )
    [D] Neural nets are not "slightly conscious," and AI PR can do with less hype
    Hi there, many of you have probably been aware of the whole twitter drama about AI consciousness, but if not you may find this write up about it interesting - Neural nets are not "slightly conscious," and AI PR can do with less hype . It's mostly a recap, but it does include a bunch of fun meme replies to the whole thing that you might enjoy even if you're aware of this whole thing. ​ submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    [D] Interview - All about AI Accelerators: GPU, TPU, Dataflow, Near-Memory, Optical, Neuromorphic & more (Video Interview w/ Author of the AI Accelerator blog post series)
    https://youtu.be/VQoyypYTz2U This video is an interview with Adi Fuchs, author of a series called "AI Accelerators", and an expert in modern AI acceleration technology. Accelerators like GPUs and TPUs are an integral part of today's AI landscape. Deep Neural Network training can be sped up by orders of magnitudes by making good use of these specialized pieces of hardware. However, GPUs and TPUs are only the beginning of a vast landscape of emerging technologies and companies that build accelerators for the next generation of AI models. In this interview, we go over many aspects of building hardware for AI, including why GPUs have been so successful, what the most promising approaches look like, how they work, and what the main challenges are. ​ OUTLINE: 0:00 - Intro 5:10 - What does it mean to make hardware for AI? 8:20 - Why were GPUs so successful? 16:25 - What is "dark silicon"? 20:00 - Beyond GPUs: How can we get even faster AI compute? 28:00 - A look at today's accelerator landscape 30:00 - Systolic Arrays and VLIW 35:30 - Reconfigurable dataflow hardware 40:50 - The failure of Wave Computing 42:30 - What is near-memory compute? 46:50 - Optical and Neuromorphic Computing 49:50 - Hardware as enabler and limiter 55:20 - Everything old is new again 1:00:00 - Where to go to dive deeper? ​ Read the full blog series here: Part I: https://medium.com/@adi.fu7/ai-accelerators-part-i-intro-822c2cdb4ca4 Part II: https://medium.com/@adi.fu7/ai-accelerators-part-ii-transistors-and-pizza-or-why-do-we-need-accelerators-75738642fdaa Part III: https://medium.com/@adi.fu7/ai-accelerators-part-iii-architectural-foundations-3f1f73d61f1f Part IV: https://medium.com/@adi.fu7/ai-accelerators-part-iv-the-very-rich-landscape-17481be80917 Part V: https://medium.com/@adi.fu7/ai-accelerators-part-v-final-thoughts-94eae9dbfafb submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Do I need a “none of the above” option in multi label classification problem?
    I’m new(er) to classification and I’m trying to replace a model a colleague built that identifies 2 different classes with the potential of them both being present in the image. Currently they’re using softmax at the output and the options are item 1, item 2, both, or none, but I don’t think softmax should be used as it penalizes the model for identifying item 1 and 2 in the cases where both are present (I.e. classes are not mutually exclusive) I was thinking of using 3 classes (item 1, item 2, or neither) with sigmoid at the output but Im starting to question whether I even need the “neither option”. I could just train to identify the 2 classes and add examples where neither are present. Then on model runtime I could just return neither if both class predictions return a probability less than 0.5. Thoughts? submitted by /u/MGeeeeeezy [link] [comments]  ( 2 min )
    [D] What pisses off the Data scientists the most - Results
    Yesterday I asked a question to redditors in this group what biggest pain points they face at work which pisses them off. To my biggest surprise here is top most thing that pisses off the Data scientists the most: Spaghetti programs: Data scientists who don't follow any good software programming principles which kills the path to reproducibility, collaboration and production. I started thinking more about this and I would like to hear your thoughts on how we can solve this problem? From the comments I realized that, many data scientists actually care very much about writing good quality program, but the biggest hurdle is just that they don't know how to write them because most of them come from Non-CS background. How do you think we can solve this problem? submitted by /u/Spirited-Order4409 [link] [comments]  ( 3 min )
    [R] Skilful precipitation nowcasting using deep generative models of radar - Link to a free online lecture by the author in comments (deepmind research published in nature)
    submitted by /u/pinter69 [link] [comments]  ( 2 min )
    [P] A more detailed post about Silero VAD on The Gradient
    We posted our VAD demo here a while ago. Here's a follow-up article on The Gradient, where we attempt to explain: Which values we did pursue; Why we decided to create our own VAD; Which criteria and metrics we optimized; A brief overview of what is available in general; How it compares with well-established and similar class solutions. Links: The article The VAD is always available on Github submitted by /u/cluecow [link] [comments]  ( 1 min )
    [D] PhD thesis on an optimization problem that did not work.
    Hey, I am seeking advice from people with some experience or insight into academia, otherwise it may be a bit boring :) The question is about discussing unsuccessful ML (sort of) project in a thesis. I started a PhD in biophysics 4.5 years ago. I learned to code and decided I want to make it into a career. I took a risk and for the final chapter of my thesis, I suggested to do an image processing project - regarding the analysis of complex microscopy data that no one has done before (or at least not published for sure). It was also useful to continue having some work to do during the lockdowns. My supervisor was kind of "Sounds fun, do what you want". Even though I learned a lot and the risk paid off in terms of career, as I already have a job offer as a mid-level software developer, I f…  ( 7 min )
    [D] Optimization (minization) problem - hitting boundaries
    I'm struggling with a challenging optimization problem with real-world experimental data. Simply put, it's about fitting a exponential decay model to a curve (decay). Essentially, I am trying to predict 3 parameters from the curve. The range of expected values for each parameter is well known, so I have set hard boundaries (also to avoid mathematical errors like divide by 0). If you imagine I have a problem set of hundredths of decays, fitting all of them and making a histogram of each predicted parameter should give somewhat of a normal distribution (roughly) for each parameter. In reality, usually at least two of the parameters are predicted to be at the boundary in 90% of the cases. In reality, this parameter should be around 0.3 - 0.4. So my boundary range is already generous enough to allow for outliers. https://preview.redd.it/wtj54ugjuxi81.png?width=312&format=png&auto=webp&s=3bd0191919ec5dd316a3bb1675a03a263d12655a This happens wth any method I try, whether non-linear least squares or some of the gradient or gradient-free minimizers. The number of boundary guesses does depend on the characteristics of the problem set (strength of signal etc), so I guess it is a matter of inability to do a proper fit (no one has done this before, so the problem really is challenging). However, nonetheless it converges successfully, leading to this distribution of values. Can anyone help me out with an explanation why this is occurring, or the scientific terms for this so I know what to look for while trying to solve my problem? submitted by /u/anon_scientist_01 [link] [comments]  ( 2 min )
    Time series dataset for playing around? [D]
    I'm trying to come up with a new time series prediction algorithm and wonder if you guys have any suggestion for a good MNIST-like time series dataset I can use for quick prototyping. submitted by /u/mister_chuunibyou [link] [comments]  ( 2 min )
    [R] Can we modify WGAN with gradient penalty as a better form?
    I have encountered a rather interesting problem in my current work, and would like to come to the ground to ask your opinion. The gradient penalty of WGAN is an interpolation between G(fake) and real, but my data is special in that fake is not noise, but it is a class of data with G(fake) (which can be interpreted as fake=G(fake)+noise). So my initial interpolation was done between fake and real, so I think since fake, G(fake) and real are a class of data, the area covered by the line between the interpolation should be larger than the area covered by the line between G(fake) and real. So our gradient penalty in this case can be done between fake and real. I would like to ask if anyone has a similar idea as mine? If I plan to prove this, is there a mathematical method I can go to try to get started? thanks submitted by /u/iLIVECSUI_741 [link] [comments]
  • Open

    Neural nets are not "slightly conscious," and AI PR can do with less hype
    submitted by /u/nickb [link] [comments]  ( 1 min )
    Too lazy to code correct logic to play Chrome Dino so why not let an NN do it for me!
    submitted by /u/hotcodist [link] [comments]  ( 1 min )
  • Open

    Decision tree
    My blog on decision tree https://medium.com/@wasimmadha/making-decisions-using-trees-40cf93c0319f submitted by /u/wasimakil [link] [comments]
    Neural nets are not "slightly conscious," and AI PR can do with less hype
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    AI has a weird sense of humor
    submitted by /u/danbmil99 [link] [comments]
    ASAPP AI Researchers Propose a Family of Pre-Trained Models (SEW) for Automatic Speech Recognition (ASR) That Could be Significantly Better in Performance-Efficiency Than the Existing Wav2Vec 2.0 Architecture
    Recent research in natural language processing and computer vision has aimed to increase the efficiency of pre-trained models to reduce the financial and environmental expenses of training and fine-tuning them. We haven’t seen comparable attempts in speaking for whatever reason. Efficiency advances in speech might mean better performance for similar inference times, in addition to cost savings related to more efficient training of pre-trained models. Due to a self-supervised training paradigm, Wav2Vec 2.0 (W2V2) is one of the current state-of-the-art models for Automatic Speech Recognition. This training method enables us to pre-train a model using unlabeled data, which is always more readily available. The model may then be fine-tuned for a specific purpose on a given dataset. It has attracted a lot of interest and follow-up work for using pre-trained W2V2 models in various downstream applications, such as speech-to-text translation (Wang et al., 2021) and named entity recognition (Shon et al., 2021). However, researchers believe that the model architecture has several sub-optimal design decisions that render it inefficient. To back up this claim, researchers ran a series of tests on various components of the W2V2 model architecture, revealing the performance-efficiency tradeoff in the W2V2 model design space. Higher performance (lower ASR word mistake rate) necessitates a bigger pre-trained model and poorer efficiency (inference speed). CONTINUE READING Paper: https://arxiv.org/pdf/2109.06870.pdf Github: https://github.com/asappresearch/sew https://preview.redd.it/dqsd26wi51j81.png?width=1167&format=png&auto=webp&s=1851ff8f9daeb0d5fed870960a67881fd875509a submitted by /u/techsucker [link] [comments]  ( 1 min )
    Interpretable Natural Language Processing (INLP) Workshop - June 21-24, 2022
    submitted by /u/akolonin [link] [comments]  ( 1 min )
    Artificial Nightmares : Don't Look Under The Bed || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Stony Brook University Researchers Introduce ‘StandardSim’: A Large-Scale Photorealistic Synthetic Dataset Featuring Annotations For Semantic Segmentation, Instance Segmentation, Depth Estimation, And Object Detection
    Autonomous checkout is a rapidly advancing technology that has the potential to transform the way people shop in physical establishments. It frequently uses cameras and other sensors to gain a sense of a shopping environment and make a final conclusion about what a buyer buys. In systems where cameras are the only sensors available, computer vision is critical for comprehending this data. While vision-only autonomous checkout is still a relatively new concept, no benchmarks or new tasks have emerged. In a recent study, researchers from Stony Brook University and Standard Cognition hypothesized that understanding retail settings necessitates not only domain-specific data but also a new computer vision task that detects changes in retail sceneries over time. Thus, they provide a new dataset, StandardSim, as well as a novel goal for detecting changes in retail scenarios over time in this study. Quick Read: https://www.marktechpost.com/2022/02/19/stony-brook-university-researchers-introduce-standardsim-a-large-scale-photorealistic-synthetic-dataset-featuring-annotations-for-semantic-segmentation-instance-segmentation-depth-estimation-a/ Paper: https://arxiv.org/pdf/2202.02418.pdf Github: https://github.com/standard-ai/Standard-Sim Project: https://standard-ai.github.io/Standard-Sim/ https://i.redd.it/m9esh8383xi81.gif submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Update : 300 Backers! Announcing 4 Stretch Goals! · Deep Learning With TensorFlow & Keras
    submitted by /u/spmallick [link] [comments]
    DeepMind AI and Nuclear Fusion?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Artificial Nightmares : Cybersam || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    How to create a reward function?
    There is a domain, which is a robot planning problem and some features are available. For example the location of the robot, the distance to the goal and the angle of the obstacles. What is missing is the reward function. So the question is how to create the reward function from the features? submitted by /u/ManuelRodriguez331 [link] [comments]  ( 1 min )
    Quantum Computer + AutoML + Deep Reinforcement Learning = ?
    submitted by /u/thaiminhpv [link] [comments]  ( 1 min )
    Alternative reward function to reward or penalize distance
    Hello everyone! I'm training a Rocket League Bot to score goals. The task is very easy if the ball is close to the goal. So basically any touch would succeed. Much more difficult is the task if the ball is rolling farther away from the goal. Now the agent has to make a good touch on the ball to score. Only 1 out of 5 training runs using PPO works out well. So I started tuning the reward function which only signaled a reward of +1 upon scoring. I considered the following reward functions: +0.1 only once for touching the ball penalize the agent for every step: -0.001 / maxSteps 0.01 / maxSteps * (1 - (Distance(ball position, agent position) / maxDistance)) every step -0.01 / maxSteps * (1 - (Distance(ball position,agent position) / maxDistance)) every step The later two reward functions are only apparent if the ball wasn't touched and if the agent got closer to the ball than the min distance of the current episode. Experimenting with these reward functions did not yield any improvement. I really dislike reward shaping as it introduces more bias and more unwanted local maxima by exploiting dense rewards. Does anybody have another idea to treat the reward function in this scenario? submitted by /u/LilHairdy [link] [comments]  ( 1 min )
    Quadratic Q-function
    In https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/b8ZIT/mdp-formulation ,why there is a quadratic relationship (Var[ ∏t ]) in equation 31 ? I have watched through and rewatched the whole video, but I could not understand how the quadratic relationship is being derived. https://preview.redd.it/euy4tjpijzi81.png?width=954&format=png&auto=webp&s=d5f535119b82a0dde9e41293f1e980fcb671fb2b Note: Equation 5 originates from this video : https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/9Zi1w/discrete-time-bsm-model https://preview.redd.it/xgvd9j05pzi81.png?width=895&format=png&auto=webp&s=9ead0d6698cce4556cfd94c8bb350fa3d35bb2d8 See also the related equation 10 that is from this other video : https://www.coursera.org/learn/reinforcement-learning-in-finance/lecture/XEixV/discrete-time-bsm-hedging-and-pricing as well as the derivation process details https://preview.redd.it/fvgej7vfpzi81.png?width=964&format=png&auto=webp&s=e93addcdfa0a785161bd263fd0660810d761c126 submitted by /u/promach [link] [comments]  ( 1 min )
    Does learning rate decay make sense for reinforcement learning (e.g. DQN)?
    The reason for my question is, that from a theoretical view I think that learning rate decay makes totally sense for a given static dataset on which a neural network is trained on, to stabilise and converge the training process. For reinforcement learning the dataset is created and extended throughout the learning process, such that the network sometimes gets to know new important states, which aren't comparable to states from before and for which it has to learn new relationships. If the learning rate is decayed throughout this process, the network maybe isn't able to learn in the same speed from the new states as from the previous ones. So the learning can be stuck. Is my thinking correct? And if yes, what are other methods to stabilise the learning process, when learning rate decaying is not possible due to the upper reason? I've found a similar question with references to google and openai here: Is learning rate decay used in Reinforcement Learning But the theory behind it isn't totally clarified. submitted by /u/TwTC8 [link] [comments]  ( 2 min )
    which do you think are the most interesting MARL research directions?
    submitted by /u/ManuelRios18 [link] [comments]  ( 1 min )

  • Open

    Dual Confidence Regions: A Simple Introduction
    This tutorial explains how to build confidence regions (the 2D version of a confidence interval) using as little statistical theory as possible. I also avoid the traditional terminology and notation such as α, Z1-α, critical value, confidence level, significance level and so on. These can be confusing to beginners and professionals alike. Instead, I use… Read More »Dual Confidence Regions: A Simple Introduction The post Dual Confidence Regions: A Simple Introduction appeared first on Data Science Central.  ( 5 min )
  • Open

    [D] Remote policy for hardware requirements?
    For those of you in ML or DL who have remote work what is your companies protocol for hardware you need to train models? Do you doeverything through the cloud or do they purchase you hardware to train models? submitted by /u/AbjectDrink3276 [link] [comments]
    [R] [2202.02831] Anticorrelated Noise Injection for Improved Generalization
    submitted by /u/fasttosmile [link] [comments]
    [P] Long-Form QA beyond ELI5: an updated dataset and approach
    Hi all, For everyone interested in NLP and especially QA and Long-Form QA, we released a new LFQA dataset last week. Along with the LFQA dataset, we released a new version of the BART-based Seq2Seq model for abstractive answer generation and matching DPR-based encoders for questions and context passages. To mark the release, we also made a HuggingFace Spaces demo of the Wikipedia Assistant at: https://huggingface.co/spaces/lfqa/lfqa More about the dataset and the model at https://towardsdatascience.com/long-form-qa-beyond-eli5-an-updated-dataset-and-approach-319cb841aabb All code is open source. Happy to hear your feedback. Let me know if you have any questions. submitted by /u/vblagoje [link] [comments]  ( 1 min )
    [D] How's the market for MLOps?
    Hey all. I think it's pretty well known that there's an abundance of entry-level candidates for DS or MLE positions. However, is it the same for MLOps? I know the term is pretty new but what I'm thinking of is a DevOps Engineer working in a DS/ML setting. Looking at the offers, it seems like companies ask for a typical DO stack and - on top of that - things like technical knowledge of AI/ML, MLOps tools like Kubeflow or Sagemaker, even some DE (Spark, Hadoop). That's a lot and I'm wondering if it's even worth the effort to learn these things given that: - there's not a lot of offers compared to general DO positions, - there's a lot of candidates trying to break into the field that could potentially (at least in theory) take these positions. What's your experience? Do you look for such people or do you expect your MLEs/DEs to take care of that stuff? If the former, are there many applicants just like in the case of DS and ML? submitted by /u/Shkigo [link] [comments]  ( 2 min )
    [D] How to feed features and raw image data into network simultaneously?
    On top of raw image data, I would like to feed extra numerical and categorical features into my model. What are the standard ways of doing this? I found surprisingly little on this online. All I saw were guides recommending to concatenate the extra features onto the flattened image vector before sending it to the model, so if I have a 30x30 image and 4 extra features, I'd be training the model on 904-length vectors. Which sounds ridiculous to me because in doing that you're putting the extra features on the same level as single pixels, even though they have far more information in them than single pixels do. submitted by /u/woodenlathe [link] [comments]  ( 1 min )
    [D] Things that pisses you as a Data scientist
    I have been a Data scientist since seven years. There are several challenges we face everyday and Till this day, something that absolutely pisses me off is not having a single good IDE for prototyping and production development. I constantly see myself switching between Jupyterlab and VScode and it's really annoying! Anyways, I just want to hear what are the other biggest pain points you face as a Data scientist in your everyday work that absolutely pisses you off! submitted by /u/Spirited-Order4409 [link] [comments]  ( 3 min )
    [D]Speed up inference on Spark
    Currently I use TensorflowOnSpark frame to train and predict model. When prediction, I have billions of samples to predict which is time-consuming. I wonder if there is some good practices on this. submitted by /u/Brilliant-Fun9430 [link] [comments]  ( 1 min )
    [P] Invitation for a reference list for machine learning theory
    Hello, Machine learning theory is a recent trend, intersecting pure math, computer science, and statistics. Its goal is understanding the fundamental principles behind automated learning, Rather than designing concrete algorithms for solving practical applications. Its emphasis is on mathematical technique and rigour using proofs. See a very brief intro here by Avrim Blum. I couldn't find any introductory book for the field. Also most machine learning big lists are devoted to dump programming frameworks. I wish if the community cared more about matured content. I had contributed before to TCS and Math awesome lists, and it would be nice if you joined your experience to build them up. Are you aware of any good references for machine learning theory, As described by Avrim's intro? You can reply here or pull request to github repos. submitted by /u/xTouny [link] [comments]  ( 2 min )
    [P] How to build Large-Scale Recommendation Systems
    submitted by /u/beluis3d [link] [comments]  ( 1 min )
    [R] Making Sense Of Raw Input
    https://www.sciencedirect.com/science/article/pii/S0004370221000722 ​ How should a machine intelligence perform unsupervised structure discovery over streams of sensory input? One approach to this problem is to cast it as an apperception task. Here, the task is to construct an explicit interpretable theory that both explains the sensory sequence and also satisfies a set of unity conditions, designed to ensure that the constituents of the theory are connected in a relational structure. However, the original formulation of the apperception task had one fundamental limitation: it assumed the raw sensory input had already been parsed using a set of discrete categories, so that all the system had to do was receive this already-digested symbolic input, and make sense of it. But what if we don't have access to pre-parsed input? What if our sensory sequence is raw unprocessed information? The central contribution of this paper is a neuro-symbolic framework for distilling interpretable theories out of streams of raw, unprocessed sensory experience. First, we extend the definition of the apperception task to include ambiguous (but still symbolic) input: sequences of sets of disjunctions. Next, we use a neural network to map raw sensory input to disjunctive input. Our binary neural network is encoded as a logic program, so the weights of the network and the rules of the theory can be solved jointly as a single SAT problem. This way, we are able to jointly learn how to perceive (mapping raw sensory information to concepts) and apperceive (combining concepts into declarative rules). submitted by /u/EducationalCicada [link] [comments]  ( 1 min )
  • Open

    Meta AI Researchers Upgrade Their Machine Learning-Based Image Segmentation Models For Better Virtual Backgrounds in Video Calls And Metaverse
    When video chatting with colleagues, coworkers, or family, many of us have grown accustomed to using virtual backgrounds and background filters. It has been shown to offer more control over the surroundings, allowing fewer distractions, preserving the privacy of those around us, and even liven up our virtual presentations and get-togethers. However, Background filters don’t always work as expected or perform well for everyone. Image segmentation is a computer vision process of separating the different components of a photo or video. It has been widely used to improve backdrop blurring, virtual backgrounds, and other augmented reality (AR) effects. Despite advanced algorithms, achieving highly accurate person segmentation seems challenging. Continue Reading Meta Source: https://ai.facebook.com/blog/creating-better-virtual-backdrops-for-video-calling-remote-presence-and-ar/ ​ https://reddit.com/link/swkjkr/video/o57elp9gzui81/player submitted by /u/ai-lover [link] [comments]  ( 1 min )
    NVIDIA’s New AI: Wow, Instant Neural Graphics!
    submitted by /u/MarS_0ne [link] [comments]  ( 1 min )
    Taking my project to the next level and implement AI?
    Hi there AI-Community :) Intro: Right now my projects performs a task of putting together a PC configuration. It is kind of hard coded python script. The script does have its trouble with the ongoing GPU shortage (and the resulting high prices), so I was wondering if a AI could help in this case? I have barely any knowledge about AI so I beg you to don't be too harsh to me. Status-quo: Things I do have: the data (about 4.000 pc components) all the details about the components (length, price, power needed, etc.) ranking system (only ranking performance not Price/performance) knowledge about the restrictions ( = which components work with each other and which don't) Question(s): So now lets come to my question(s)! Where do I start? Do I have to learn all the basics first or can I dive right into my project? Are there any good tutorial resources? Would you teach the AI the given constraints (correct chipset, enough power, which parts are essential and which are not, etc.) or let it learn them on its own? Can you think of a good feedback system? In other words, how can I check if the resulting pc configuration is "good" or "bad" except for matching the budget you choose. Can I use/implement the idea of the "knapsack problem" as a base construct and build the AI around? Thanks for any help. submitted by /u/JohnDere [link] [comments]  ( 1 min )
    10 Quotes & Thoughts On The Dangers Of Artificial Intelligence
    submitted by /u/Philo167 [link] [comments]
    One cool live event by Geoffrey Hinton (Godfather of AI), for free.
    So one of the clubs in my college has done something quite impossible by roping in Sir Geoffrey Hinton, a living legend! Would be great if y'all join in, it's actually kinda a one-time opportunity to interact with him live, so do not miss it! Link for RSVP Youtube Link submitted by /u/_AVINIER [link] [comments]  ( 1 min )
    MIT talk by Max Tegmark on the Future of AI:
    https://www.youtube.com/watch?v=IGOuV6UyQ1Q&list=PLhCQIYxdniNuu422uTU91PRHoj3gxF0uD&index=3 submitted by /u/Corddot [link] [comments]
    What in your opinion is the most impressive AI feat to date?
    I think was deep mind/alpha star beating the star craft 2 players at RTS submitted by /u/Fun_Sun_97 [link] [comments]  ( 1 min )
    ‘Simple’ AI Can Anticipate Bank Managers’ Loan Decisions to Over 95% Accuracy
    submitted by /u/DaveBowman1975 [link] [comments]
    InspiroBot quotes used as AI text to image prompts
    submitted by /u/glenniszen [link] [comments]  ( 1 min )
    Researchers From Nankai and Stanford Propose ‘DeepDrug’: A Python Based Deep Learning Framework For Drug Relation Prediction
    Drug discovery includes looking for biomedical connections between chemical compounds (drugs, chemicals) and protein targets. Drugs interact with biological systems on a fundamental level by binding to protein targets and influencing their downstream action. Predicting Drug-Target Interactions (DTIs) is crucial for identifying therapeutic targets or drug target characteristics. Understanding and forecasting higher-level information such as side effects, therapeutic mechanisms, and even innovative insights for drug repositioning or repurposing can all be aided by DTI knowledge. Sildenafil, for example, was originally created to treat pulmonary hypertension, but after its adverse effects were discovered, it was repurposed to treat erectile dysfunction. Polypharmacy has also become a viable method among pharmacists because most human diseases are complex biological processes that are resistant to the activity of anyone drug. Drug-Drug Interaction (DDI) prediction and validation can sometimes identify possible synergy in drug combinations, allowing individual drugs to be more effective. Continue Reading Paper: https://www.biorxiv.org/content/10.1101/2020.11.09.375626v1.full.pdf Github: https://github.com/wanwenzeng/deepdrug https://preview.redd.it/2g3rd5bgaqi81.png?width=1202&format=png&auto=webp&s=13e7360d730fe9b0cbd5bb69bbce4a872def273c submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Do you think we'll ever be able to generate fake episodes of TV shows?
    With how AI-generated voices are becoming creepily realistic and once impossible AI text generators like GPT-3 and image generation becoming possible, all in the last few years, it begs the question; Can an AI eventually generate fake episodes? I imagine this would be more possible with cartoons than live-action, since the AI would simply need to write a script of an episode/dialogue, generate animation, and then generate voices. Current AI can do this extraordinarily well, and I imagine it will improve exponentially in the next couple of years. While animation might be tricky and slightly buggy at times to generate, who knows how far AI will come in the next few years animation/image generation-wise. submitted by /u/katiebug586 [link] [comments]  ( 1 min )
  • Open

    Any resources to learn LSTM and more specifically Bidirectional LSTM? I am new to deep learning! Preferably courses that can ELI5?
    submitted by /u/nornirishnepalese [link] [comments]  ( 1 min )
    Researchers From Nankai and Stanford Propose ‘DeepDrug’: A Python Based Deep Learning Framework For Drug Relation Prediction
    Drug discovery includes looking for biomedical connections between chemical compounds (drugs, chemicals) and protein targets. Drugs interact with biological systems on a fundamental level by binding to protein targets and influencing their downstream action. Predicting Drug-Target Interactions (DTIs) is crucial for identifying therapeutic targets or drug target characteristics. Understanding and forecasting higher-level information such as side effects, therapeutic mechanisms, and even innovative insights for drug repositioning or repurposing can all be aided by DTI knowledge. Sildenafil, for example, was originally created to treat pulmonary hypertension, but after its adverse effects were discovered, it was repurposed to treat erectile dysfunction. Polypharmacy has also become a viable method among pharmacists because most human diseases are complex biological processes that are resistant to the activity of anyone drug. Drug-Drug Interaction (DDI) prediction and validation can sometimes identify possible synergy in drug combinations, allowing individual drugs to be more effective. Continue Reading Paper: https://www.biorxiv.org/content/10.1101/2020.11.09.375626v1.full.pdf Github: https://github.com/wanwenzeng/deepdrug https://preview.redd.it/hz5mimoeaqi81.png?width=1202&format=png&auto=webp&s=6d4fc1a20dd3702f655f0ba7297065e0cdf8d897 submitted by /u/ai-lover [link] [comments]  ( 1 min )
  • Open

    New Idea about value iteration (Maybe)
    So as we know that value iteration is 1 policy evaluation and 1 policy improvement, I got an idea which is Shin_value_iteration(n), if n = 1, then it is normal value iteration, if n = a number until convergence, then it is policy iteration. Then why not try n = 2, or n = 3 to see if it converges faster. Idk if this is an idea that already exists or not. submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    Clarifying RL: Obscure problem formulations and structure tradeoffs!
    submitted by /u/MartialWuxia [link] [comments]
    "Retrieval-Augmented Reinforcement Learning", Goyal et al 2022 {DM} (DQN/R2D2)
    submitted by /u/gwern [link] [comments]
  • Open

    Toward a stronger defense of personal data
    Engineers build a lower-energy chip that can prevent hackers from extracting hidden information from a smart device.  ( 6 min )
  • Open

    End-to-end Alexa Device Arbitration. (arXiv:2112.04914v2 [eess.AS] UPDATED)
    We introduce a variant of the speaker localization problem, which we call device arbitration. In the device arbitration problem, a user utters a keyword that is detected by multiple distributed microphone arrays (smart home devices), and we want to determine which device was closest to the user. Rather than solving the full localization problem, we propose an end-to-end machine learning system. This system learns a feature embedding that is computed independently on each device. The embeddings from each device are then aggregated together to produce the final arbitration decision. We use a large-scale room simulation to generate training and evaluation data, and compare our system against a signal processing baseline.  ( 2 min )
    ECQ$^{\text{x}}$: Explainability-Driven Quantization for Low-Bit and Sparse DNNs. (arXiv:2109.04236v2 [cs.LG] UPDATED)
    The remarkable success of deep neural networks (DNNs) in various applications is accompanied by a significant increase in network parameters and arithmetic operations. Such increases in memory and computational demands make deep learning prohibitive for resource-constrained hardware platforms such as mobile devices. Recent efforts aim to reduce these overheads, while preserving model performance as much as possible, and include parameter reduction techniques, parameter quantization, and lossless compression techniques. In this chapter, we develop and describe a novel quantization paradigm for DNNs: Our method leverages concepts of explainable AI (XAI) and concepts of information theory: Instead of assigning weight values based on their distances to the quantization clusters, the assignment function additionally considers weight relevances obtained from Layer-wise Relevance Propagation (LRP) and the information content of the clusters (entropy optimization). The ultimate goal is to preserve the most relevant weights in quantization clusters of highest information content. Experimental results show that this novel Entropy-Constrained and XAI-adjusted Quantization (ECQ$^{\text{x}}$) method generates ultra low-precision (2-5 bit) and simultaneously sparse neural networks while maintaining or even improving model performance. Due to reduced parameter precision and high number of zero-elements, the rendered networks are highly compressible in terms of file size, up to $103\times$ compared to the full-precision unquantized DNN model. Our approach was evaluated on different types of models and datasets (including Google Speech Commands, CIFAR-10 and Pascal VOC) and compared with previous work.  ( 2 min )
    Novel Features for Time Series Analysis: A Complex Networks Approach. (arXiv:2110.09888v3 [cs.SI] UPDATED)
    Being able to capture the characteristics of a time series with a feature vector is a very important task with a multitude of applications, such as classification, clustering or forecasting. Usually, the features are obtained from linear and nonlinear time series measures, that may present several data related drawbacks. In this work we introduce NetF as an alternative set of features, incorporating several representative topological measures of different complex networks mappings of the time series. Our approach does not require data preprocessing and is applicable regardless of any data characteristics. Exploring our novel feature vector, we are able to connect mapped network features to properties inherent in diversified time series models, showing that NetF can be useful to characterize time data. Furthermore, we also demonstrate the applicability of our methodology in clustering synthetic and benchmark time series sets, comparing its performance with more conventional features, showcasing how NetF can achieve high-accuracy clusters. Our results are very promising, with network features from different mapping methods capturing different properties of the time series, adding a different and rich feature set to the literature.  ( 2 min )
    (Optimal) Online Bipartite Matching with Predicted Degrees. (arXiv:2110.11439v2 [cs.DS] UPDATED)
    We propose a model for online graph problems where algorithms are given access to an oracle that predicts (e.g., based on past data) the degrees of nodes in the graph. Within this model, we study the classic problem of online bipartite matching, and a natural greedy matching algorithm called MinPredictedDegree, which uses predictions of the degrees of offline nodes. For the bipartite version of a stochastic graph model due to Chung, Lu, and Vu where the expected values of the offline degrees are known and used as predictions, we show that MinPredictedDegree stochastically dominates any other online algorithm, i.e., it is optimal for graphs drawn from this model. Since the "symmetric" version of the model, where all online nodes are identical, is a special case of the well-studied "known i.i.d. model", it follows that the competitive ratio of MinPredictedDegree on such inputs is at least 0.7299. For the special case of graphs with power law degree distributions, we show that MinPredictedDegree frequently produces matchings almost as large as the true maximum matching on such graphs. We complement these results with an extensive empirical evaluation showing that MinPredictedDegree compares favorably to state-of-the-art online algorithms for online matching.  ( 2 min )
    Provable Regret Bounds for Deep Online Learning and Control. (arXiv:2110.07807v3 [cs.LG] UPDATED)
    The theory of deep learning focuses almost exclusively on supervised learning, non-convex optimization using stochastic gradient descent, and overparametrized neural networks. It is common belief that the optimizer dynamics, network architecture, initialization procedure, and other factors tie together and are all components of its success. This presents theoretical challenges for analyzing state-based and/or online deep learning. Motivated by applications in control, we give a general black-box reduction from deep learning to online convex optimization. This allows us to decouple optimization, regret, expressiveness, and derive agnostic online learning guarantees for fully-connected deep neural networks with ReLU activations. We quantify convergence and regret guarantees for any range of parameters and allow any optimization procedure, such as adaptive gradient methods and second order methods. As an application, we derive provable algorithms for deep control in the online episodic setting.  ( 2 min )
    DataWords: Getting Contrarian with Text, Structured Data and Explanations. (arXiv:2111.05384v2 [cs.LG] UPDATED)
    Our goal is to build classification models using a combination of free-text and structured data. To do this, we represent structured data by text sentences, DataWords, so that similar data items are mapped into the same sentence. This permits modeling a mixture of text and structured data by using only text-modeling algorithms. Several examples illustrate that it is possible to improve text classification performance by first running extraction tools (named entity recognition), then converting the output to DataWords, and adding the DataWords to the original text -- before model building and classification. This approach also allows us to produce explanations for inferences in terms of both free text and structured data.  ( 2 min )
    Towards Evaluating the Robustness of Neural Networks Learned by Transduction. (arXiv:2110.14735v2 [cs.LG] UPDATED)
    There has been emerging interest in using transductive learning for adversarial robustness (Goldwasser et al., NeurIPS 2020; Wu et al., ICML 2020; Wang et al., ArXiv 2021). Compared to traditional defenses, these defense mechanisms "dynamically learn" the model based on test-time input; and theoretically, attacking these defenses reduces to solving a bilevel optimization problem, which poses difficulty in crafting adaptive attacks. In this paper, we examine these defense mechanisms from a principled threat analysis perspective. We formulate and analyze threat models for transductive-learning based defenses, and point out important subtleties. We propose the principle of attacking model space for solving bilevel attack objectives, and present Greedy Model Space Attack (GMSA), an attack framework that can serve as a new baseline for evaluating transductive-learning based defenses. Through systematic evaluation, we show that GMSA, even with weak instantiations, can break previous transductive-learning based defenses, which were resilient to previous attacks, such as AutoAttack. On the positive side, we report a somewhat surprising empirical result of "transductive adversarial training": Adversarially retraining the model using fresh randomness at the test time gives a significant increase in robustness against attacks we consider.  ( 2 min )
    Signing the Supermask: Keep, Hide, Invert. (arXiv:2201.13361v2 [cs.LG] UPDATED)
    The exponential growth in numbers of parameters of neural networks over the past years has been accompanied by an increase in performance across several fields. However, due to their sheer size, the networks not only became difficult to interpret but also problematic to train and use in real-world applications, since hardware requirements increased accordingly. Tackling both issues, we present a novel approach that either drops a neural network's initial weights or inverts their respective sign. Put simply, a network is trained by weight selection and inversion without changing their absolute values. Our contribution extends previous work on masking by additionally sign-inverting the initial weights and follows the findings of the Lottery Ticket Hypothesis. Through this extension and adaptations of initialization methods, we achieve a pruning rate of up to 99%, while still matching or exceeding the performance of various baseline and previous models. Our approach has two main advantages. First, and most notable, signed Supermask models drastically simplify a model's structure, while still performing well on given tasks. Second, by reducing the neural network to its very foundation, we gain insights into which weights matter for performance. The code is available on GitHub.  ( 2 min )
    Offline Preference-Based Apprenticeship Learning. (arXiv:2107.09251v3 [cs.LG] UPDATED)
    Learning a reward function from human preferences is challenging as it typically requires having a high-fidelity simulator or using expensive and potentially unsafe actual physical rollouts in the environment. However, in many tasks the agent might have access to offline data from related tasks in the same target environment. While offline data is increasingly being used to aid policy optimization via offline RL, our observation is that it can be a surprisingly rich source of information for preference learning as well. We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning, learns a distribution over reward functions, and optimizes a corresponding policy via offline RL. Crucially, our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps. To test our approach, we identify a subset of existing offline RL benchmarks that are well suited for offline reward learning and also propose new offline apprenticeship learning benchmarks which allow for more open-ended behaviors. Our empirical results suggest that combining offline RL with learned human preferences can enable an agent to learn to perform novel tasks that were not explicitly shown in the offline data.  ( 2 min )
    Self-Supervised Representation Learning via Latent Graph Prediction. (arXiv:2202.08333v1 [cs.LG])
    Self-supervised learning (SSL) of graph neural networks is emerging as a promising way of leveraging unlabeled data. Currently, most methods are based on contrastive learning adapted from the image domain, which requires view generation and a sufficient number of negative samples. In contrast, existing predictive models do not require negative sampling, but lack theoretical guidance on the design of pretext training tasks. In this work, we propose the LaGraph, a theoretically grounded predictive SSL framework based on latent graph prediction. Learning objectives of LaGraph are derived as self-supervised upper bounds to objectives for predicting unobserved latent graphs. In addition to its improved performance, LaGraph provides explanations for recent successes of predictive models that include invariance-based objectives. We provide theoretical analysis comparing LaGraph to related methods in different domains. Our experimental results demonstrate the superiority of LaGraph in performance and the robustness to decreasing of training sample size on both graph-level and node-level tasks.  ( 2 min )
    Improving Experience Replay with Successor Representation. (arXiv:2111.14331v2 [cs.LG] UPDATED)
    Prioritized experience replay is a reinforcement learning technique whereby agents speed up learning by replaying useful past experiences. This usefulness is quantified as the expected gain from replaying the experience, a quantity often approximated as the prediction error (TD-error). However, recent work in neuroscience suggests that, in biological organisms, replay is prioritized not only by gain, but also by "need" -- a quantity measuring the expected relevance of each experience with respect to the current situation. Importantly, this term is not currently considered in algorithms such as prioritized experience replay. In this paper we present a new approach for prioritizing experiences for replay that considers both gain and need. Our proposed algorithms show a significant increase in performance in benchmarks including the Dyna-Q maze and a selection of Atari games.  ( 2 min )
    Linear Convergence of Entropy-Regularized Natural Policy Gradient with Linear Function Approximation. (arXiv:2106.04096v3 [cs.LG] UPDATED)
    Natural policy gradient (NPG) methods with entropy regularization achieve impressive empirical success in reinforcement learning problems with large state-action spaces. However, their convergence properties and the impact of entropy regularization remain elusive in the function approximation regime. In this paper, we establish finite-time convergence analyses of entropy-regularized NPG with linear function approximation under softmax parameterization. In particular, we prove that entropy-regularized NPG with averaging satisfies the \emph{persistence of excitation} condition, and achieves a fast convergence rate of $\tilde{O}(1/T)$ up to a function approximation error in regularized Markov decision processes. This convergence result does not require any a priori assumptions on the policies. Furthermore, under mild regularity conditions on the concentrability coefficient and basis vectors, we prove that entropy-regularized NPG exhibits \emph{linear convergence} up to a function approximation error.  ( 2 min )
    B2EA: An Evolutionary Algorithm Assisted by Two Bayesian Optimization Modules for Neural Architecture Search. (arXiv:2202.03005v2 [cs.NE] UPDATED)
    The early pioneering Neural Architecture Search (NAS) works were multi-trial methods applicable to any general search space. The subsequent works took advantage of the early findings and developed weight-sharing methods that assume a structured search space typically with pre-fixed hyperparameters. Despite the amazing computational efficiency of the weight-sharing NAS algorithms, it is becoming apparent that multi-trial NAS algorithms are also needed for identifying very high-performance architectures, especially when exploring a general search space. In this work, we carefully review the latest multi-trial NAS algorithms and identify the key strategies including Evolutionary Algorithm (EA), Bayesian Optimization (BO), diversification, input and output transformations, and lower fidelity estimation. To accommodate the key strategies into a single framework, we develop B2EA that is a surrogate assisted EA with two BO surrogate models and a mutation step in between. To show that B2EA is robust and efficient, we evaluate three performance metrics over 14 benchmarks with general and cell-based search spaces. Comparisons with state-of-the-art multi-trial algorithms reveal that B2EA is robust and efficient over the 14 benchmarks for three difficulty levels of target performance. The B2EA code is publicly available at \url{https://github.com/snu-adsl/BBEA}.  ( 2 min )
    The Mirrornet : Learning Audio Synthesizer Controls Inspired by Sensorimotor Interaction. (arXiv:2110.05695v3 [eess.AS] UPDATED)
    Experiments to understand the sensorimotor neural interactions in the human cortical speech system support the existence of a bidirectional flow of interactions between the auditory and motor regions. Their key function is to enable the brain to `learn' how to control the vocal tract for speech production. This idea is the impetus for the recently proposed "MirrorNet", a constrained autoencoder architecture. In this paper, the MirrorNet is applied to learn, in an unsupervised manner, the controls of a specific audio synthesizer (DIVA) to produce melodies only from their auditory spectrograms. The results demonstrate how the MirrorNet discovers the synthesizer parameters to generate the melodies that closely resemble the original and those of unseen melodies, and even determine the best set parameters to approximate renditions of complex piano melodies generated by a different synthesizer. This generalizability of the MirrorNet illustrates its potential to discover from sensory data the controls of arbitrary motor-plants.  ( 2 min )
    Isometric MT: Neural Machine Translation for Automatic Dubbing. (arXiv:2112.08682v3 [cs.CL] UPDATED)
    Automatic dubbing (AD) is among the machine translation (MT) use cases where translations should match a given length to allow for synchronicity between source and target speech. For neural MT, generating translations of length close to the source length (e.g. within +-10% in character count), while preserving quality is a challenging task. Controlling MT output length comes at a cost to translation quality, which is usually mitigated with a two step approach of generating N-best hypotheses and then re-ranking based on length and quality. This work introduces a self-learning approach that allows a transformer model to directly learn to generate outputs that closely match the source length, in short Isometric MT. In particular, our approach does not require to generate multiple hypotheses nor any auxiliary ranking function. We report results on four language pairs (English - French, Italian, German, Spanish) with a publicly available benchmark. Automatic and manual evaluations show that our method for Isometric MT outperforms more complex approaches proposed in the literature.  ( 2 min )
    Quantitative Performance Assessment of CNN Units via Topological Entropy Calculation. (arXiv:2103.09716v3 [cs.CV] UPDATED)
    Identifying the status of individual network units is critical for understanding the mechanism of convolutional neural networks (CNNs). However, it is still challenging to reliably give a general indication of unit status, especially for units in different network models. To this end, we propose a novel method for quantitatively clarifying the status of single unit in CNN using algebraic topological tools. Unit status is indicated via the calculation of a defined topological-based entropy, called feature entropy, which measures the degree of chaos of the global spatial pattern hidden in the unit for a category. In this way, feature entropy could provide an accurate indication of status for units in different networks with diverse situations like weight-rescaling operation. Further, we show that feature entropy decreases as the layer goes deeper and shares almost simultaneous trend with loss during training. We show that by investigating the feature entropy of units on only training data, it could give discrimination between networks with different generalization ability from the view of the effectiveness of feature representations.  ( 2 min )
    Variational Marginal Particle Filters. (arXiv:2109.15134v2 [stat.ML] UPDATED)
    Variational inference for state space models (SSMs) is known to be hard in general. Recent works focus on deriving variational objectives for SSMs from unbiased sequential Monte Carlo estimators. We reveal that the marginal particle filter is obtained from sequential Monte Carlo by applying Rao-Blackwellization operations, which sacrifices the trajectory information for reduced variance and differentiability. We propose the variational marginal particle filter (VMPF), which is a differentiable and reparameterizable variational filtering objective for SSMs based on an unbiased estimator. We find that VMPF with biased gradients gives tighter bounds than previous objectives, and the unbiased reparameterization gradients are sometimes beneficial.  ( 2 min )
    FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things. (arXiv:1911.03314v3 [cs.LG] UPDATED)
    The growing number of low-power smart devices in the Internet of Things is coupled with the concept of "Edge Computing", that is moving some of the intelligence, especially machine learning, towards the edge of the network. Enabling machine learning algorithms to run on resource-constrained hardware, typically on low-power smart devices, is challenging in terms of hardware (optimized and energy-efficient integrated circuits), algorithmic and firmware implementations. This paper presents FANN-on-MCU, an open-source toolkit built upon the Fast Artificial Neural Network (FANN) library to run lightweight and energy-efficient neural networks on microcontrollers based on both the ARM Cortex-M series and the novel RISC-V-based Parallel Ultra-Low-Power (PULP) platform. The toolkit takes multi-layer perceptrons trained with FANN and generates code targeted at execution on low-power microcontrollers either with a floating-point unit (i.e., ARM Cortex-M4F and M7F) or without (i.e., ARM Cortex M0-M3 or PULP-based processors). This paper also provides an architectural performance evaluation of neural networks on the most popular ARM Cortex-M family and the parallel RISC-V processor called Mr. Wolf. The evaluation includes experimental results for three different applications using a self-sustainable wearable multi-sensor bracelet. Experimental results show a measured latency in the order of only a few microseconds and a power consumption of few milliwatts while keeping the memory requirements below the limitations of the targeted microcontrollers. In particular, the parallel implementation on the octa-core RISC-V platform reaches a speedup of 22x and a 69% reduction in energy consumption with respect to a single-core implementation on Cortex-M4 for continuous real-time classification.  ( 2 min )
    Sublinear Time Approximation of Text Similarity Matrices. (arXiv:2112.09631v2 [cs.LG] UPDATED)
    We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for $n$ data points requires $\Omega(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nystr\"{o}m method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix, producing a rank-$s$ approximation with just $O(ns)$ similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices in the downstream tasks of document classification, sentence similarity, and cross-document coreference.  ( 2 min )
    Learning to be Fair: A Consequentialist Approach to Equitable Decision-Making. (arXiv:2109.08792v2 [cs.LG] UPDATED)
    In the dominant paradigm for designing equitable machine learning systems, one works to ensure that model predictions satisfy various fairness criteria, such as parity in error rates across race, gender, and other legally protected traits. That approach, however, typically ignores the downstream decisions and outcomes that predictions affect, and, as a result, can induce unexpected harms. Here we present an alternative framework for fairness that directly anticipates the consequences of decisions. Stakeholders first specify preferences over the possible outcomes of an algorithmically informed decision-making process. For example, lenders may prefer extending credit to those most likely to repay a loan, while also preferring similar lending rates across neighborhoods. One then searches the space of decision policies to maximize the specified utility. We develop and describe a method for efficiently learning these optimal policies from data for a large family of expressive utility functions, facilitating a more holistic approach to equitable decision-making.  ( 2 min )
    The Representation Jensen-R\'enyi Divergence. (arXiv:2112.01583v3 [cs.LG] UPDATED)
    We introduce a divergence measure between data distributions based on operators in reproducing kernel Hilbert spaces defined by kernels. The empirical estimator of the divergence is computed using the eigenvalues of positive definite Gram matrices that are obtained by evaluating the kernel over pairs of data points. The new measure shares similar properties to Jensen-Shannon divergence. Convergence of the proposed estimators follows from concentration results based on the difference between the ordered spectrum of the Gram matrices and the integral operators associated with the population quantities. The proposed measure of divergence avoids the estimation of the probability distribution underlying the data. Numerical experiments involving comparing distributions and applications to sampling unbalanced data for classification show that the proposed divergence can achieve state of the art results.  ( 2 min )
    Short-Term Stock Price-Trend Prediction Using Meta-Learning. (arXiv:2105.13599v2 [cs.LG] UPDATED)
    Although conventional machine learning algorithms have been widely adopted for stock-price predictions in recent years, the massive volume of specific labeled data required are not always available. In contrast, meta-learning technology uses relatively small amounts of training data, called fast learners. Such methods are beneficial under conditions of limited data availability, which often obtain for trend prediction based on time-series data limited by sparse information. In this study, we consider short-term stock price prediction using a meta-learning framework with several convolutional neural networks, including the temporal convolution network, fully convolutional network, and residual neural network. We propose a sliding time horizon to label stocks according to their predicted price trends, referred to as called slope-detection labeling, using prediction labels including "rise plus," "rise," "fall," and "fall plus". The effectiveness of the proposed meta-learning framework was evaluated by application to the S&P500. The experimental results show that the inclusion of the proposed meta-learning framework significantly improved both regular and balanced prediction accuracy and profitability.  ( 2 min )
    A manifold learning approach for gesture recognition from micro-Doppler radar measurements. (arXiv:2110.01670v2 [cs.LG] UPDATED)
    A recent paper (Neural Networks, {\bf 132} (2020), 253-268) introduces a straightforward and simple kernel based approximation for manifold learning that does not require the knowledge of anything about the manifold, except for its dimension. In this paper, we examine how the pointwise error in approximation using least squares optimization based on similarly localized kernels depends upon the data characteristics and deteriorates as one goes away from the training data. The theory is presented with an abstract localized kernel, which can utilize any prior knowledge about the data being located on an unknown sub-manifold of a known manifold. We demonstrate the performance of our approach using a publicly available micro-Doppler data set, and investigate the use of different preprocessing measures, kernels, and manifold dimensions. Specifically, it is shown that the localized kernel introduced in the above mentioned paper when used with PCA components leads to a near-competitive performance to deep neural networks, and offers significant improvements in training speed and memory requirements. To demonstrate the fact that our methods are agnostic to the domain knowledge, we examine the classification problem in a simple video data set.  ( 2 min )
    Hierarchical Reinforcement Learning Framework for Stochastic Spaceflight Campaign Design. (arXiv:2103.08981v2 [cs.LG] UPDATED)
    This paper develops a hierarchical reinforcement learning architecture for multimission spaceflight campaign design under uncertainty, including vehicle design, infrastructure deployment planning, and space transportation scheduling. This problem involves a high-dimensional design space and is challenging especially with uncertainty present. To tackle this challenge, the developed framework has a hierarchical structure with reinforcement learning and network-based mixed-integer linear programming (MILP), where the former optimizes campaign-level decisions (e.g., design of the vehicle used throughout the campaign, destination demand assigned to each mission in the campaign), whereas the latter optimizes the detailed mission-level decisions (e.g., when to launch what from where to where). The framework is applied to a set of human lunar exploration campaign scenarios with uncertain in situ resource utilization performance as a case study. The main value of this work is its integration of the rapidly growing reinforcement learning research and the existing MILP-based space logistics methods through a hierarchical framework to handle the otherwise intractable complexity of space mission design under uncertainty. This unique framework is expected to be a critical steppingstone for the emerging research direction of artificial intelligence for space mission design.  ( 2 min )
    Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks. (arXiv:2110.06525v2 [cs.SD] UPDATED)
    A central task of a Disc Jockey (DJ) is to create a mixset of mu-sic with seamless transitions between adjacent tracks. In this paper, we explore a data-driven approach that uses a generative adversarial network to create the song transition by learning from real-world DJ mixes. In particular, the generator of the model uses two differentiable digital signal processing components, an equalizer (EQ) and a fader, to mix two tracks selected by a data generation pipeline. The generator has to set the parameters of the EQs and fader in such away that the resulting mix resembles real mixes created by humanDJ, as judged by the discriminator counterpart. Result of a listening test shows that the model can achieve competitive results compared with a number of baselines.  ( 2 min )
    Robust Learning in Heterogeneous Contexts. (arXiv:2105.08532v3 [stat.ML] UPDATED)
    We consider the problem of learning from training data obtained in different contexts, where the underlying context distribution is unknown and is estimated empirically. We develop a robust method that takes into account the uncertainty of the context distribution. Unlike the conventional and overly conservative minimax approach, we focus on excess risks and construct distribution sets with statistical coverage to achieve an appropriate trade-off between performance and robustness. The proposed method is computationally scalable and shown to interpolate between empirical risk minimization and minimax regret objectives. Using both real and synthetic data, we demonstrate its ability to provide robustness in worst-case scenarios without harming performance in the nominal scenario.  ( 2 min )
    Sequential Multivariate Change Detection with Calibrated and Memoryless False Detection Rates. (arXiv:2108.00883v2 [stat.ME] UPDATED)
    Responding appropriately to the detections of a sequential change detector requires knowledge of the rate at which false positives occur in the absence of change. Setting detection thresholds to achieve a desired false positive rate is challenging. Existing works resort to setting time-invariant thresholds that focus on the expected runtime of the detector in the absence of change, either bounding it loosely from below or targeting it directly but with asymptotic arguments that we show cause significant miscalibration in practice. We present a simulation-based approach to setting time-varying thresholds that allows a desired expected runtime to be accurately targeted whilst additionally keeping the false positive rate constant across time steps. Whilst the approach to threshold setting is metric agnostic, we show how the cost of using the popular quadratic time MMD estimator can be reduced from $O(N^2B)$ to $O(N^2+NB)$ during configuration and from $O(N^2)$ to $O(N)$ during operation, where $N$ and $B$ are the numbers of reference and bootstrap samples respectively.  ( 2 min )
    Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding. (arXiv:2104.07253v2 [cs.CL] UPDATED)
    Most End-to-End (E2E) SLU networks leverage the pre-trained ASR networks but still lack the capability to understand the semantics of utterances, crucial for the SLU task. To solve this, recently proposed studies use pre-trained NLU networks. However, it is not trivial to fully utilize both pre-trained networks; many solutions were proposed, such as Knowledge Distillation, cross-modal shared embedding, and network integration with Interface. We propose a simple and robust integration method for the E2E SLU network with novel Interface, Continuous Token Interface (CTI), the junctional representation of the ASR and NLU networks when both networks are pre-trained with the same vocabulary. Because the only difference is the noise level, we directly feed the ASR network's output to the NLU network. Thus, we can train our SLU network in an E2E manner without additional modules, such as Gumbel-Softmax. We evaluate our model using SLURP, a challenging SLU dataset and achieve state-of-the-art scores on both intent classification and slot filling tasks. We also verify the NLU network, pre-trained with Masked Language Model, can utilize a noisy textual representation of CTI. Moreover, we show our model can be trained with multi-task learning from heterogeneous data even after integration with CTI.  ( 2 min )
    SVSNet: An End-to-end Speaker Voice Similarity Assessment Model. (arXiv:2107.09392v2 [eess.AS] UPDATED)
    Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between converted speech and natural speech for voice conversion tasks. Unlike most neural evaluation metrics that use hand-crafted features, SVSNet directly takes the raw waveform as input to more completely utilize speech information for prediction. SVSNet consists of encoder, co-attention, distance calculation, and prediction modules and is trained in an end-to-end manner. The experimental results on the Voice Conversion Challenge 2018 and 2020 (VCC2018 and VCC2020) datasets show that SVSNet outperforms well-known baseline systems in the assessment of speaker similarity at the utterance and system levels.  ( 2 min )
    General Cyclical Training of Neural Networks. (arXiv:2202.08835v1 [cs.LG])
    This paper describes the principle of "General Cyclical Training" in machine learning, where training starts and ends with "easy training" and the "hard training" happens during the middle epochs. We propose several manifestations for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning. In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as three examples of this principle) are beneficial in the test accuracy performance of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology. In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks. In the spirit of reproducibility, the code used in our experiments is available at \url{https://github.com/lnsmith54/CFL}.  ( 2 min )
    Learning and Evaluating Graph Neural Network Explanations based on Counterfactual and Factual Reasoning. (arXiv:2202.08816v1 [cs.IR])
    Structural data well exists in Web applications, such as social networks in social media, citation networks in academic websites, and threads data in online forums. Due to the complex topology, it is difficult to process and make use of the rich information within such data. Graph Neural Networks (GNNs) have shown great advantages on learning representations for structural data. However, the non-transparency of the deep learning models makes it non-trivial to explain and interpret the predictions made by GNNs. Meanwhile, it is also a big challenge to evaluate the GNN explanations, since in many cases, the ground-truth explanations are unavailable. In this paper, we take insights of Counterfactual and Factual (CF^2) reasoning from causal inference theory, to solve both the learning and evaluation problems in explainable GNNs. For generating explanations, we propose a model-agnostic framework by formulating an optimization problem based on both of the two casual perspectives. This distinguishes CF^2 from previous explainable GNNs that only consider one of them. Another contribution of the work is the evaluation of GNN explanations. For quantitatively evaluating the generated explanations without the requirement of ground-truth, we design metrics based on Counterfactual and Factual reasoning to evaluate the necessity and sufficiency of the explanations. Experiments show that no matter ground-truth explanations are available or not, CF^2 generates better explanations than previous state-of-the-art methods on real-world datasets. Moreover, the statistic analysis justifies the correlation between the performance on ground-truth evaluation and our proposed metrics.  ( 2 min )
    A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments. (arXiv:2202.08325v1 [cs.LG])
    Data-Augmentation (DA) is known to improve performance across tasks and datasets. We propose a method to theoretically analyze the effect of DA and study questions such as: how many augmented samples are needed to correctly estimate the information encoded by that DA? How does the augmentation policy impact the final parameters of a model? We derive several quantities in close-form, such as the expectation and variance of an image, loss, and model's output under a given DA distribution. Those derivations open new avenues to quantify the benefits and limitations of DA. For example, we show that common DAs require tens of thousands of samples for the loss at hand to be correctly estimated and for the model training to converge. We show that for a training loss to be stable under DA sampling, the model's saliency map (gradient of the loss with respect to the model's input) must align with the smallest eigenvector of the sample variance under the considered DA augmentation, hinting at a possible explanation on why models tend to shift their focus from edges to textures.  ( 2 min )
    Should I send this notification? Optimizing push notifications decision making by modeling the future. (arXiv:2202.08812v1 [cs.IR])
    Most recommender systems are myopic, that is they optimize based on the immediate response of the user. This may be misaligned with the true objective, such as creating long term user satisfaction. In this work we focus on mobile push notifications, where the long term effects of recommender system decisions can be particularly strong. For example, sending too many or irrelevant notifications may annoy a user and cause them to disable notifications. However, a myopic system will always choose to send a notification since negative effects occur in the future. This is typically mitigated using heuristics. However, heuristics can be hard to reason about or improve, require retuning each time the system is changed, and may be suboptimal. To counter these drawbacks, there is significant interest in recommender systems that optimize directly for long-term value (LTV). Here, we describe a method for maximising LTV by using model-based reinforcement learning (RL) to make decisions about whether to send push notifications. We model the effects of sending a notification on the user's future behavior. Much of the prior work applying RL to maximise LTV in recommender systems has focused on session-based optimization, while the time horizon for notification decision making in this work extends over several days. We test this approach in an A/B test on a major social network. We show that by optimizing decisions about push notifications we are able to send less notifications and obtain a higher open rate than the baseline system, while generating the same level of user engagement on the platform as the existing, heuristic-based, system.  ( 2 min )
    CoFED: Cross-silo Heterogeneous Federated Multi-task Learning via Co-training. (arXiv:2202.08603v1 [cs.LG])
    Federated Learning (FL) is a machine learning technique that enables participants to train high-quality models collaboratively without exchanging their private data. Participants in cross-silo FL settings are independent organizations with different task needs, and they are concerned not only with data privacy, but also with training independently their unique models due to intellectual property. Most existing FL schemes are incapability for the above scenarios. In this paper, we propose a communication-efficient FL scheme, CoFED, based on pseudo-labeling unlabeled data like co-training. To the best of our knowledge, it is the first FL scheme compatible with heterogeneous tasks, heterogeneous models, and heterogeneous training algorithms simultaneously. Experimental results show that CoFED achieves better performance with a lower communication cost. Especially for the non-IID settings and heterogeneous models, the proposed method improves the performance by 35%.  ( 2 min )
    Real-Time Neural Voice Camouflage. (arXiv:2112.07076v2 [cs.SD] UPDATED)
    Automatic speech recognition systems have created exciting possibilities for applications, however they also enable opportunities for systematic eavesdropping. We propose a method to camouflage a person's voice over-the-air from these systems without inconveniencing the conversation between people in the room. Standard adversarial attacks are not effective in real-time streaming situations because the characteristics of the signal will have changed by the time the attack is executed. We introduce predictive attacks, which achieve real-time performance by forecasting the attack that will be the most effective in the future. Under real-time constraints, our method jams the established speech recognition system DeepSpeech 3.9x more than baselines as measured through word error rate, and 6.6x more as measured through character error rate. We furthermore demonstrate our approach is practically effective in realistic environments over physical distances.  ( 2 min )
    Multi-stage Ensemble Model for Cross-market Recommendation. (arXiv:2202.08824v1 [cs.IR])
    This paper describes the solution of our team PolimiRank for the WSDM Cup 2022 on cross-market recommendation. The goal of the competition is to effectively exploit the information extracted from different markets to improve the ranking accuracy of recommendations on two target markets. Our model consists in a multi-stage approach based on the combination of data belonging to different markets. In the first stage, state-of-the-art recommenders are used to predict scores for user-item couples, which are ensembled in the following 2 stages, employing a simple linear combination and more powerful Gradient Boosting Decision Tree techniques. Our team ranked 4th in the final leaderboard.  ( 2 min )
    Learning fair representation with a parametric integral probability metric. (arXiv:2202.02943v2 [stat.ML] UPDATED)
    As they have a vital effect on social decision-making, AI algorithms should be not only accurate but also fair. Among various algorithms for fairness AI, learning fair representation (LFR), whose goal is to find a fair representation with respect to sensitive variables such as gender and race, has received much attention. For LFR, the adversarial training scheme is popularly employed as is done in the generative adversarial network type algorithms. The choice of a discriminator, however, is done heuristically without justification. In this paper, we propose a new adversarial training scheme for LFR, where the integral probability metric (IPM) with a specific parametric family of discriminators is used. The most notable result of the proposed LFR algorithm is its theoretical guarantee about the fairness of the final prediction model, which has not been considered yet. That is, we derive theoretical relations between the fairness of representation and the fairness of the prediction model built on the top of the representation (i.e., using the representation as the input). Moreover, by numerical experiments, we show that our proposed LFR algorithm is computationally lighter and more stable, and the final prediction model is competitive or superior to other LFR algorithms using more complex discriminators.  ( 2 min )
    Adiabatic Quantum Computing for Multi Object Tracking. (arXiv:2202.08837v1 [cs.CV])
    Multi-Object Tracking (MOT) is most often approached in the tracking-by-detection paradigm, where object detections are associated through time. The association step naturally leads to discrete optimization problems. As these optimization problems are often NP-hard, they can only be solved exactly for small instances on current hardware. Adiabatic quantum computing (AQC) offers a solution for this, as it has the potential to provide a considerable speedup on a range of NP-hard optimization problems in the near future. However, current MOT formulations are unsuitable for quantum computing due to their scaling properties. In this work, we therefore propose the first MOT formulation designed to be solved with AQC. We employ an Ising model that represents the quantum mechanical system implemented on the AQC. We show that our approach is competitive compared with state-of-the-art optimization-based approaches, even when using of-the-shelf integer programming solvers. Finally, we demonstrate that our MOT problem is already solvable on the current generation of real quantum computers for small examples, and analyze the properties of the measured solutions.  ( 2 min )
    End-to-End Training of Both Translation Models in the Back-Translation Framework. (arXiv:2202.08465v1 [cs.CL])
    Semi-supervised learning algorithms in neural machine translation (NMT) have significantly improved translation quality compared to the supervised learning algorithms by using additional monolingual corpora. Among them, back-translation is a theoretically well-structured and cutting-edge method. Given two pre-trained NMT models between source and target languages, one translates a monolingual sentence as a latent sentence, and the other reconstructs the monolingual input sentence given the latent sentence. Therefore, previous works tried to apply the variational auto-encoder's (VAE) training framework to the back-translation framework. However, the discrete property of the latent sentence made it impossible to use backpropagation in the framework. This paper proposes a categorical reparameterization trick that generates a differentiable sentence, with which we practically implement the VAE's training framework for the back-translation and train it by end-to-end backpropagation. In addition, we propose several regularization techniques that are especially advantageous to this framework. In our experiments, we demonstrate that our method makes backpropagation available through the latent sentences and improves the BLEU scores on the datasets of the WMT18 translation task.  ( 2 min )
    Hurricane Forecasting: A Novel Multimodal Machine Learning Framework. (arXiv:2011.06125v3 [cs.LG] UPDATED)
    This paper describes a novel machine learning (ML) framework for tropical cyclone intensity and track forecasting, combining multiple ML techniques and utilizing diverse data sources. Our multimodal framework, called Hurricast, efficiently combines spatial-temporal data with statistical data by extracting features with deep-learning encoder-decoder architectures and predicting with gradient-boosted trees. We evaluate our models in the North Atlantic and Eastern Pacific basins on 2016-2019 for 24-hour lead time track and intensity forecasts and show they achieve comparable mean average error and skill to current operational forecast models while computing in seconds. Furthermore, the inclusion of Hurricast into an operational forecast consensus model could improve over the National Hurricane Center's official forecast, thus highlighting the complementary properties with existing approaches. In summary, our work demonstrates that utilizing machine learning techniques to combine different data sources can lead to new opportunities in tropical cyclone forecasting.  ( 2 min )
    Measuring the Transferability of $\ell_\infty$ Attacks by the $\ell_2$ Norm. (arXiv:2102.10343v3 [cs.LG] UPDATED)
    Deep neural networks could be fooled by adversarial examples with trivial differences to original samples. To keep the difference imperceptible in human eyes, researchers bound the adversarial perturbations by the $\ell_\infty$ norm, which is now commonly served as the standard to align the strength of different attacks for a fair comparison. However, we propose that using the $\ell_\infty$ norm alone is not sufficient in measuring the attack strength, because even with a fixed $\ell_\infty$ distance, the $\ell_2$ distance also greatly affects the attack transferability between models. Through the discovery, we reach more in-depth understandings towards the attack mechanism, i.e., several existing methods attack black-box models better partly because they craft perturbations with 70\% to 130\% larger $\ell_2$ distances. Since larger perturbations naturally lead to better transferability, we thereby advocate that the strength of attacks should be simultaneously measured by both the $\ell_\infty$ and $\ell_2$ norm. Our proposal is firmly supported by extensive experiments on ImageNet dataset from 7 attacks, 4 white-box models, and 9 black-box models.  ( 2 min )
    Survey on Self-supervised Representation Learning Using Image Transformations. (arXiv:2202.08514v1 [cs.CV])
    Deep neural networks need huge amount of training data, while in real world there is a scarcity of data available for training purposes. To resolve these issues, self-supervised learning (SSL) methods are used. SSL using geometric transformations (GT) is a simple yet powerful technique used in unsupervised representation learning. Although multiple survey papers have reviewed SSL techniques, there is none that only focuses on those that use geometric transformations. Furthermore, such methods have not been covered in depth in papers where they are reviewed. Our motivation to present this work is that geometric transformations have shown to be powerful supervisory signals in unsupervised representation learning. Moreover, many such works have found tremendous success, but have not gained much attention. We present a concise survey of SSL approaches that use geometric transformations. We shortlist six representative models that use image transformations including those based on predicting and autoencoding transformations. We review their architecture as well as learning methodologies. We also compare the performance of these models in the object recognition task on CIFAR-10 and ImageNet datasets. Our analysis indicates the AETv2 performs the best in most settings. Rotation with feature decoupling also performed well in some settings. We then derive insights from the observed results. Finally, we conclude with a summary of the results and insights as well as highlighting open problems to be addressed and indicating various future directions.  ( 2 min )
    Augment with Care: Contrastive Learning for the Boolean Satisfiability Problem. (arXiv:2202.08396v1 [cs.LG])
    Supervised learning can improve the design of state-of-the-art solvers for combinatorial problems, but labelling large numbers of combinatorial instances is often impractical due to exponential worst-case complexity. Inspired by the recent success of contrastive pre-training for images, we conduct a scientific study of the effect of augmentation design on contrastive pre-training for the Boolean satisfiability problem. While typical graph contrastive pre-training uses label-agnostic augmentations, our key insight is that many combinatorial problems have well-studied invariances, which allow for the design of label-preserving augmentations. We find that label-preserving augmentations are critical for the success of contrastive pre-training. We show that our representations are able to achieve comparable test accuracy to fully-supervised learning while using only 1% of the labels. We also demonstrate that our representations are more transferable to larger problems from unseen domains.  ( 2 min )
    Retrieval-Augmented Reinforcement Learning. (arXiv:2202.08417v1 [cs.LG])
    Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.  ( 2 min )
    The Exact Class of Graph Functions Generated by Graph Neural Networks. (arXiv:2202.08833v1 [cs.LG])
    Given a graph function, defined on an arbitrary set of edge weights and node features, does there exist a Graph Neural Network (GNN) whose output is identical to the graph function? In this paper, we fully answer this question and characterize the class of graph problems that can be represented by GNNs. We identify an algebraic condition, in terms of the permutation of edge weights and node features, which proves to be necessary and sufficient for a graph problem to lie within the reach of GNNs. Moreover, we show that this condition can be efficiently verified by checking quadratically many constraints. Note that our refined characterization on the expressive power of GNNs are orthogonal to those theoretical results showing equivalence between GNNs and Weisfeiler-Lehman graph isomorphism heuristic. For instance, our characterization implies that many natural graph problems, such as min-cut value, max-flow value, and max-clique size, can be represented by a GNN. In contrast, and rather surprisingly, there exist very simple graphs for which no GNN can correctly find the length of the shortest paths between all nodes. Note that finding shortest paths is one of the most classical problems in Dynamic Programming (DP). Thus, the aforementioned negative example highlights the misalignment between DP and GNN, even though (conceptually) they follow very similar iterative procedures. Finally, we support our theoretical results by experimental simulations.  ( 2 min )
    Information Theory with Kernel Methods. (arXiv:2202.08545v1 [cs.IT])
    We consider the analysis of probability distributions through their associated covariance operators from reproducing kernel Hilbert spaces. We show that the von Neumann entropy and relative entropy of these operators are intimately related to the usual notions of Shannon entropy and relative entropy, and share many of their properties. They come together with efficient estimation algorithms from various oracles on the probability distributions. We also consider product spaces and show that for tensor product kernels, we can define notions of mutual information and joint entropies, which can then characterize independence perfectly, but only partially conditional independence. We finally show how these new notions of relative entropy lead to new upper-bounds on log partition functions, that can be used together with convex optimization within variational inference methods, providing a new family of probabilistic inference methods.  ( 2 min )
    Global Convergence of Sub-gradient Method for Robust Matrix Recovery: Small Initialization, Noisy Measurements, and Over-parameterization. (arXiv:2202.08788v1 [cs.LG])
    In this work, we study the performance of sub-gradient method (SubGM) on a natural nonconvex and nonsmooth formulation of low-rank matrix recovery with $\ell_1$-loss, where the goal is to recover a low-rank matrix from a limited number of measurements, a subset of which may be grossly corrupted with noise. We study a scenario where the rank of the true solution is unknown and over-estimated instead. The over-estimation of the rank gives rise to an over-parameterized model in which there are more degrees of freedom than needed. Such over-parameterization may lead to overfitting, or adversely affect the performance of the algorithm. We prove that a simple SubGM with small initialization is agnostic to both over-parameterization and noise in the measurements. In particular, we show that small initialization nullifies the effect of over-parameterization on the performance of SubGM, leading to an exponential improvement in its convergence rate. Moreover, we provide the first unifying framework for analyzing the behavior of SubGM under both outlier and Gaussian noise models, showing that SubGM converges to the true solution, even under arbitrarily large and arbitrarily dense noise values, and--perhaps surprisingly--even if the globally optimal solutions do not correspond to the ground truth. At the core of our results is a robust variant of restricted isometry property, called Sign-RIP, which controls the deviation of the sub-differential of the $\ell_1$-loss from that of an ideal, expected loss. As a byproduct of our results, we consider a subclass of robust low-rank matrix recovery with Gaussian measurements, and show that the number of required samples to guarantee the global convergence of SubGM is independent of the over-parameterized rank.  ( 2 min )
    Full-Span Log-Linear Model and Fast Learning Algorithm. (arXiv:2202.08472v1 [cs.LG])
    The full-span log-linear(FSLL) model introduced in this paper is considered an $n$-th order Boltzmann machine, where $n$ is the number of all variables in the target system. Let $X=(X_0,...,X_{n-1})$ be finite discrete random variables that can take $|X|=|X_0|...|X_{n-1}|$ different values. The FSLL model has $|X|-1$ parameters and can represent arbitrary positive distributions of $X$. The FSLL model is a "highest-order" Boltzmann machine; nevertheless, we can compute the dual parameters of the model distribution, which plays important roles in exponential families, in $O(|X|\log|X|)$ time. Furthermore, using properties of the dual parameters of the FSLL model, we can construct an efficient learning algorithm. The FSLL model is limited to small probabilistic models up to $|X|\approx2^{25}$; however, in this problem domain, the FSLL model flexibly fits various true distributions underlying the training data without any hyperparameter tuning. The experiments presented that the FSLL successfully learned six training datasets such that $|X|=2^{20}$ within one minute with a laptop PC.
    A Collection and Categorization of Open-Source Wind and Wind Power Datasets. (arXiv:2202.08524v1 [cs.LG])
    Wind power and other forms of renewable energy sources play an ever more important role in the energy supply of today's power grids. Forecasting renewable energy sources has therefore become essential in balancing the power grid. While a lot of focus is placed on new forecasting methods, little attention is given on how to compare, reproduce and transfer the methods to other use cases and data. One reason for this lack of attention is the limited availability of open-source datasets, as many currently used datasets are non-disclosed and make reproducibility of research impossible. This unavailability of open-source datasets is especially prevalent in commercially interesting fields such as wind power forecasting. However, with this paper we want to enable researchers to compare their methods on publicly available datasets by providing the, to our knowledge, largest up-to-date overview of existing open-source wind power datasets, and a categorization into different groups of datasets that can be used for wind power forecasting. We show that there are publicly available datasets sufficient for wind power forecasting tasks and discuss the different data groups properties to enable researchers to choose appropriate open-source datasets and compare their methods on them.
    DeepHybrid: Deep Learning on Automotive Radar Spectra and Reflections for Object Classification. (arXiv:2202.08519v1 [cs.LG])
    Automated vehicles need to detect and classify objects and traffic participants accurately. Reliable object classification using automotive radar sensors has proved to be challenging. We propose a method that combines classical radar signal processing and Deep Learning algorithms. The range-azimuth information on the radar reflection level is used to extract a sparse region of interest from the range-Doppler spectrum. This is used as input to a neural network (NN) that classifies different types of stationary and moving objects. We present a hybrid model (DeepHybrid) that receives both radar spectra and reflection attributes as inputs, e.g. radar cross-section. Experiments show that this improves the classification performance compared to models using only spectra. Moreover, a neural architecture search (NAS) algorithm is applied to find a resource-efficient and high-performing NN. NAS yields an almost one order of magnitude smaller NN than the manually-designed one while preserving the accuracy. The proposed method can be used for example to improve automatic emergency braking or collision avoidance systems.
    Hyperspherically Regularized Networks for BYOL Improves Feature Uniformity and Separability. (arXiv:2105.00925v3 [cs.LG] UPDATED)
    Bootstrap Your Own Latent (BYOL) introduced an approach to self-supervised learning avoiding the contrastive paradigm and subsequently removing the computational burden of negative sampling associated with such methods. However, we empirically find that the image representations produced under the BYOL's self-distillation paradigm are poorly distributed in representation space compared to contrastive methods. This work empirically demonstrates that feature diversity enforced by contrastive losses is beneficial to image representation uniformity when employed in BYOL, and as such, provides greater inter-class representation separability. Additionally, we explore and advocate the use of regularization methods, specifically the layer-wise minimization of hyperspherical energy (i.e. maximization of entropy) of network weights to encourage representation uniformity. We show that directly optimizing a measure of uniformity alongside the standard loss, or regularizing the networks of the BYOL architecture to minimize the hyperspherical energy of neurons can produce more uniformly distributed and therefore better performing representations for downstream tasks.  ( 2 min )
    Data-SUITE: Data-centric identification of in-distribution incongruous examples. (arXiv:2202.08836v1 [cs.LG])
    Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle an understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a data-centric framework to identify these regions, independent of a task-specific model. DATA-SUITE leverages copula modeling, representation learning, and conformal prediction to build feature-wise confidence interval estimators based on a set of training instances. These estimators can be used to evaluate the congruence of test instances with respect to the training set, to answer two practically useful questions: (1) which test instances will be reliably predicted by a model trained with the training instances? and (2) can we identify incongruous regions of the feature space so that data owners understand the data's limitations or guide future data collection? We empirically validate Data-SUITE's performance and coverage guarantees and demonstrate on cross-site medical data, biased data, and data with concept drift, that Data-SUITE best identifies ID regions where a downstream model may be reliable (independent of said model). We also illustrate how these identified regions can provide insights into datasets and highlight their limitations.
    Gradients without Backpropagation. (arXiv:2202.08587v1 [cs.LG])
    Using backpropagation to compute gradients of objective functions for optimization has remained a mainstay of machine learning. Backpropagation, or reverse-mode differentiation, is a special case within the general family of automatic differentiation algorithms that also includes the forward mode. We present a method to compute gradients based solely on the directional derivative that one can compute exactly and efficiently via the forward mode. We call this formulation the forward gradient, an unbiased estimate of the gradient that can be evaluated in a single forward run of the function, entirely eliminating the need for backpropagation in gradient descent. We demonstrate forward gradient descent in a range of problems, showing substantial savings in computation and enabling training up to twice as fast in some cases.
    Exploiting full Resolution Feature Context for Liver Tumor and Vessel Segmentation via Fusion Encoder: Application to Liver Tumor and Vessel 3D reconstruction. (arXiv:2111.13299v2 [eess.IV] UPDATED)
    Liver cancer is one of the most common malignant diseases in the world. Segmentation and labeling of liver tumors and blood vessels in CT images can provide convenience for doctors in liver tumor diagnosis and surgical intervention. In the past decades, automatic CT segmentation methods based on deep learning have received widespread attention in the medical field. Many state-of-the-art segmentation algorithms appeared during this period. Yet, most of the existing segmentation methods only care about the local feature context and have a perception defect in the global relevance of medical images, which significantly affects the segmentation effect of liver tumors and blood vessels. We introduce a multi-scale feature context fusion network called TransFusionNet based on Transformer and SEBottleNet. This network can accurately detect and identify the details of the region of interest of the liver vessel, meanwhile it can improve the recognition of morphologic margins of liver tumors by exploiting the global information of CT images. Experiments show that TransFusionNet is better than the state-of-the-art method on both the public dataset LITS and 3Dircadb and our clinical dataset. Finally, we propose an automatic 3D reconstruction algorithm based on the trained model. The algorithm can complete the reconstruction quickly and accurately in 1 second.
    Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models. (arXiv:2201.06503v2 [cs.LG] UPDATED)
    Diffusion probabilistic models (DPMs) represent a class of powerful generative models. Despite their success, the inference of DPMs is expensive since it generally needs to iterate over thousands of timesteps. A key problem in the inference is to estimate the variance in each timestep of the reverse process. In this work, we present a surprising result that both the optimal reverse variance and the corresponding optimal KL divergence of a DPM have analytic forms w.r.t. its score function. Building upon it, we propose Analytic-DPM, a training-free inference framework that estimates the analytic forms of the variance and KL divergence using the Monte Carlo method and a pretrained score-based model. Further, to correct the potential bias caused by the score-based model, we derive both lower and upper bounds of the optimal variance and clip the estimate for a better result. Empirically, our analytic-DPM improves the log-likelihood of various DPMs, produces high-quality samples, and meanwhile enjoys a 20x to 80x speed up.
    Learning Game-Theoretic Models of Multiagent Trajectories Using Implicit Layers. (arXiv:2008.07303v6 [cs.GT] UPDATED)
    For prediction of interacting agents' trajectories, we propose an end-to-end trainable architecture that hybridizes neural nets with game-theoretic reasoning, has interpretable intermediate representations, and transfers to downstream decision making. It uses a net that reveals preferences from the agents' past joint trajectory, and a differentiable implicit layer that maps these preferences to local Nash equilibria, forming the modes of the predicted future trajectory. Additionally, it learns an equilibrium refinement concept. For tractability, we introduce a new class of continuous potential games and an equilibrium-separating partition of the action space. We provide theoretical results for explicit gradients and soundness. In experiments, we evaluate our approach on two real-world data sets, where we predict highway driver merging trajectories, and on a simple decision-making transfer task.
    Deep Feature Fusion for Mitosis Counting. (arXiv:2002.03781v3 [cs.CV] UPDATED)
    Each woman living in the United States has about 1 in 8 chance of developing invasive breast cancer. The mitotic cell count is one of the most common tests to assess the aggressiveness or grade of breast cancer. In this prognosis, histopathology images must be examined by a pathologist using high-resolution microscopes to count the cells. Unfortunately, this can be an exhaustive task with poor reproducibility, especially for non-experts. Deep learning networks have recently been adapted to medical applications which are able to automatically localize these regions of interest. However, these region-based networks lack the ability to take advantage of the segmentation features produced by a full image CNN which are often used as a sole method of detection. Therefore, the proposed method leverages Faster RCNN for object detection while fusing segmentation features generated by a UNet with RGB image features to achieve an F-score of 0.508 on the MITOS-ATYPIA 2014 mitosis counting challenge dataset, outperforming state-of-the-art methods.
    Improving Robustness of Deep Reinforcement Learning Agents: Environment Attack based on the Critic Network. (arXiv:2104.03154v2 [cs.LG] UPDATED)
    To improve policy robustness of deep reinforcement learning agents, a line of recent works focus on producing disturbances of the environment. Existing approaches of the literature to generate meaningful disturbances of the environment are adversarial reinforcement learning methods. These methods set the problem as a two-player game between the protagonist agent, which learns to perform a task in an environment, and the adversary agent, which learns to disturb the protagonist via modifications of the considered environment. Both protagonist and adversary are trained with deep reinforcement learning algorithms. Alternatively, we propose in this paper to build on gradient-based adversarial attacks, usually used for classification tasks for instance, that we apply on the critic network of the protagonist to identify efficient disturbances of the environment. Rather than learning an attacker policy, which usually reveals as very complex and unstable, we leverage the knowledge of the critic network of the protagonist, to dynamically complexify the task at each step of the learning process. We show that our method, while being faster and lighter, leads to significantly better improvements in policy robustness than existing methods of the literature.
    Point Cloud Generation with Continuous Conditioning. (arXiv:2202.08526v1 [cs.CV])
    Generative models can be used to synthesize 3D objects of high quality and diversity. However, there is typically no control over the properties of the generated object.This paper proposes a novel generative adversarial network (GAN) setup that generates 3D point cloud shapes conditioned on a continuous parameter. In an exemplary application, we use this to guide the generative process to create a 3D object with a custom-fit shape. We formulate this generation process in a multi-task setting by using the concept of auxiliary classifier GANs. Further, we propose to sample the generator label input for training from a kernel density estimation (KDE) of the dataset. Our ablations show that this leads to significant performance increase in regions with few samples. Extensive quantitative and qualitative experiments show that we gain explicit control over the object dimensions while maintaining good generation quality and diversity.
    Un-Mix: Rethinking Image Mixtures for Unsupervised Visual Representation Learning. (arXiv:2003.05438v5 [cs.CV] UPDATED)
    The recently advanced unsupervised learning approaches use the siamese-like framework to compare two "views" from the same image for learning representations. Making the two views distinctive is a core to guarantee that unsupervised methods can learn meaningful information. However, such frameworks are sometimes fragile on overfitting if the augmentations used for generating two views are not strong enough, causing the over-confident issue on the training data. This drawback hinders the model from learning subtle variance and fine-grained information. To address this, in this work we aim to involve the distance concept on label space in the unsupervised learning and let the model be aware of the soft degree of similarity between positive or negative pairs through mixing the input data space, to further work collaboratively for the input and loss spaces. Despite its conceptual simplicity, we show empirically that with the solution -- Unsupervised image mixtures (Un-Mix), we can learn subtler, more robust and generalized representations from the transformed input and corresponding new label space. Extensive experiments are conducted on CIFAR-10, CIFAR-100, STL-10, Tiny ImageNet and standard ImageNet with popular unsupervised methods SimCLR, BYOL, MoCo V1&V2, SwAV, etc. Our proposed image mixture and label assignment strategy can obtain consistent improvement by 1~3% following exactly the same hyperparameters and training procedures of the base methods. Code is publicly available at https://github.com/szq0214/Un-Mix.
    Delay-adaptive step-sizes for asynchronous learning. (arXiv:2202.08550v1 [cs.LG])
    In scalable machine learning systems, model training is often parallelized over multiple nodes that run without tight synchronization. Most analysis results for the related asynchronous algorithms use an upper bound on the information delays in the system to determine learning rates. Not only are such bounds hard to obtain in advance, but they also result in unnecessarily slow convergence. In this paper, we show that it is possible to use learning rates that depend on the actual time-varying delays in the system. We develop general convergence results for delay-adaptive asynchronous iterations and specialize these to proximal incremental gradient descent and block-coordinate descent algorithms. For each of these methods, we demonstrate how delays can be measured on-line, present delay-adaptive step-size policies, and illustrate their theoretical and practical advantages over the state-of-the-art.
    Data-driven Aerodynamic Analysis of Structures using Gaussian Processes. (arXiv:2103.13877v2 [physics.flu-dyn] UPDATED)
    An abundant amount of data gathered during wind tunnel testing and health monitoring of structures inspires the use of machine learning methods to replicate the wind forces. This paper presents a data-driven Gaussian Process-Nonlinear Finite Impulse Response (GP-NFIR) model of the nonlinear self-excited forces acting on structures. Constructed in a nondimensional form, the model takes the effective wind angle of attack as lagged exogenous input and outputs a probability distribution of the forces. The nonlinear input/output function is modeled by a GP regression. Consequently, the model is nonparametric, thereby circumventing to set up the function's structure a priori. The training input is designed as random harmonic motion consisting of vertical and rotational displacements. Once trained, the model can predict the aerodynamic forces for both prescribed input motion and aeroelastic analysis. The concept is first verified for a flat plate's analytical solution by predicting the self-excited forces and flutter velocity. Finally, the framework is applied to a streamlined and bluff bridge deck based on Computational Fluid Dynamics (CFD) data. The model's ability to predict nonlinear aerodynamic forces, flutter velocity, and post-flutter behavior are highlighted. Applications of the framework are foreseen in the structural analysis during the design and monitoring of slender line-like structures.
    SWIM: Selective Write-Verify for Computing-in-Memory Neural Accelerators. (arXiv:2202.08395v1 [cs.LG])
    Computing-in-Memory architectures based on non-volatile emerging memories have demonstrated great potential for deep neural network (DNN) acceleration thanks to their high energy efficiency. However, these emerging devices can suffer from significant variations during the mapping process i.e., programming weights to the devices), and if left undealt with, can cause significant accuracy degradation. The non-ideality of weight mapping can be compensated by iterative programming with a write-verify scheme, i.e., reading the conductance and rewriting if necessary. In all existing works, such a practice is applied to every single weight of a DNN as it is being mapped, which requires extensive programming time. In this work, we show that it is only necessary to select a small portion of the weights for write-verify to maintain the DNN accuracy, thus achieving significant speedup. We further introduce a second derivative based technique SWIM, which only requires a single pass of forward and backpropagation, to efficiently select the weights that need write-verify. Experimental results on various DNN architectures for different datasets show that SWIM can achieve up to 10x programming speedup compared with conventional full-blown write-verify while attaining a comparable accuracy.
    The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. (arXiv:2202.08658v1 [cs.LG])
    It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parameterizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (non-linear but regular networks) no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension $d$. Our main results characterize a hierarchical property, the "merged-staircase property", that is both necessary and nearly sufficient for learning in this setting. We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low-dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost orthogonal functions.  ( 2 min )
    Graph Self-supervised Learning with Accurate Discrepancy Learning. (arXiv:2202.02989v2 [cs.LG] UPDATED)
    Self-supervised learning of graph neural networks (GNNs) aims to learn an accurate representation of the graphs in an unsupervised manner, to obtain transferable representations of them for diverse downstream tasks. Predictive learning and contrastive learning are the two most prevalent approaches for graph self-supervised learning. However, they have their own drawbacks. While the predictive learning methods can learn the contextual relationships between neighboring nodes and edges, they cannot learn global graph-level similarities. Contrastive learning, while it can learn global graph-level similarities, its objective to maximize the similarity between two differently perturbed graphs may result in representations that cannot discriminate two similar graphs with different properties. To tackle such limitations, we propose a framework that aims to learn the exact discrepancy between the original and the perturbed graphs, coined as Discrepancy-based Self-supervised LeArning (D-SLA). Specifically, we create multiple perturbations of the given graph with varying degrees of similarity and train the model to predict whether each graph is the original graph or a perturbed one. Moreover, we further aim to accurately capture the amount of discrepancy for each perturbed graph using the graph edit distance. We validate our method on various graph-related downstream tasks, including molecular property prediction, protein function prediction, and link prediction tasks, on which our model largely outperforms relevant baselines.
    Single-shot Hyper-parameter Optimization for Federated Learning: A General Algorithm & Analysis. (arXiv:2202.08338v1 [cs.LG])
    We address the relatively unexplored problem of hyper-parameter optimization (HPO) for federated learning (FL-HPO). We introduce Federated Loss SuRface Aggregation (FLoRA), a general FL-HPO solution framework that can address use cases of tabular data and any Machine Learning (ML) model including gradient boosting training algorithms and therefore further expands the scope of FL-HPO. FLoRA enables single-shot FL-HPO: identifying a single set of good hyper-parameters that are subsequently used in a single FL training. Thus, it enables FL-HPO solutions with minimal additional communication overhead compared to FL training without HPO. We theoretically characterize the optimality gap of FL-HPO, which explicitly accounts for the heterogeneous non-IID nature of the parties' local data distributions, a dominant characteristic of FL systems. Our empirical evaluation of FLoRA for multiple ML algorithms on seven OpenML datasets demonstrates significant model accuracy improvements over the considered baseline, and robustness to increasing number of parties involved in FL-HPO training.
    Deep Reinforcement Learning Based Networked Control with Network Delays for Signal Temporal Logic Specifications. (arXiv:2108.01317v2 [eess.SY] UPDATED)
    We apply deep reinforcement learning (DRL) to design of a networked controller with network delays to complete a temporal control task that is described by a signal temporal logic (STL) formula. STL is useful to deal with a specification with a bounded time interval for a dynamical system. In general, an agent needs not only the current system state but also the past behavior of the system to determine a desired control action for satisfying the given STL formula. Additionally, we need to consider the effect of network delays for data transmissions. Thus, we propose an extended Markov decision process using past system states and control actions, which is called a $\tau d$-MDP, so that the agent can evaluate the satisfaction of the STL formula considering the network delays. Thereafter, we apply a DRL algorithm to design a networked controller using the $\tau d$-MDP. Through simulations, we also demonstrate the learning performance of the proposed algorithm.
    Optimizing Latent Space Directions For GAN-based Local Image Editing. (arXiv:2111.12583v2 [cs.CV] UPDATED)
    Generative Adversarial Network (GAN) based localized image editing can suffer from ambiguity between semantic attributes. We thus present a novel objective function to evaluate the locality of an image edit. By introducing the supervision from a pre-trained segmentation network and optimizing the objective function, our framework, called Locally Effective Latent Space Direction (LELSD), is applicable to any dataset and GAN architecture. Our method is also computationally fast and exhibits a high extent of disentanglement, which allows users to interactively perform a sequence of edits on an image. Our experiments on both GAN-generated and real images qualitatively demonstrate the high quality and advantages of our method.
    Solving Schr\"odinger Bridges via Maximum Likelihood. (arXiv:2106.02081v8 [stat.ML] UPDATED)
    The Schr\"odinger bridge problem (SBP) finds the most likely stochastic evolution between two probability distributions given a prior stochastic evolution. As well as applications in the natural sciences, problems of this kind have important applications in machine learning such as dataset alignment and hypothesis testing. Whilst the theory behind this problem is relatively mature, scalable numerical recipes to estimate the Schr\"odinger bridge remain an active area of research. We prove an equivalence between the SBP and maximum likelihood estimation enabling direct application of successful machine learning techniques. We propose a numerical procedure to estimate SBPs using Gaussian process and demonstrate the practical usage of our approach in numerical simulations and experiments.  ( 2 min )
    Computing Graph Edit Distance with Algorithms on Quantum Devices. (arXiv:2111.10183v2 [quant-ph] UPDATED)
    Distance measures provide the foundation for many popular algorithms in Machine Learning and Pattern Recognition. Different notions of distance can be used depending on the types of the data the algorithm is working on. For graph-shaped data, an important notion is the Graph Edit Distance (GED) that measures the degree of (dis)similarity between two graphs in terms of the operations needed to make them identical. As the complexity of computing GED is the same as NP-hard problems, it is reasonable to consider approximate solutions. In this paper we present a QUBO formulation of the GED problem. This allows us to implement two different approaches, namely quantum annealing and variational quantum algorithms that run on the two types of quantum hardware currently available: quantum annealer and gate-based quantum computer, respectively. Considering the current state of noisy intermediate-scale quantum computers, we base our study on proof-of-principle tests of their performance.
    Fuzzy Pooling. (arXiv:2202.08372v1 [cs.LG])
    Convolutional Neural Networks (CNNs) are artificial learning systems typically based on two operations: convolution, which implements feature extraction through filtering, and pooling, which implements dimensionality reduction. The impact of pooling in the classification performance of the CNNs has been highlighted in several previous works, and a variety of alternative pooling operators have been proposed. However, only a few of them tackle with the uncertainty that is naturally propagated from the input layer to the feature maps of the hidden layers through convolutions. In this paper we present a novel pooling operation based on (type-1) fuzzy sets to cope with the local imprecision of the feature maps, and we investigate its performance in the context of image classification. Fuzzy pooling is performed by fuzzification, aggregation and defuzzification of feature map neighborhoods. It is used for the construction of a fuzzy pooling layer that can be applied as a drop-in replacement of the current, crisp, pooling layers of CNN architectures. Several experiments using publicly available datasets show that the proposed approach can enhance the classification performance of a CNN. A comparative evaluation shows that it outperforms state-of-the-art pooling approaches.
    Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition. (arXiv:2202.08532v1 [eess.AS])
    In this work, we aim to enhance the system robustness of end-to-end automatic speech recognition (ASR) against adversarially-noisy speech examples. We focus on a rigorous and empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). The adversarial noise is only generated by closed-model optimization (e.g., evolutionary and zeroth-order estimation) without accessing gradient information of a targeted ASR model directly. We propose an advanced Bayesian neural network (BNN) based adversarial detector, which could model latent distributions against adaptive adversarial perturbation with divergence measurement. We further simulate deployment scenarios of RNN Transducer, Conformer, and wav2vec-2.0 based ASR systems with the proposed adversarial detection system. Leveraging the proposed BNN based detection system, we improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets compared to the current model enhancement methods against the adversarial speech examples.
    A Survey on Deep Reinforcement Learning-based Approaches for Adaptation and Generalization. (arXiv:2202.08444v1 [cs.LG])
    Deep Reinforcement Learning (DRL) aims to create intelligent agents that can learn to solve complex problems efficiently in a real-world environment. Typically, two learning goals: adaptation and generalization are used for baselining DRL algorithm's performance on different tasks and domains. This paper presents a survey on the recent developments in DRL-based approaches for adaptation and generalization. We begin by formulating these goals in the context of task and domain. Then we review the recent works under those approaches and discuss future research directions through which DRL algorithms' adaptability and generalizability can be enhanced and potentially make them applicable to a broad range of real-world problems.
    Generalizable Information Theoretic Causal Representation. (arXiv:2202.08388v1 [cs.LG])
    It is evidence that representation learning can improve model's performance over multiple downstream tasks in many real-world scenarios, such as image classification and recommender systems. Existing learning approaches rely on establishing the correlation (or its proxy) between features and the downstream task (labels), which typically results in a representation containing cause, effect and spurious correlated variables of the label. Its generalizability may deteriorate because of the unstability of the non-causal parts. In this paper, we propose to learn causal representation from observational data by regularizing the learning procedure with mutual information measures according to our hypothetical causal graph. The optimization involves a counterfactual loss, based on which we deduce a theoretical guarantee that the causality-inspired learning is with reduced sample complexity and better generalization ability. Extensive experiments show that the models trained on causal representations learned by our approach is robust under adversarial attacks and distribution shift.
    A hybrid 2-stage vision transformer for AI-assisted 5 class pathologic diagnosis of gastric endoscopic biopsies. (arXiv:2202.08510v1 [eess.IV])
    Gastric endoscopic screening is an effective way to decide appropriate gastric cancer (GC) treatment at an early stage, reducing GC-associated mortality rate. Although artificial intelligence (AI) has brought a great promise to assist pathologist to screen digitalized whole slide images, automatic classification systems for guiding proper GC treatment based on clinical guideline are still lacking. Here, we propose an AI system classifying 5 classes of GC histology, which can be perfectly matched to general treatment guidance. The AI system, mimicking the way pathologist understand slides through multi-scale self-attention mechanism using a 2-stage Vision Transformer, demonstrates clinical capability by achieving diagnostic sensitivity of above 85% for both internal and external cohort analysis. Furthermore, AI-assisted pathologists showed significantly improved diagnostic sensitivity by 10% within 18% saved screening time compared to human pathologists. Our AI system has a great potential for providing presumptive pathologic opinion for deciding proper treatment for early GC patients.
    Modeling Strong and Human-Like Gameplay with KL-Regularized Search. (arXiv:2112.07544v2 [cs.MA] UPDATED)
    We consider the task of building strong but human-like policies in multi-agent decision-making problems, given examples of human behavior. Imitation learning is effective at predicting human actions but may not match the strength of expert humans, while self-play learning and search techniques (e.g. AlphaZero) lead to strong performance but may produce policies that are difficult for humans to understand and coordinate with. We show in chess and Go that regularizing search based on the KL divergence from an imitation-learned policy results in higher human prediction accuracy and stronger performance than imitation learning alone. We then introduce a novel regret minimization algorithm that is regularized based on the KL divergence from an imitation-learned policy, and show that using this algorithm for search in no-press Diplomacy yields a policy that matches the human prediction accuracy of imitation learning while being substantially stronger.
    Grammar-Based Grounded Lexicon Learning. (arXiv:2202.08806v1 [cs.CL])
    We present Grammar-Based Grounded Lexicon Learning (G2L2), a lexicalist approach toward learning a compositional and grounded meaning representation of language from grounded data, such as paired images and texts. At the core of G2L2 is a collection of lexicon entries, which map each word to a tuple of a syntactic type and a neuro-symbolic semantic program. For example, the word shiny has a syntactic type of adjective; its neuro-symbolic semantic program has the symbolic form {\lambda}x. filter(x, SHINY), where the concept SHINY is associated with a neural network embedding, which will be used to classify shiny objects. Given an input sentence, G2L2 first looks up the lexicon entries associated with each token. It then derives the meaning of the sentence as an executable neuro-symbolic program by composing lexical meanings based on syntax. The recovered meaning programs can be executed on grounded inputs. To facilitate learning in an exponentially-growing compositional space, we introduce a joint parsing and expected execution algorithm, which does local marginalization over derivations to reduce the training time. We evaluate G2L2 on two domains: visual reasoning and language-driven navigation. Results show that G2L2 can generalize from small amounts of data to novel compositions of words.
    An Exact Poly-Time Membership-Queries Algorithm for Extraction a three-Layer ReLU Network. (arXiv:2105.09673v2 [cs.LG] UPDATED)
    We consider the natural problem of learning a ReLU network from queries, which was recently re-motivated by model extraction attacks. In this work we present a polynomial-time algorithm, that under mild general position assumptions, can learn a depth two ReLU network from queries. We also present a polynomial-time algorithm, that under mild general position assumptions, can learn a rich class of depth three ReLU network from queries. For instance, it can learn most networks in which the number of first layer neurons is smaller than the dimension and the number of second layer neurons. These two results substantially improve the state of the art: Until our work, polynomial-time algorithms were only shown to learn from queries depth two networks under the assumption that either the underlying distribution is Gaussian or that the weights matrix rows are linearly independent. For depth three or more, there were no known poly-time results.
    Single Trajectory Nonparametric Learning of Nonlinear Dynamics. (arXiv:2202.08311v1 [cs.LG])
    Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE). More precisely, we give nonasymptotic expected $l^2$-distance bounds between the LSE and the true regression function, where expectation is evaluated on a fresh, counterfactual, trajectory. We leverage recently developed information-theoretic methods to establish the optimality of the LSE for nonparametric hypotheses classes in terms of supremum norm metric entropy and a subgaussian parameter. Next, we relate this subgaussian parameter to the stability of the underlying process using notions from dynamical systems theory. When combined, these developments lead to rate-optimal error bounds that scale as $T^{-1/(2+q)}$ for suitably stable processes and hypothesis classes with metric entropy growth of order $\delta^{-q}$. Here, $T$ is the length of the observed trajectory, $\delta \in \mathbb{R}_+$ is the packing granularity and $q\in (0,2)$ is a complexity term. Finally, we specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS).
    Ensemble Conformalized Quantile Regression for Probabilistic Time Series Forecasting. (arXiv:2202.08756v1 [cs.LG])
    This paper presents a novel probabilistic forecasting method called ensemble conformalized quantile regression (EnCQR). EnCQR constructs distribution-free and approximately marginally valid prediction intervals (PIs), is suitable for nonstationary and heteroscedastic time series data, and can be applied on top of any forecasting model, including deep learning architectures that are trained on long data sequences. EnCQR exploits a bootstrap ensemble estimator, which enables the use of conformal predictors for time series by removing the requirement of data exchangeability. The ensemble learners are implemented as generic machine learning algorithms performing quantile regression, which allow the length of the PIs to adapt to local variability in the data. In the experiments, we predict time series characterized by a different amount of heteroscedasticity. The results demonstrate that EnCQR outperforms models based only on quantile regression or conformal prediction, and it provides sharper, more informative, and valid PIs.
    Infant Crying Detection in Real-World Environments. (arXiv:2005.07036v6 [eess.AS] UPDATED)
    Most existing cry detection models have been tested with data collected in controlled settings. Thus, the extent to which they generalize to noisy and lived environments is unclear. In this paper, we evaluate several established machine learning approaches including a model leveraging both deep spectrum and acoustic features. This model was able to recognize crying events with F1 score 0.613 (Precision: 0.672, Recall: 0.552), showing improved external validity over existing methods at cry detection in everyday real-world settings. As part of our evaluation, we collect and annotate a novel dataset of infant crying compiled from over 780 hours of labeled real-world audio data, captured via recorders worn by infants in their homes, which we make publicly available. Our findings confirm that a cry detection model trained on in-lab data underperforms when presented with real-world data (in-lab test F1: 0.656, real-world test F1: 0.236), highlighting the value of our new dataset and model.
    Improving Rating and Relevance with Point-of-Interest Recommender System. (arXiv:2202.08751v1 [cs.IR])
    The recommendation of points of interest (POIs) is essential in location-based social networks. It makes it easier for users and locations to share information. Recently, researchers tend to recommend POIs by treating them as large-scale retrieval systems that require a large amount of training data representing query-item relevance. However, gathering user feedback in retrieval systems is an expensive task. Existing POI recommender systems make recommendations based on user and item (location) interactions solely. However, there are numerous sources of feedback to consider. For example, when the user visits a POI, what is the POI is about and such. Integrating all these different types of feedback is essential when developing a POI recommender. In this paper, we propose using user and item information and auxiliary information to improve the recommendation modelling in a retrieval system. We develop a deep neural network architecture to model query-item relevance in the presence of both collaborative and content information. We also improve the quality of the learned representations of queries and items by including the contextual information from the user feedback data. The application of these learned representations to a large-scale dataset resulted in significant improvements.
    Identifiable Deep Generative Models via Sparse Decoding. (arXiv:2110.10804v2 [stat.ML] UPDATED)
    We develop the sparse VAE for unsupervised representation learning on high-dimensional data. The sparse VAE learns a set of latent factors (representations) which summarize the associations in the observed data features. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. As examples, in ratings data each movie is only described by a few genres; in text data each word is only applicable to a few topics; in genomics, each gene is active in only a few biological processes. We prove such sparse deep generative models are identifiable: with infinite data, the true model parameters can be learned. (In contrast, most deep generative models are not identifiable.) We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller heldout reconstruction error than related methods.
    Learning music audio representations via weak language supervision. (arXiv:2112.04214v2 [cs.SD] UPDATED)
    Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. To address this question, we design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track. After pre-training, we transfer the audio backbone of the model to a set of music audio classification and regression tasks. We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies and show that our pre-training method consistently achieves comparable or higher scores on all tasks and datasets considered. Our experiments also confirm that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
    Toward a Unified Framework for Debugging Concept-based Models. (arXiv:2109.11160v2 [cs.LG] UPDATED)
    In this paper, we tackle interactive debugging of "gray-box" concept-based models (CBMs). These models learn task-relevant concepts appearing in the inputs and then compute a prediction by aggregating the concept activations. Our work stems from the observation that in CBMs both the concepts and the aggregation function can be affected by different kinds of bugs, and that fixing these bugs requires different kinds of corrective supervision. To this end, we introduce a simple schema for human supervisors to identify and prioritize bugs in both components, and discuss solution strategies and open problems. We also introduce a novel loss function for debugging the aggregation step that generalizes existing strategies for aligning black-box models to CBMs by making them robust to how the concepts change during training.
    Evolutionary Action Selection for Gradient-based Policy Learning. (arXiv:2201.04286v3 [cs.NE] UPDATED)
    Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) have recently been combined to integrate the advantages of the two solutions for better policy learning. However, in existing hybrid methods, EA is used to directly train the policy network, which will lead to sample inefficiency and unpredictable impact on the policy performance. To better integrate these two approaches and avoid the drawbacks caused by the introduction of EA, we devote ourselves to devising a more efficient and reasonable method of combining EA and DRL. In this paper, we propose Evolutionary Action Selection-Twin Delayed Deep Deterministic Policy Gradient (EAS-TD3), a novel combination of EA and DRL. In EAS, we focus on optimizing the action chosen by the policy network and attempt to obtain high-quality actions to guide policy learning through an evolutionary algorithm. We conduct several experiments on challenging continuous control tasks. The result shows that EAS-TD3 shows superior performance over other state-of-art methods.
    From Geometry to Topology: Inverse Theorems for Distributed Persistence. (arXiv:2101.12288v3 [math.AT] UPDATED)
    What is the "right" topological invariant of a large point cloud X? Prior research has focused on estimating the full persistence diagram of X, a quantity that is very expensive to compute, unstable to outliers, and far from a sufficient statistic. We therefore propose that the correct invariant is not the persistence diagram of X, but rather the collection of persistence diagrams of many small subsets. This invariant, which we call "distributed persistence," is perfectly parallelizable, more stable to outliers, and has a rich inverse theory. The map from the space of point clouds (with the quasi-isometry metric) to the space of distributed persistence invariants (with the Hausdorff-Bottleneck distance) is a global quasi-isometry. This is a much stronger property than simply being injective, as it implies that the inverse of a small neighborhood is a small neighborhood, and is to our knowledge the only result of its kind in the TDA literature. Moreover, the quasi-isometry bounds depend on the size of the subsets taken, so that as the size of these subsets goes from small to large, the invariant interpolates between a purely geometric one and a topological one. Lastly, we note that our inverse results do not actually require considering all subsets of a fixed size (an enormous collection), but a relatively small collection satisfying certain covering properties that arise with high probability when randomly sampling subsets. These theoretical results are complemented by two synthetic experiments demonstrating the use of distributed persistence in practice.
    MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition. (arXiv:2202.08456v1 [eess.AS])
    We propose multi-layer perceptron (MLP)-based architectures suitable for variable length input. MLP-based architectures, recently proposed for image classification, can only be used for inputs of a fixed, pre-defined size. However, many types of data are naturally variable in length, for example, acoustic signals. We propose three approaches to extend MLP-based architectures for use with sequences of arbitrary length. The first one uses a circular convolution applied in the Fourier domain, the second applies a depthwise convolution, and the final relies on a shift operation. We evaluate the proposed architectures on an automatic speech recognition task with the Librispeech and Tedlium2 corpora. The best proposed MLP-based architectures improves WER by 1.0 / 0.9%, 0.9 / 0.5% on Librispeech dev-clean/dev-other, test-clean/test-other set, and 0.8 / 1.1% on Tedlium2 dev/test set using 86.4% the size of self-attention-based architecture.
    Revisiting Over-smoothing in BERT from the Perspective of Graph. (arXiv:2202.08625v1 [cs.LG])
    Recently over-smoothing phenomenon of Transformer-based models is observed in both vision and language fields. However, no existing work has delved deeper to further investigate the main cause of this phenomenon. In this work, we make the attempt to analyze the over-smoothing problem from the perspective of graph, where such problem was first discovered and explored. Intuitively, the self-attention matrix can be seen as a normalized adjacent matrix of a corresponding graph. Based on the above connection, we provide some theoretical analysis and find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models. Specifically, if the standard deviation of layer normalization is sufficiently large, the output of Transformer stacks will converge to a specific low-rank subspace and result in over-smoothing. To alleviate the over-smoothing problem, we consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse. Extensive experiment results on various data sets illustrate the effect of our fusion method.
    Detecting and Learning the Unknown in Semantic Segmentation. (arXiv:2202.08700v1 [cs.CV])
    Semantic segmentation is a crucial component for perception in automated driving. Deep neural networks (DNNs) are commonly used for this task and they are usually trained on a closed set of object classes appearing in a closed operational domain. However, this is in contrast to the open world assumption in automated driving that DNNs are deployed to. Therefore, DNNs necessarily face data that they have never encountered previously, also known as anomalies, which are extremely safety-critical to properly cope with. In this work, we first give an overview about anomalies from an information-theoretic perspective. Next, we review research in detecting semantically unknown objects in semantic segmentation. We demonstrate that training for high entropy responses on anomalous objects outperforms other recent methods, which is in line with our theoretical findings. Moreover, we examine a method to assess the occurrence frequency of anomalies in order to select anomaly types to include into a model's set of semantic categories. We demonstrate that these anomalies can then be learned in an unsupervised fashion, which is particularly suitable in online applications based on deep learning.
    Accelerating Serverless Computing by Harvesting Idle Resources. (arXiv:2108.12717v2 [cs.DC] UPDATED)
    Serverless computing automates fine-grained resource scaling and simplifies the development and deployment of online services with stateless functions. However, it is still non-trivial for users to allocate appropriate resources due to various function types, dependencies, and input sizes. Misconfiguration of resource allocations leaves functions either under-provisioned or over-provisioned and leads to continuous low resource utilization. This paper presents Freyr, a new resource manager (RM) for serverless platforms that maximizes resource efficiency by dynamically harvesting idle resources from over-provisioned functions to under-provisioned functions. Freyr monitors each function's resource utilization in real-time, detects over-provisioning and under-provisioning, and learns to harvest idle resources safely and accelerates functions efficiently by applying deep reinforcement learning algorithms along with a safeguard mechanism. We have implemented and deployed a Freyr prototype in a 13-node Apache OpenWhisk cluster. Experimental results show that 38.8% of function invocations have idle resources harvested by Freyr, and 39.2% of invocations are accelerated by the harvested resources. Freyr reduces the 99th-percentile function response latency by 32.1% compared to the baseline RMs.
    Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks. (arXiv:1909.11957v5 [cs.LG] UPDATED)
    (Frankle & Carbin, 2019) shows that there exist winning tickets (small but critical subnetworks) for dense, randomly initialized networks, that can be trained alone to achieve comparable accuracies to the latter in a similar number of iterations. However, the identification of these winning tickets still requires the costly train-prune-retrain process, limiting their practical benefits. In this paper, we discover for the first time that the winning tickets can be identified at the very early training stage, which we term as early-bird (EB) tickets, via low-cost training schemes (e.g., early stopping and low-precision training) at large learning rates. Our finding of EB tickets is consistent with recently reported observations that the key connectivity patterns of neural networks emerge early. Furthermore, we propose a mask distance metric that can be used to identify EB tickets with low computational overhead, without needing to know the true winning tickets that emerge after the full training. Finally, we leverage the existence of EB tickets and the proposed mask distance to develop efficient training methods, which are achieved by first identifying EB tickets via low-cost schemes, and then continuing to train merely the EB tickets towards the target accuracy. Experiments based on various deep networks and datasets validate: 1) the existence of EB tickets, and the effectiveness of mask distance in efficiently identifying them; and 2) that the proposed efficient training via EB tickets can achieve up to 4.7x energy savings while maintaining comparable or even better accuracy, demonstrating a promising and easily adopted method for tackling cost-prohibitive deep network training. Code available at https://github.com/RICE-EIC/Early-Bird-Tickets.
    PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++. (arXiv:2201.02863v2 [cs.LG] UPDATED)
    Standard deep learning algorithms are implemented using floating-point real numbers. This presents an obstacle for implementing them on low-end devices which may not have dedicated floating-point units (FPUs). As a result, researchers in TinyML have considered machine learning algorithms that can train and run a deep neural network (DNN) on a low-end device using integer operations only. In this paper we propose PocketNN, a light and self-contained proof-of-concept framework in pure C++ for the training and inference of DNNs using only integers. Unlike other approaches, PocketNN directly operates on integers without requiring any explicit quantization algorithms or customized fixed-point formats. This was made possible by pocket activations, which are a family of activation functions devised for integer-only DNNs, and an emerging DNN training algorithm called direct feedback alignment (DFA). Unlike the standard backpropagation (BP), DFA trains each layer independently, thus avoiding integer overflow which is a key problem when using BP with integer-only operations. We used PocketNN to train some DNNs on two well-known datasets, MNIST and Fashion-MNIST. Our experiments show that the DNNs trained with our PocketNN achieved 96.98% and 87.7% accuracies on MNIST and Fashion-MNIST datasets, respectively. The accuracies are very close to the equivalent DNNs trained using BP with floating-point real number operations, such that accuracy degradations were just 1.02%p and 2.09%p, respectively. Finally, our PocketNN has high compatibility and portability for low-end devices as it is open source and implemented in pure C++ without any dependencies.
    Efficient and Reliable Probabilistic Interactive Learning with Structured Outputs. (arXiv:2202.08566v1 [cs.LG])
    In this position paper, we study interactive learning for structured output spaces, with a focus on active learning, in which labels are unknown and must be acquired, and on skeptical learning, in which the labels are noisy and may need relabeling. These scenarios require expressive models that guarantee reliable and efficient computation of probabilistic quantities to measure uncertainty. We identify conditions under which a class of probabilistic models -- which we denote CRISPs -- meet all of these conditions, thus delivering tractable computation of the above quantities while preserving expressiveness. Building on prior work on tractable probabilistic circuits, we illustrate how CRISPs enable robust and efficient active and skeptical learning in large structured output spaces.
    SAITS: Self-Attention-based Imputation for Time Series. (arXiv:2202.08516v1 [cs.LG])
    Missing data in time series is a pervasive problem that puts obstacles in the way of pattern recognition, especially in real-world applications. A popular solution is imputation, where the fundamental challenge is to determine what values should be filled in. This paper proposes SAITS, a novel method based on the self-attention mechanism for missing value imputation in multivariate time series. Trained by a joint-optimization approach, SAITS learns missing values from a weighted combination of two diagonally-masked self-attention (DMSA) blocks. DMSA explicitly captures both the temporal dependencies and feature correlations between time steps, which improves imputation accuracy and training speed. Meanwhile, the weighted-combination design enables SAITS to dynamically assign weights to the learned representations from two DMSA blocks according to the attention map and the missingness information. Extensive experiments demonstrate that SAITS outperforms the state-of-the-art methods on the time-series imputation task efficiently and reveal SAITS' potential to improve the learning performance of pattern recognition models on incomplete time-series data from the real world.
    A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning. (arXiv:2202.08509v1 [cs.SD])
    Audio-only-based wake word spotting (WWS) is challenging under noisy conditions due to environmental interference in signal transmission. In this paper, we investigate on designing a compact audio-visual WWS system by utilizing visual information to alleviate the degradation. Specifically, in order to use visual information, we first encode the detected lips to fixed-size vectors with MobileNet and concatenate them with acoustic features followed by the fusion network for WWS. However, the audio-visual model based on neural networks requires a large footprint and a high computational complexity. To meet the application requirements, we introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF), to the single-modal and multi-modal models, respectively. Tested on our in-house corpus for audio-visual WWS in a home TV scene, the proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions. Moreover, LTH-IF pruning can largely reduce the network parameters and computations with no degradation of WWS performance, leading to a potential product solution for the TV wake-up scenario.
    AutoScore-Ordinal: An interpretable machine learning framework for generating scoring models for ordinal outcomes. (arXiv:2202.08407v1 [cs.LG])
    Background: Risk prediction models are useful tools in clinical decision-making which help with risk stratification and resource allocations and may lead to a better health care for patients. AutoScore is a machine learning-based automatic clinical score generator for binary outcomes. This study aims to expand the AutoScore framework to provide a tool for interpretable risk prediction for ordinal outcomes. Methods: The AutoScore-Ordinal framework is generated using the same 6 modules of the original AutoScore algorithm including variable ranking, variable transformation, score derivation (from proportional odds models), model selection, score fine-tuning, and model evaluation. To illustrate the AutoScore-Ordinal performance, the method was conducted on electronic health records data from the emergency department at Singapore General Hospital over 2008 to 2017. The model was trained on 70% of the data, validated on 10% and tested on the remaining 20%. Results: This study included 445,989 inpatient cases, where the distribution of the ordinal outcome was 80.7% alive without 30-day readmission, 12.5% alive with 30-day readmission, and 6.8% died inpatient or by day 30 post discharge. Two point-based risk prediction models were developed using two sets of 8 predictor variables identified by the flexible variable selection procedure. The two models indicated reasonably good performance measured by mean area under the receiver operating characteristic curve (0.785 and 0.793) and generalized c-index (0.737 and 0.760), which were comparable to alternative models. Conclusion: AutoScore-Ordinal provides an automated and easy-to-use framework for development and validation of risk prediction models for ordinal outcomes, which can systematically identify potential predictors from high-dimensional data.  ( 2 min )
    ADD 2022: the First Audio Deep Synthesis Detection Challenge. (arXiv:2202.08433v1 [cs.SD])
    Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021. However, the recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill in the gap. The ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). The LF track focuses on dealing with bona fide and fully fake utterances with various real-world noises etc. The PF track aims to distinguish the partially fake audio from the real. The FG track is a rivalry game, which includes two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics, and protocols. We also report major findings that reflect the recent advances in audio deepfake detection tasks.  ( 2 min )
    On the evaluation of (meta-)solver approaches. (arXiv:2202.08613v1 [cs.AI])
    Meta-solver approaches exploits a number of individual solvers to potentially build a better solver. To assess the performance of meta-solvers, one can simply adopt the metrics typically used for individual solvers (e.g., runtime or solution quality), or employ more specific evaluation metrics (e.g., by measuring how close the meta-solver gets to its virtual best performance). In this paper, based on some recently published works, we provide an overview of different performance metrics for evaluating (meta-)solvers, by underlying their strengths and weaknesses.  ( 2 min )
    Towards Verifiable Federated Learning. (arXiv:2202.08310v1 [cs.CR])
    Federated learning (FL) is an emerging paradigm of collaborative machine learning that preserves user privacy while building powerful models. Nevertheless, due to the nature of open participation by self-interested entities, it needs to guard against potential misbehaviours by legitimate FL participants. FL verification techniques are promising solutions for this problem. They have been shown to effectively enhance the reliability of FL networks and help build trust among participants. Verifiable federated learning has become an emerging topic of research that has attracted significant interest from the academia and the industry alike. Currently, there is no comprehensive survey on the field of verifiable federated learning, which is interdisciplinary in nature and can be challenging for researchers to enter into. In this paper, we bridge this gap by reviewing works focusing on verifiable FL. We propose a novel taxonomy for verifiable FL covering both centralised and decentralised FL settings, summarise the commonly adopted performance evaluation approaches, and discuss promising directions towards a versatile verifiable FL framework.  ( 2 min )
    Graph Masked Autoencoder. (arXiv:2202.08391v1 [cs.LG])
    Transformers have achieved state-of-the-art performance in learning graph representations. However, there are still some challenges when applying transformers to real-world scenarios due to the fact that deep transformers are hard to be trained from scratch and the memory consumption is large. To address the two challenges, we propose Graph Masked Autoencoders (GMAE), a self-supervised model for learning graph representations, where vanilla graph transformers are used as the encoder and the decoder. GMAE takes partially masked graphs as input, and reconstructs the features of the masked nodes. We adopt asymmetric encoder-decoder design, where the encoder is a deep graph transformer and the decoder is a shallow graph transformer. The masking mechanism and the asymmetric design make GMAE a memory-efficient model compared with conventional transformers. We show that, compared with training from scratch, the graph transformer pre-trained using GMAE can achieve much better performance after fine-tuning. We also show that, when serving as a conventional self-supervised graph representation model and using an SVM model as the downstream graph classifier, GMAE achieves state-of-the-art performance on 5 of the 7 benchmark datasets.  ( 2 min )
    Design-Bench: Benchmarks for Data-Driven Offline Model-Based Optimization. (arXiv:2202.08450v1 [cs.LG])
    Black-box model-based optimization (MBO) problems, where the goal is to find a design input that maximizes an unknown objective function, are ubiquitous in a wide range of domains, such as the design of proteins, DNA sequences, aircraft, and robots. Solving model-based optimization problems typically requires actively querying the unknown objective function on design proposals, which means physically building the candidate molecule, aircraft, or robot, testing it, and storing the result. This process can be expensive and time consuming, and one might instead prefer to optimize for the best design using only the data one already has. This setting -- called offline MBO -- poses substantial and different algorithmic challenges than more commonly studied online techniques. A number of recent works have demonstrated success with offline MBO for high-dimensional optimization problems using high-capacity deep neural networks. However, the lack of standardized benchmarks in this emerging field is making progress difficult to track. To address this, we present Design-Bench, a benchmark for offline MBO with a unified evaluation protocol and reference implementations of recent methods. Our benchmark includes a suite of diverse and realistic tasks derived from real-world optimization problems in biology, materials science, and robotics that present distinct challenges for offline MBO. Our benchmark and reference implementations are released at github.com/rail-berkeley/design-bench and github.com/rail-berkeley/design-baselines.  ( 2 min )
    Hamilton-Jacobi equations on graphs with applications to semi-supervised learning and data depth. (arXiv:2202.08789v1 [math.AP])
    Shortest path graph distances are widely used in data science and machine learning, since they can approximate the underlying geodesic distance on the data manifold. However, the shortest path distance is highly sensitive to the addition of corrupted edges in the graph, either through noise or an adversarial perturbation. In this paper we study a family of Hamilton-Jacobi equations on graphs that we call the $p$-eikonal equation. We show that the $p$-eikonal equation with $p=1$ is a provably robust distance-type function on a graph, and the $p\to \infty$ limit recovers shortest path distances. While the $p$-eikonal equation does not correspond to a shortest-path graph distance, we nonetheless show that the continuum limit of the $p$-eikonal equation on a random geometric graph recovers a geodesic density weighted distance in the continuum. We consider applications of the $p$-eikonal equation to data depth and semi-supervised learning, and use the continuum limit to prove asymptotic consistency results for both applications. Finally, we show the results of experiments with data depth and semi-supervised learning on real image datasets, including MNIST, FashionMNIST and CIFAR-10, which show that the $p$-eikonal equation offers significantly better results compared to shortest path distances.  ( 2 min )
    Contrastive Meta Learning with Behavior Multiplicity for Recommendation. (arXiv:2202.08523v1 [cs.IR])
    A well-informed recommendation framework could not only help users identify their interested items, but also benefit the revenue of various online platforms (e.g., e-commerce, social media). Traditional recommendation models usually assume that only a single type of interaction exists between user and item, and fail to model the multiplex user-item relationships from multi-typed user behavior data, such as page view, add-to-favourite and purchase. While some recent studies propose to capture the dependencies across different types of behaviors, two important challenges have been less explored: i) Dealing with the sparse supervision signal under target behaviors (e.g., purchase). ii) Capturing the personalized multi-behavior patterns with customized dependency modeling. To tackle the above challenges, we devise a new model CML, Contrastive Meta Learning (CML), to maintain dedicated cross-type behavior dependency for different users. In particular, we propose a multi-behavior contrastive learning framework to distill transferable knowledge across different types of behaviors via the constructed contrastive loss. In addition, to capture the diverse multi-behavior patterns, we design a contrastive meta network to encode the customized behavior heterogeneity for different users. Extensive experiments on three real-world datasets indicate that our method consistently outperforms various state-of-the-art recommendation methods. Our empirical studies further suggest that the contrastive meta learning paradigm offers great potential for capturing the behavior multiplicity in recommendation. We release our model implementation at: https://github.com/weiwei1206/CML.git.  ( 2 min )
    Generative Models as Distributions of Functions. (arXiv:2102.04776v4 [cs.LG] UPDATED)
    Generative models are typically trained on grid-like data such as images. As a result, the size of these models usually scales directly with the underlying grid resolution. In this paper, we abandon discretized grids and instead parameterize individual data points by continuous functions. We then build generative models by learning distributions over such functions. By treating data points as functions, we can abstract away from the specific type of data we train on and construct models that are agnostic to discretization. To train our model, we use an adversarial approach with a discriminator that acts on continuous signals. Through experiments on a wide variety of data modalities including images, 3D shapes and climate data, we demonstrate that our model can learn rich distributions of functions independently of data type and resolution.  ( 2 min )
    Neural Loop Combiner: Neural Network Models for Assessing the Compatibility of Loops. (arXiv:2008.02011v2 [cs.SD] UPDATED)
    Music producers who use loops may have access to thousands in loop libraries, but finding ones that are compatible is a time-consuming process; we hope to reduce this burden with automation. State-of-the-art systems for estimating compatibility, such as AutoMashUpper, are mostly rule-based and could be improved on with machine learn-ing. To train a model, we need a large set of loops with ground truth compatibility values. No such dataset exists, so we extract loops from existing music to obtain positive examples of compatible loops, and propose and compare various strategies for choosing negative examples. For re-producibility, we curate data from the Free Music Archive.Using this data, we investigate two types of model architectures for estimating the compatibility of loops: one based on a Siamese network, and the other a pure convolutional neural network (CNN). We conducted a user study in which participants rated the quality of the combinations suggested by each model, and found the CNN to outperform the Siamese network. Both model-based approaches outperformed the rule-based one. We have opened source the code for building the models and the dataset.  ( 2 min )
    Human-Algorithm Collaboration: Achieving Complementarity and Avoiding Unfairness. (arXiv:2202.08821v1 [cs.CY])
    Much of machine learning research focuses on predictive accuracy: given a task, create a machine learning model (or algorithm) that maximizes accuracy. In many settings, however, the final prediction or decision of a system is under the control of a human, who uses an algorithm's output along with their own personal expertise in order to produce a combined prediction. One ultimate goal of such collaborative systems is "complementarity": that is, to produce lower loss (equivalently, greater payoff or utility) than either the human or algorithm alone. However, experimental results have shown that even in carefully-designed systems, complementary performance can be elusive. Our work provides three key contributions. First, we provide a theoretical framework for modeling simple human-algorithm systems and demonstrate that multiple prior analyses can be expressed within it. Next, we use this model to prove conditions where complementarity is impossible, and give constructive examples of where complementarity is achievable. Finally, we discuss the implications of our findings, especially with respect to the fairness of a classifier. In sum, these results deepen our understanding of key factors influencing the combined performance of human-algorithm systems, giving insight into how algorithmic tools can best be designed for collaborative environments.  ( 2 min )
    Dynamic Object Comprehension: A Framework For Evaluating Artificial Visual Perception. (arXiv:2202.08490v1 [cs.CV])
    Augmented and Mixed Reality are emerging as likely successors to the mobile internet. However, many technical challenges remain. One of the key requirements of these systems is the ability to create a continuity between physical and virtual worlds, with the user's visual perception as the primary interface medium. Building this continuity requires the system to develop a visual understanding of the physical world. While there has been significant recent progress in computer vision and AI techniques such as image classification and object detection, success in these areas has not yet led to the visual perception required for these critical MR and AR applications. A significant issue is that current evaluation criteria are insufficient for these applications. To motivate and evaluate progress in this emerging area, there is a need for new metrics. In this paper we outline limitations of current evaluation criteria and propose new criteria.
    An Equivalence Between Data Poisoning and Byzantine Gradient Attacks. (arXiv:2202.08578v1 [cs.LG])
    To study the resilience of distributed learning, the "Byzantine" literature considers a strong threat model where workers can report arbitrary gradients to the parameter server. Whereas this model helped obtain several fundamental results, it has sometimes been considered unrealistic, when the workers are mostly trustworthy machines. In this paper, we show a surprising equivalence between this model and data poisoning, a threat considered much more realistic. More specifically, we prove that every gradient attack can be reduced to data poisoning, in any personalized federated learning system with PAC guarantees (which we show are both desirable and realistic). This equivalence makes it possible to obtain new impossibility results on the resilience to data poisoning as corollaries of existing impossibility theorems on Byzantine machine learning. Moreover, using our equivalence, we derive a practical attack that we show (theoretically and empirically) can be very effective against classical personalized federated learning models.
    SplitFed: When Federated Learning Meets Split Learning. (arXiv:2004.12088v5 [cs.LG] UPDATED)
    Federated learning (FL) and split learning (SL) are two popular distributed machine learning approaches. Both follow a model-to-data scenario; clients train and test machine learning models without sharing raw data. SL provides better model privacy than FL due to the machine learning model architecture split between clients and the server. Moreover, the split model makes SL a better option for resource-constrained environments. However, SL performs slower than FL due to the relay-based training across multiple clients. In this regard, this paper presents a novel approach, named splitfed learning (SFL), that amalgamates the two approaches eliminating their inherent drawbacks, along with a refined architectural configuration incorporating differential privacy and PixelDP to enhance data privacy and model robustness. Our analysis and empirical results demonstrate that (pure) SFL provides similar test accuracy and communication efficiency as SL while significantly decreasing its computation time per global epoch than in SL for multiple clients. Furthermore, as in SL, its communication efficiency over FL improves with the number of clients. Besides, the performance of SFL with privacy and robustness measures is further evaluated under extended experimental settings.  ( 2 min )
    Nested Multiple Instance Learning with Attention Mechanisms. (arXiv:2111.00947v3 [cs.LG] UPDATED)
    Strongly supervised learning requires detailed knowledge of truth labels at instance levels, and in many machine learning applications this is a major drawback. Multiple instance learning (MIL) is a popular weakly supervised learning method where truth labels are not available at instance level, but only at bag-of-instances level. However, sometimes the nature of the problem requires a more complex description, where a nested architecture of bag-of-bags at different levels can capture underlying relationships, like similar instances grouped together. Predicting the latent labels of instances or inner-bags might be as important as predicting the final bag-of-bags label but is lost in a straightforward nested setting. We propose a Nested Multiple Instance with Attention (NMIA) model architecture combining the concept of nesting with attention mechanisms. We show that NMIA performs as conventional MIL in simple scenarios and can grasp a complex scenario providing insights to the latent labels at different levels.
    Robust Reinforcement Learning via Genetic Curriculum. (arXiv:2202.08393v1 [cs.LG])
    Achieving robust performance is crucial when applying deep reinforcement learning (RL) in safety critical systems. Some of the state of the art approaches try to address the problem with adversarial agents, but these agents often require expert supervision to fine tune and prevent the adversary from becoming too challenging to the trainee agent. While other approaches involve automatically adjusting environment setups during training, they have been limited to simple environments where low-dimensional encodings can be used. Inspired by these approaches, we propose genetic curriculum, an algorithm that automatically identifies scenarios in which the agent currently fails and generates an associated curriculum to help the agent learn to solve the scenarios and acquire more robust behaviors. As a non-parametric optimizer, our approach uses a raw, non-fixed encoding of scenarios, reducing the need for expert supervision and allowing our algorithm to adapt to the changing performance of the agent. Our empirical studies show improvement in robustness over the existing state of the art algorithms, providing training curricula that result in agents being 2 - 8x times less likely to fail without sacrificing cumulative reward. We include an ablation study and share insights on why our algorithm outperforms prior approaches.  ( 2 min )
    Entropic Associative Memory for Manuscript Symbols. (arXiv:2202.08413v1 [cs.LG])
    Manuscript symbols can be stored, recognized and retrieved from an entropic digital memory that is associative and distributed but yet declarative; memory retrieval is a constructive operation, memory cues to objects not contained in the memory are rejected directly without search, and memory operations can be performed through parallel computations. Manuscript symbols, both letters and numerals, are represented in Associative Memory Registers that have an associated entropy. The memory recognition operation obeys an entropy trade-off between precision and recall, and the entropy level impacts on the quality of the objects recovered through the memory retrieval operation. The present proposal is contrasted in several dimensions with neural networks models of associative memory. We discuss the operational characteristics of the entropic associative memory for retrieving objects with both complete and incomplete information, such as severe occlusions. The experiments reported in this paper add evidence on the potential of this framework for developing practical applications and computational models of natural memory.
    Learning stochastic dynamics and predicting emergent behavior using transformers. (arXiv:2202.08708v1 [cond-mat.stat-mech])
    We show that a neural network originally designed for language processing can learn the dynamical rules of a stochastic system by observation of a single dynamical trajectory of the system, and can accurately predict its emergent behavior under conditions not observed during training. We consider a lattice model of active matter undergoing continuous-time Monte Carlo dynamics, simulated at a density at which its steady state comprises small, dispersed clusters. We train a neural network called a transformer on a single trajectory of the model. The transformer, which we show has the capacity to represent dynamical rules that are numerous and nonlocal, learns that the dynamics of this model consists of a small number of processes. Forward-propagated trajectories of the trained transformer, at densities not encountered during training, exhibit motility-induced phase separation and so predict the existence of a nonequilibrium phase transition. Transformers have the flexibility to learn dynamical rules from observation without explicit enumeration of rates or coarse-graining of configuration space, and so the procedure used here can be applied to a wide range of physical systems, including those with large and complex dynamical generators.  ( 2 min )
    Universality of empirical risk minimization. (arXiv:2202.08832v1 [math.ST])
    Consider supervised learning from i.i.d. samples $\{{\boldsymbol x}_i,y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature vectors and ${y} \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol \theta}_1, . . . , {\boldsymbol \theta}_{\mathsf k} \in \mathbb{R}^p$ , and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n,p\to\infty$, with $n/p = \Theta(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed $-$to leading order$-$ under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).  ( 2 min )
    Winograd Convolution: A Perspective from Fault Tolerance. (arXiv:2202.08675v1 [cs.LG])
    Winograd convolution is originally proposed to reduce the computing overhead by converting multiplication in neural network (NN) with addition via linear transformation. Other than the computing efficiency, we observe its great potential in improving NN fault tolerance and evaluate its fault tolerance comprehensively for the first time. Then, we explore the use of fault tolerance of winograd convolution for either fault-tolerant or energy-efficient NN processing. According to our experiments, winograd convolution can be utilized to reduce fault-tolerant design overhead by 27.49\% or energy consumption by 7.19\% without any accuracy loss compared to that without being aware of the fault tolerance  ( 2 min )
    Learning continuous models for continuous physics. (arXiv:2202.08494v1 [cs.LG])
    Dynamical systems that evolve continuously over time are ubiquitous throughout science and engineering. Machine learning (ML) provides data-driven approaches to model and predict the dynamics of such systems. A core issue with this approach is that ML models are typically trained on discrete data, using ML methodologies that are not aware of underlying continuity properties, which results in models that often do not capture the underlying continuous dynamics of a system of interest. As a result, these ML models are of limited use for for many scientific and engineering applications. To address this challenge, we develop a convergence test based on numerical analysis theory. Our test verifies whether a model has learned a function that accurately approximates a system's underlying continuous dynamics. Models that fail this test fail to capture relevant dynamics, rendering them of limited utility for many scientific prediction tasks; while models that pass this test enable both better interpolation and better extrapolation in multiple ways. Our results illustrate how principled numerical analysis methods can be coupled with existing ML training/testing methodologies to validate models for science and engineering applications.  ( 2 min )
    Learning Transferrable Representations of Career Trajectories for Economic Prediction. (arXiv:2202.08370v1 [cs.LG])
    Understanding career trajectories -- the sequences of jobs that individuals hold over their working lives -- is important to economists for studying labor markets. In the past, economists have estimated relevant quantities by fitting predictive models to small surveys, but in recent years large datasets of online resumes have also become available. These new datasets provide job sequences of many more individuals, but they are too large and complex for standard econometric modeling. To this end, we adapt ideas from modern language modeling to the analysis of large-scale job sequence data. We develop CAREER, a transformer-based model that learns a low-dimensional representation of an individual's job history. This representation can be used to predict jobs directly on a large dataset, or can be "transferred" to represent jobs in smaller and better-curated datasets. We fit the model to a large dataset of resumes, 24 million people who are involved in more than a thousand unique occupations. It forms accurate predictions on held-out data, and it learns useful career representations that can be fine-tuned to make accurate predictions on common economics datasets.  ( 2 min )
    Federated Stochastic Gradient Descent Begets Self-Induced Momentum. (arXiv:2202.08402v1 [cs.LG])
    Federated learning (FL) is an emerging machine learning method that can be applied in mobile edge systems, in which a server and a host of clients collaboratively train a statistical model utilizing the data and computation resources of the clients without directly exposing their privacy-sensitive data. We show that running stochastic gradient descent (SGD) in such a setting can be viewed as adding a momentum-like term to the global aggregation process. Based on this finding, we further analyze the convergence rate of a federated learning system by accounting for the effects of parameter staleness and communication resources. These results advance the understanding of the Federated SGD algorithm, and also forges a link between staleness analysis and federated computing systems, which can be useful for systems designers.  ( 2 min )
    Anomalib: A Deep Learning Library for Anomaly Detection. (arXiv:2202.08341v1 [cs.CV])
    This paper introduces anomalib, a novel library for unsupervised anomaly detection and localization. With reproducibility and modularity in mind, this open-source library provides algorithms from the literature and a set of tools to design custom anomaly detection algorithms via a plug-and-play approach. Anomalib comprises state-of-the-art anomaly detection algorithms that achieve top performance on the benchmarks and that can be used off-the-shelf. In addition, the library provides components to design custom algorithms that could be tailored towards specific needs. Additional tools, including experiment trackers, visualizers, and hyper-parameter optimizers, make it simple to design and implement anomaly detection models. The library also supports OpenVINO model optimization and quantization for real-time deployment. Overall, anomalib is an extensive library for the design, implementation, and deployment of unsupervised anomaly detection models from data to the edge.  ( 2 min )
    Deep Time Series Forecasting with Shape and Temporal Criteria. (arXiv:2104.04610v2 [stat.ML] UPDATED)
    This paper addresses the problem of multi-step time series forecasting for non-stationary signals that can present sudden changes. Current state-of-the-art deep learning forecasting methods, often trained with variants of the MSE, lack the ability to provide sharp predictions in deterministic and probabilistic contexts. To handle these challenges, we propose to incorporate shape and temporal criteria in the training objective of deep models. We define shape and temporal similarities and dissimilarities, based on a smooth relaxation of Dynamic Time Warping (DTW) and Temporal Distortion Index (TDI), that enable to build differentiable loss functions and positive semi-definite (PSD) kernels. With these tools, we introduce DILATE (DIstortion Loss including shApe and TimE), a new objective for deterministic forecasting, that explicitly incorporates two terms supporting precise shape and temporal change detection. For probabilistic forecasting, we introduce STRIPE++ (Shape and Time diverRsIty in Probabilistic forEcasting), a framework for providing a set of sharp and diverse forecasts, where the structured shape and time diversity is enforced with a determinantal point process (DPP) diversity loss. Extensive experiments and ablations studies on synthetic and real-world datasets confirm the benefits of leveraging shape and time features in time series forecasting.  ( 2 min )
    Does the End Justify the Means? On the Moral Justification of Fairness-Aware Machine Learning. (arXiv:2202.08536v1 [cs.LG])
    Despite an abundance of fairness-aware machine learning (fair-ml) algorithms, the moral justification of how these algorithms enforce fairness metrics is largely unexplored. The goal of this paper is to elicit the moral implications of a fair-ml algorithm. To this end, we first consider the moral justification of the fairness metrics for which the algorithm optimizes. We present an extension of previous work to arrive at three propositions that can justify the fairness metrics. Different from previous work, our extension highlights that the consequences of predicted outcomes are important for judging fairness. We draw from the extended framework and empirical ethics to identify moral implications of the fair-ml algorithm. We focus on the two optimization strategies inherent to the algorithm: group-specific decision thresholds and randomized decision thresholds. We argue that the justification of the algorithm can differ depending on one's assumptions about the (social) context in which the algorithm is applied - even if the associated fairness metric is the same. Finally, we sketch paths for future work towards a more complete evaluation of fair-ml algorithms, beyond their direct optimization objectives.  ( 2 min )
    Multi-Objective Model Selection for Time Series Forecasting. (arXiv:2202.08485v1 [cs.LG])
    Research on time series forecasting has predominantly focused on developing methods that improve accuracy. However, other criteria such as training time or latency are critical in many real-world applications. We therefore address the question of how to choose an appropriate forecasting model for a given dataset among the plethora of available forecasting methods when accuracy is only one of many criteria. For this, our contributions are two-fold. First, we present a comprehensive benchmark, evaluating 7 classical and 6 deep learning forecasting methods on 44 heterogeneous, publicly available datasets. The benchmark code is open-sourced along with evaluations and forecasts for all methods. These evaluations enable us to answer open questions such as the amount of data required for deep learning models to outperform classical ones. Second, we leverage the benchmark evaluations to learn good defaults that consider multiple objectives such as accuracy and latency. By learning a mapping from forecasting models to performance metrics, we show that our method PARETOSELECT is able to accurately select models from the Pareto front -- alleviating the need to train or evaluate many forecasting models for model selection. To the best of our knowledge, PARETOSELECT constitutes the first method to learn default models in a multi-objective setting.  ( 2 min )
    Text-Based Action-Model Acquisition for Planning. (arXiv:2202.08373v1 [cs.LG])
    Although there have been approaches that are capable of learning action models from plan traces, there is no work on learning action models from textual observations, which is pervasive and much easier to collect from real-world applications compared to plan traces. In this paper we propose a novel approach to learning action models from natural language texts by integrating Constraint Satisfaction and Natural Language Processing techniques. Specifically, we first build a novel language model to extract plan traces from texts, and then build a set of constraints to generate action models based on the extracted plan traces. After that, we iteratively improve the language model and constraints until we achieve the convergent language model and action models. We empirically exhibit that our approach is both effective and efficient.
    $k\texttt{-experts}$ -- Online Policies and Fundamental Limits. (arXiv:2110.07881v2 [cs.IT] UPDATED)
    We introduce the $\texttt{$k$-experts}$ problem - a generalization of the classic Prediction with Expert's Advice framework. Unlike the classic version, where the learner selects exactly one expert from a pool of $N$ experts at each round, in this problem, the learner can select a subset of $k$ experts at each round $(1\leq k\leq N)$. The reward obtained by the learner at each round is assumed to be a function of the $k$ selected experts. The primary objective is to design an online learning policy with a small regret. In this pursuit, we propose $\texttt{SAGE}$ ($\textbf{Sa}$mpled Hed$\textbf{ge}$) - a framework for designing efficient online learning policies by leveraging statistical sampling techniques. For a wide class of reward functions, we show that $\texttt{SAGE}$ either achieves the first sublinear regret guarantee or improves upon the existing ones. Furthermore, going beyond the notion of regret, we fully characterize the mistake bounds achievable by online learning policies for stable loss functions. We conclude the paper by establishing a tight regret lower bound for a variant of the $\texttt{$k$-experts}$ problem and carrying out experiments with standard datasets.
    Causal Effect Identification with Context-specific Independence Relations of Control Variables. (arXiv:2110.12064v2 [cs.LG] UPDATED)
    We study the problem of causal effect identification from observational distribution given the causal graph and some context-specific independence (CSI) relations. It was recently shown that this problem is NP-hard, and while a sound algorithm to learn the causal effects is proposed in Tikka et al. (2019), no complete algorithm for the task exists. In this work, we propose a sound and complete algorithm for the setting when the CSI relations are limited to observed nodes with no parents in the causal graph. One limitation of the state of the art in terms of its applicability is that the CSI relations among all variables, even unobserved ones, must be given (as opposed to learned). Instead, We introduce a set of graphical constraints under which the CSI relations can be learned from mere observational distribution. This expands the set of identifiable causal effects beyond the state of the art.
    Multivariate Time Series Forecasting with Dynamic Graph Neural ODEs. (arXiv:2202.08408v1 [cs.LG])
    Multivariate time series forecasting has long received significant attention in real-world applications, such as energy consumption and traffic prediction. While recent methods demonstrate good forecasting abilities, they suffer from three fundamental limitations. (i) Discrete neural architectures: Interlacing individually parameterized spatial and temporal blocks to encode rich underlying patterns leads to discontinuous latent state trajectories and higher forecasting numerical errors. (ii) High complexity: Discrete approaches complicate models with dedicated designs and redundant parameters, leading to higher computational and memory overheads. (iii) Reliance on graph priors: Relying on predefined static graph structures limits their effectiveness and practicability in real-world applications. In this paper, we address all the above limitations by proposing a continuous model to forecast Multivariate Time series with dynamic Graph neural Ordinary Differential Equations (MTGODE). Specifically, we first abstract multivariate time series into dynamic graphs with time-evolving node features and unknown graph structures. Then, we design and solve a neural ODE to complement missing graph topologies and unify both spatial and temporal message passing, allowing deeper graph propagation and fine-grained temporal information aggregation to characterize stable and precise latent spatial-temporal dynamics. Our experiments demonstrate the superiorities of MTGODE from various perspectives on five time series benchmark datasets.
    Robust SVM Optimization in Banach spaces. (arXiv:2202.08567v1 [stat.ML])
    We address the issue of binary classification in Banach spaces in presence of uncertainty. We show that a number of results from classical support vector machines theory can be appropriately generalised to their robust counterpart in Banach spaces. These include the Representer Theorem, strong duality for the associated Optimization problem as well as their geometric interpretation. Furthermore, we propose a game theoretic interpretation by expressing a Nash equilibrium problem formulation for the more general problem of finding the closest points in two closed convex sets when the underlying space is reflexive and smooth.
    Deep Surrogate Q-Learning for Autonomous Driving. (arXiv:2010.11278v2 [cs.LG] UPDATED)
    Challenging problems of deep reinforcement learning systems with regard to the application on real systems are their adaptivity to changing environments and their efficiency w.r.t. computational resources and data. In the application of learning lane-change behavior for autonomous driving, agents have to deal with a varying number of surrounding vehicles. Furthermore, the number of required transitions imposes a bottleneck, since test drivers cannot perform an arbitrary amount of lane changes in the real world. In the off-policy setting, additional information on solving the task can be gained by observing actions from others. While in the classical RL setup this knowledge remains unused, we use other drivers as surrogates to learn the agent's value function more efficiently. We propose Surrogate Q-learning that deals with the aforementioned problems and reduces the required driving time drastically. We further propose an efficient implementation based on a permutation-equivariant deep neural network architecture of the Q-function to estimate action-values for a variable number of vehicles in sensor range. We show that the architecture leads to a novel replay sampling technique we call Scene-centric Experience Replay and evaluate the performance of Surrogate Q-learning and Scene-centric Experience Replay in the open traffic simulator SUMO. Additionally, we show that our methods enhance real-world applicability of RL systems by learning policies on the real highD dataset.
    Measuring Trustworthiness or Automating Physiognomy? A Comment on Safra, Chevallier, Gr\`ezes, and Baumard (2020). (arXiv:2202.08674v1 [cs.LG])
    Interpersonal trust - a shared display of confidence and vulnerability toward other individuals - can be seen as instrumental in the development of human societies. Safra, Chevallier, Gr\`ezes, and Baumard (2020) studied the historical progression of interpersonal trust by training a machine learning (ML) algorithm to generate trustworthiness ratings of historical portraits, based on facial features. They reported that trustworthiness ratings of portraits dated between 1500--2000CE increased with time, claiming that this evidenced a broader increase in interpersonal trust coinciding with several metrics of societal progress. We argue that these claims are confounded by several methodological and analytical issues and highlight troubling parallels between Safra et al.'s algorithm and the pseudoscience of physiognomy. We discuss the implications and potential real-world consequences of these issues in further detail.
    Phase Aberration Robust Beamformer for Planewave US Using Self-Supervised Learning. (arXiv:2202.08262v1 [eess.IV])
    Ultrasound (US) is widely used for clinical imaging applications thanks to its real-time and non-invasive nature. However, its lesion detectability is often limited in many applications due to the phase aberration artefact caused by variations in the speed of sound (SoS) within body parts. To address this, here we propose a novel self-supervised 3D CNN that enables phase aberration robust plane-wave imaging. Instead of aiming at estimating the SoS distribution as in conventional methods, our approach is unique in that the network is trained in a self-supervised manner to robustly generate a high-quality image from various phase aberrated images by modeling the variation in the speed of sound as stochastic. Experimental results using real measurements from tissue-mimicking phantom and \textit{in vivo} scans confirmed that the proposed method can significantly reduce the phase aberration artifacts and improve the visual quality of deep scans.
    TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery. (arXiv:2202.08320v1 [cs.LG])
    Machine learning has huge potential to revolutionize the field of drug discovery and is attracting increasing attention in recent years. However, lacking domain knowledge (e.g., which tasks to work on), standard benchmarks and data preprocessing pipelines are the main obstacles for machine learning researchers to work in this domain. To facilitate the progress of machine learning for drug discovery, we develop TorchDrug, a powerful and flexible machine learning platform for drug discovery built on top of PyTorch. TorchDrug benchmarks a variety of important tasks in drug discovery, including molecular property prediction, pretrained molecular representations, de novo molecular design and optimization, retrosynthsis prediction, and biomedical knowledge graph reasoning. State-of-the-art techniques based on geometric deep learning (or graph machine learning), deep generative models, reinforcement learning and knowledge graph reasoning are implemented for these tasks. TorchDrug features a hierarchical interface that facilitates customization from both novices and experts in this domain. Tutorials, benchmark results and documentation are available at https://torchdrug.ai. Code is released under Apache License 2.0.  ( 2 min )
    Characterizing Attacks on Deep Reinforcement Learning. (arXiv:1907.09470v3 [cs.LG] UPDATED)
    Recent studies show that Deep Reinforcement Learning (DRL) models are vulnerable to adversarial attacks, which attack DRL models by adding small perturbations to the observations. However, some attacks assume full availability of the victim model, and some require a huge amount of computation, making them less feasible for real world applications. In this work, we make further explorations of the vulnerabilities of DRL by studying other aspects of attacks on DRL using realistic and efficient attacks. First, we adapt and propose efficient black-box attacks when we do not have access to DRL model parameters. Second, to address the high computational demands of existing attacks, we introduce efficient online sequential attacks that exploit temporal consistency across consecutive steps. Third, we explore the possibility of an attacker perturbing other aspects in the DRL setting, such as the environment dynamics. Finally, to account for imperfections in how an attacker would inject perturbations in the physical world, we devise a method for generating a robust physical perturbations to be printed. The attack is evaluated on a real-world robot under various conditions. We conduct extensive experiments both in simulation such as Atari games, robotics and autonomous driving, and on real-world robotics, to compare the effectiveness of the proposed attacks with baseline approaches. To the best of our knowledge, we are the first to apply adversarial attacks on DRL systems to physical robots.  ( 2 min )
    Oracle-Efficient Online Learning for Beyond Worst-Case Adversaries. (arXiv:2202.08549v1 [cs.LG])
    In this paper, we study oracle-efficient algorithms for beyond worst-case analysis of online learning. We focus on two settings. First, the smoothed analysis setting of [RST11, HRS12] where an adversary is constrained to generating samples from distributions whose density is upper bounded by $1/\sigma$ times the uniform density. Second, the setting of $K$-hint transductive learning, where the learner is given access to $K$ hints per time step that are guaranteed to include the true instance. We give the first known oracle-efficient algorithms for both settings that depend only on the VC dimension of the class and parameters $\sigma$ and $K$ that capture the power of the adversary. In particular, we achieve oracle-efficient regret bounds of $ O ( \sqrt{T (d / \sigma )^{1/2} } ) $ and $ O ( \sqrt{T d K } )$ respectively for these setting. For the smoothed analysis setting, our results give the first oracle-efficient algorithm for online learning with smoothed adversaries [HRS21]. This contrasts the computational separation between online learning with worst-case adversaries and offline learning established by [HK16]. Our algorithms also imply improved bounds for worst-case setting with small domains. In particular, we give an oracle-efficient algorithm with regret of $O ( \sqrt{T(d \vert{\mathcal{X}}\vert ) ^{1/2} })$, which is a refinement of the earlier $O ( \sqrt{T\vert{\mathcal{X} } \vert })$ bound by [DS16].  ( 2 min )
    How to Fill the Optimum Set? Population Gradient Descent with Harmless Diversity. (arXiv:2202.08376v1 [cs.LG])
    Although traditional optimization methods focus on finding a single optimal solution, most objective functions in modern machine learning problems, especially those in deep learning, often have multiple or infinite numbers of optima. Therefore, it is useful to consider the problem of finding a set of diverse points in the optimum set of an objective function. In this work, we frame this problem as a bi-level optimization problem of maximizing a diversity score inside the optimum set of the main loss function, and solve it with a simple population gradient descent framework that iteratively updates the points to maximize the diversity score in a fashion that does not hurt the optimization of the main loss. We demonstrate that our method can efficiently generate diverse solutions on a variety of applications, including text-to-image generation, text-to-mesh generation, molecular conformation generation and ensemble neural network training.  ( 2 min )
    Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle. (arXiv:2202.08522v1 [cs.LG])
    The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks. It has received great attention in the last decade and the balanced case, i.e., assuming all clusters have large size, has been well studied. However, our understanding of SBM with unbalanced communities (arguably, more relevant in practice) is still very limited. In this paper, we provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes. Under the KS-threshold conjecture, the tradeoff between the parameters in our algorithm is nearly optimal up to polylogarithmic factors for a wide range of regimes. As a byproduct, we obtain a time-efficient algorithm with improved query complexity for a clustering problem with a faulty oracle, which improves upon a number of previous work (Mazumdarand Saha [NIPS 2017], Larsen, Mitzenmacher and Tsourakakis [WWW 2020], Peng and Zhang[COLT 2021]). Under the KS-threshold conjecture, the query complexity of our algorithm is nearly optimal up to polylogarithmic factors.  ( 2 min )
    The learning phases in NN: From Fitting the Majority to Fitting a Few. (arXiv:2202.08299v1 [cs.LG])
    The learning dynamics of deep neural networks are subject to controversy. Using the information bottleneck (IB) theory separate fitting and compression phases have been put forward but have since been heavily debated. We approach learning dynamics by analyzing a layer's reconstruction ability of the input and prediction performance based on the evolution of parameters during training. We show that a prototyping phase decreasing reconstruction loss initially, followed by reducing classification loss of a few samples, which increases reconstruction loss, exists under mild assumptions on the data. Aside from providing a mathematical analysis of single layer classification networks, we also assess the behavior using common datasets and architectures from computer vision such as ResNet and VGG.
    Low-Rank Phase Retrieval with Structured Tensor Models. (arXiv:2202.08260v1 [eess.IV])
    We study the low-rank phase retrieval problem, where the objective is to recover a sequence of signals (typically images) given the magnitude of linear measurements of those signals. Existing solutions involve recovering a matrix constructed by vectorizing and stacking each image. These algorithms model this matrix to be low-rank and leverage the low-rank property to decrease the sample complexity required for accurate recovery. However, when the number of available measurements is more limited, these low-rank matrix models can often fail. We propose an algorithm called Tucker-Structured Phase Retrieval (TSPR) that models the sequence of images as a tensor rather than a matrix that we factorize using the Tucker decomposition. This factorization reduces the number of parameters that need to be estimated, allowing for a more accurate reconstruction in the under-sampled regime. Interestingly, we observe that this structure also has improved performance in the over-determined setting when the Tucker ranks are chosen appropriately. We demonstrate the effectiveness of our approach on real video datasets under several different measurement models.  ( 2 min )
    More to Less (M2L): Enhanced Health Recognition in the Wild with Reduced Modality of Wearable Sensors. (arXiv:2202.08267v1 [cs.LG])
    Accurately recognizing health-related conditions from wearable data is crucial for improved healthcare outcomes. To improve the recognition accuracy, various approaches have focused on how to effectively fuse information from multiple sensors. Fusing multiple sensors is a common scenario in many applications, but may not always be feasible in real-world scenarios. For example, although combining bio-signals from multiple sensors (i.e., a chest pad sensor and a wrist wearable sensor) has been proved effective for improved performance, wearing multiple devices might be impractical in the free-living context. To solve the challenges, we propose an effective more to less (M2L) learning framework to improve testing performance with reduced sensors through leveraging the complementary information of multiple modalities during training. More specifically, different sensors may carry different but complementary information, and our model is designed to enforce collaborations among different modalities, where positive knowledge transfer is encouraged and negative knowledge transfer is suppressed, so that better representation is learned for individual modalities. Our experimental results show that our framework achieves comparable performance when compared with the full modalities. Our code and results will be available at https://github.com/compwell-org/More2Less.git.  ( 2 min )
    Efficient Distributed Machine Learning via Combinatorial Multi-Armed Bandits. (arXiv:2202.08302v1 [cs.IT])
    We consider the distributed stochastic gradient descent problem, where a main node distributes gradient calculations among $n$ workers from which at most $b \leq n$ can be utilized in parallel. By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade-off the error of the algorithm with its runtime by gradually increasing $k$ as the algorithm evolves. However, this strategy, referred to as adaptive k sync, can incur additional costs since it ignores the computational efforts of slow workers. We propose a cost-efficient scheme that assigns tasks only to $k$ workers and gradually increases $k$. As the response times of the available workers are unknown to the main node a priori, we utilize a combinatorial multi-armed bandit model to learn which workers are the fastest while assigning gradient calculations, and to minimize the effect of slow workers. Assuming that the mean response times of the workers are independent and exponentially distributed with different means, we give empirical and theoretical guarantees on the regret of our strategy, i.e., the extra time spent to learn the mean response times of the workers. Compared to adaptive k sync, our scheme achieves significantly lower errors with the same computational efforts while being inferior in terms of speed.  ( 2 min )
    Private Online Prefix Sums via Optimal Matrix Factorizations. (arXiv:2202.08312v1 [cs.LG])
    Motivated by differentially-private (DP) training of machine learning models and other applications, we investigate the problem of computing prefix sums in the online (streaming) setting with DP. This problem has previously been addressed by special-purpose tree aggregation schemes with hand-crafted estimators. We show that these previous schemes can all be viewed as specific instances of a broad class of matrix-factorization-based DP mechanisms, and that in fact much better mechanisms exist in this class. In particular, we characterize optimal factorizations of linear queries under online constraints, deriving existence, uniqueness, and explicit expressions that allow us to efficiently compute optimal mechanisms, including for online prefix sums. These solutions improve over the existing state-of-the-art by a significant constant factor, and avoid some of the artifacts introduced by the use of the tree data structure.  ( 2 min )
    Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines. (arXiv:2202.08679v1 [cs.LG])
    Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. Maximizing resource utilization is becoming more challenging as the throughput of training processes increases with hardware innovations (e.g., faster GPUs, TPUs, and inter-connects) and advanced parallelization techniques that yield better scalability. At the same time, the amount of training data needed in order to train increasingly complex models is growing. As a consequence of this development, data preprocessing and provisioning are becoming a severe bottleneck in end-to-end deep learning pipelines. In this paper, we provide an in-depth analysis of data preprocessing pipelines from four different machine learning domains. We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines and extract individual trade-offs to optimize throughput, preprocessing time, and storage consumption. Additionally, we provide an open-source profiling library that can automatically decide on a suitable preprocessing strategy to maximize throughput. By applying our generated insights to real-world use-cases, we obtain an increased throughput of 3x to 13x compared to an untuned system while keeping the pipeline functionally identical. These findings show the enormous potential of data pipeline tuning.  ( 2 min )
    A Developmentally-Inspired Examination of Shape versus Texture Bias in Machines. (arXiv:2202.08340v1 [cs.CV])
    Early in development, children learn to extend novel category labels to objects with the same shape, a phenomenon known as the shape bias. Inspired by these findings, Geirhos et al. (2019) examined whether deep neural networks show a shape or texture bias by constructing images with conflicting shape and texture cues. They found that convolutional neural networks strongly preferred to classify familiar objects based on texture as opposed to shape, suggesting a texture bias. However, there are a number of differences between how the networks were tested in this study versus how children are typically tested. In this work, we re-examine the inductive biases of neural networks by adapting the stimuli and procedure from Geirhos et al. (2019) to more closely follow the developmental paradigm and test on a wide range of pre-trained neural networks. Across three experiments, we find that deep neural networks exhibit a preference for shape rather than texture when tested under conditions that more closely replicate the developmental procedure.  ( 2 min )
    Synthetic Control As Online Linear Regression. (arXiv:2202.08426v1 [econ.EM])
    This paper notes a simple connection between synthetic control and online learning. Specifically, we recognize synthetic control as an instance of Follow-The-Leader (FTL). Standard results in online convex optimization then imply that, even when outcomes are chosen by an adversary, synthetic control predictions of counterfactual outcomes for the treated unit perform almost as well as an oracle weighted average of control units' outcomes. Synthetic control on differenced data performs almost as well as oracle weighted difference-in-differences. We argue that this observation further supports the use of synthetic control estimators in comparative case studies.  ( 2 min )
    GRAPHSHAP: Motif-based Explanations for Black-box Graph Classifiers. (arXiv:2202.08815v1 [cs.LG])
    Most methods for explaining black-box classifiers (e.g., on tabular data, images, or time series) rely on measuring the impact that the removal/perturbation of features has on the model output. This forces the explanation language to match the classifier features space. However, when dealing with graph data, in which the basic features correspond essentially to the adjacency information describing the graph structure (i.e., the edges), this matching between features space and explanation language might not be appropriate. In this regard, we argue that (i) a good explanation method for graph classification should be fully agnostic with respect to the internal representation used by the black-box; and (ii) a good explanation language for graph classification tasks should be represented by higher-order structures, such as motifs. The need to decouple the feature space (edges) from the explanation space (motifs) is thus a major challenge towards developing actionable explanations for graph classification tasks. In this paper we introduce GRAPHSHAP, a Shapley-based approach able to provide motif-based explanations for black-box graph classifiers, assuming no knowledge whatsoever about the model or its training data: the only requirement is that the black-box can be queried at will. Furthermore, we introduce additional auxiliary components such as a synthetic graph dataset generator, algorithms for subgraph mining and ranking, a custom graph convolutional layer, and a kernel to approximate the explanation scores while maintaining linear time complexity. Finally, we test GRAPHSHAP on a real-world brain-network dataset consisting of patients affected by Autism Spectrum Disorder and a control group. Our experiments highlight how the classification provided by a black-box model can be effectively explained by few connectomics patterns.  ( 2 min )
    When, where, and how to add new neurons to ANNs. (arXiv:2202.08539v1 [cs.LG])
    Neurogenesis in ANNs is an understudied and difficult problem, even compared to other forms of structural learning like pruning. By decomposing it into triggers and initializations, we introduce a framework for studying the various facets of neurogenesis: when, where, and how to add neurons during the learning process. We present the Neural Orthogonality (NORTH*) suite of neurogenesis strategies, combining layer-wise triggers and initializations based on the orthogonality of activations or weights to dynamically grow performant networks that converge to an efficient size. We evaluate our contributions against other recent neurogenesis works with MLPs.  ( 2 min )
    LAMP: Extracting Text from Gradients with Language Model Priors. (arXiv:2202.08827v1 [cs.LG])
    Recent work shows that sensitive user data can be reconstructed from gradient updates, breaking the key privacy promise of federated learning. While success was demonstrated primarily on image data, these methods do not directly transfer to other domains such as text. In this work, we propose LAMP, a novel attack tailored to textual data, that successfully reconstructs original text from gradients. Our key insight is to model the prior probability of the text with an auxiliary language model, utilizing it to guide the search towards more natural text. Concretely, LAMP introduces a discrete text transformation procedure that minimizes both the reconstruction loss and the prior text probability, as provided by the auxiliary language model. The procedure is alternated with a continuous optimization of the reconstruction loss, which also regularizes the length of the reconstructed embeddings. Our experiments demonstrate that LAMP reconstructs the original text significantly more precisely than prior work: we recover 5x more bigrams and $23\%$ longer subsequences on average. Moreover, we are first to recover inputs from batch sizes larger than 1 for textual models. These findings indicate that gradient updates of models operating on textual data leak more information than previously thought.  ( 2 min )
    End-to-end Music Remastering System Using Self-supervised and Adversarial Training. (arXiv:2202.08520v1 [eess.AS])
    Mastering is an essential step in music production, but it is also a challenging task that has to go through the hands of experienced audio engineers, where they adjust tone, space, and volume of a song. Remastering follows the same technical process, in which the context lies in mastering a song for the times. As these tasks have high entry barriers, we aim to lower the barriers by proposing an end-to-end music remastering system that transforms the mastering style of input audio to that of the target. The system is trained in a self-supervised manner, in which released pop songs were used for training. We also anticipated the model to generate realistic audio reflecting the reference's mastering style by applying a pre-trained encoder and a projection discriminator. We validate our results with quantitative metrics and a subjective listening test and show that the model generated samples of mastering style similar to the target.  ( 2 min )
    Task-Agnostic Graph Explanations. (arXiv:2202.08335v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as powerful tools to encode graph structured data. Due to their broad applications, there is an increasing need to develop tools to explain how GNNs make decisions given graph structured data. Existing learning-based GNN explanation approaches are task-specific in training and hence suffer from crucial drawbacks. Specifically, they are incapable of producing explanations for a multitask prediction model with a single explainer. They are also unable to provide explanations in cases where the GNN is trained in a self-supervised manner, and the resulting representations are used in future downstream tasks. To address these limitations, we propose a Task-Agnostic GNN Explainer (TAGE) trained under self-supervision with no knowledge of downstream tasks. TAGE enables the explanation of GNN embedding models without downstream tasks and allows efficient explanation of multitask models. Our extensive experiments show that TAGE can significantly speed up the explanation efficiency by using the same model to explain predictions for multiple downstream tasks while achieving explanation quality as good as or even better than current state-of-the-art GNN explanation approaches.  ( 2 min )
    A Survey of Explainable Reinforcement Learning. (arXiv:2202.08434v1 [cs.LG])
    Explainable reinforcement learning (XRL) is an emerging subfield of explainable machine learning that has attracted considerable attention in recent years. The goal of XRL is to elucidate the decision-making process of learning agents in sequential decision-making settings. In this survey, we propose a novel taxonomy for organizing the XRL literature that prioritizes the RL setting. We overview techniques according to this taxonomy. We point out gaps in the literature, which we use to motivate and outline a roadmap for future work.  ( 2 min )
    The Quarks of Attention. (arXiv:2202.08371v1 [cs.LG])
    Attention plays a fundamental role in both natural and artificial intelligence systems. In deep learning, attention-based neural architectures, such as transformer architectures, are widely used to tackle problems in natural language processing and beyond. Here we investigate the fundamental building blocks of attention and their computational properties. Within the standard model of deep learning, we classify all possible fundamental building blocks of attention in terms of their source, target, and computational mechanism. We identify and study three most important mechanisms: additive activation attention, multiplicative output attention (output gating), and multiplicative synaptic attention (synaptic gating). The gating mechanisms correspond to multiplicative extensions of the standard model and are used across all current attention-based deep learning architectures. We study their functional properties and estimate the capacity of several attentional building blocks in the case of linear and polynomial threshold gates. Surprisingly, additive activation attention plays a central role in the proofs of the lower bounds. Attention mechanisms reduce the depth of certain basic circuits and leverage the power of quadratic activations without incurring their full cost.  ( 2 min )
    Limitations of Neural Collapse for Understanding Generalization in Deep Learning. (arXiv:2202.08384v1 [cs.LG])
    The recent work of Papyan, Han, & Donoho (2020) presented an intriguing "Neural Collapse" phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This opened a rich area of exploration studying this phenomenon. Our motivation is to study the upper limits of this research program: How far will understanding Neural Collapse take us in understanding deep learning? First, we investigate its role in generalization. We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property). We find that while Neural Collapse often occurs on the train set, it does not occur on the test set. We thus conclude that Neural Collapse is primarily an optimization phenomenon, with as-yet-unclear connections to generalization. Second, we investigate the role of Neural Collapse in feature learning. We show simple, realistic experiments where training longer leads to worse last-layer features, as measured by transfer-performance on a downstream task. This suggests that neural collapse is not always desirable for representation learning, as previously claimed. Finally, we give preliminary evidence of a "cascading collapse" phenomenon, wherein some form of Neural Collapse occurs not only for the last layer, but in earlier layers as well. We hope our work encourages the community to continue the rich line of Neural Collapse research, while also considering its inherent limitations.  ( 2 min )
    Time-Correlated Sparsification for Efficient Over-the-Air Model Aggregation in Wireless Federated Learning. (arXiv:2202.08420v1 [cs.IT])
    Federated edge learning (FEEL) is a promising distributed machine learning (ML) framework to drive edge intelligence applications. However, due to the dynamic wireless environments and the resource limitations of edge devices, communication becomes a major bottleneck. In this work, we propose time-correlated sparsification with hybrid aggregation (TCS-H) for communication-efficient FEEL, which exploits jointly the power of model compression and over-the-air computation. By exploiting the temporal correlations among model parameters, we construct a global sparsification mask, which is identical across devices, and thus enables efficient model aggregation over-the-air. Each device further constructs a local sparse vector to explore its own important parameters, which are aggregated via digital communication with orthogonal multiple access. We further design device scheduling and power allocation algorithms for TCS-H. Experiment results show that, under limited communication resources, TCS-H can achieve significantly higher accuracy compared to the conventional top-K sparsification with orthogonal model aggregation, with both i.i.d. and non-i.i.d. data distributions.  ( 2 min )
    Transformer for Graphs: An Overview from Architecture Perspective. (arXiv:2202.08455v1 [cs.LG])
    Recently, Transformer model, which has achieved great success in many artificial intelligence fields, has demonstrated its great potential in modeling graph-structured data. Till now, a great variety of Transformers has been proposed to adapt to the graph-structured data. However, a comprehensive literature review and systematical evaluation of these Transformer variants for graphs are still unavailable. It's imperative to sort out the existing Transformer models for graphs and systematically investigate their effectiveness on various graph tasks. In this survey, we provide a comprehensive review of various Graph Transformer models from the architectural design perspective. We first disassemble the existing models and conclude three typical ways to incorporate the graph information into the vanilla Transformer: 1) GNNs as Auxiliary Modules, 2) Improved Positional Embedding from Graphs, and 3) Improved Attention Matrix from Graphs. Furthermore, we implement the representative components in three groups and conduct a comprehensive comparison on various kinds of famous graph data benchmarks to investigate the real performance gain of each component. Our experiments confirm the benefits of current graph-specific modules on Transformer and reveal their advantages on different kinds of graph tasks.  ( 2 min )
    Structural and Semantic Contrastive Learning for Self-supervised Node Representation Learning. (arXiv:2202.08480v1 [cs.LG])
    Graph Contrastive Learning (GCL) recently has drawn much research interest for learning generalizable, transferable, and robust node representations in a self-supervised fashion. In general, the contrastive learning process in GCL is performed on top of the representations learned by a graph neural network (GNN) backbone, which transforms and propagates the node contextual information based on its local neighborhoods. However, existing GCL efforts have severe limitations in terms of both encoding architecture, augmentation, and contrastive objective, making them commonly inefficient and ineffective to use in different datasets. In this work, we go beyond the existing unsupervised GCL counterparts and address their limitations by proposing a simple yet effective framework S$^3$-CL. Specifically, by virtue of the proposed structural and semantic contrastive learning, even a simple neural network is able to learn expressive node representations that preserve valuable structural and semantic patterns. Our experiments demonstrate that the node representations learned by S$^3$-CL achieve superior performance on different downstream tasks compared to the state-of-the-art GCL methods.  ( 2 min )
    Evaluation and Analysis of Different Aggregation and Hyperparameter Selection Methods for Federated Brain Tumor Segmentation. (arXiv:2202.08261v1 [cs.LG])
    Availability of large, diverse, and multi-national datasets is crucial for the development of effective and clinically applicable AI systems in the medical imaging domain. However, forming a global model by bringing these datasets together at a central location, comes along with various data privacy and ownership problems. To alleviate these problems, several recent studies focus on the federated learning paradigm, a distributed learning approach for decentralized data. Federated learning leverages all the available data without any need for sharing collaborators' data with each other or collecting them on a central server. Studies show that federated learning can provide competitive performance with conventional central training, while having a good generalization capability. In this work, we have investigated several federated learning approaches on the brain tumor segmentation problem. We explore different strategies for faster convergence and better performance which can also work on strong Non-IID cases.  ( 2 min )
    Open-Ended Reinforcement Learning with Neural Reward Functions. (arXiv:2202.08266v1 [cs.LG])
    Inspired by the great success of unsupervised learning in Computer Vision and Natural Language Processing, the Reinforcement Learning community has recently started to focus more on unsupervised discovery of skills. Most current approaches, like DIAYN or DADS, optimize some form of mutual information objective. We propose a different approach that uses reward functions encoded by neural networks. These are trained iteratively to reward more complex behavior. In high-dimensional robotic environments our approach learns a wide range of interesting skills including front-flips for Half-Cheetah and one-legged running for Humanoid. In the pixel-based Montezuma's Revenge environment our method also works with minimal changes and it learns complex skills that involve interacting with items and visiting diverse locations. A web version of this paper which shows animations for the different skills is available in https://as.inf.ethz.ch/research/open_ended_RL/main.html  ( 2 min )
    XAI in the context of Predictive Process Monitoring: Too much to Reveal. (arXiv:2202.08265v1 [cs.LG])
    Predictive Process Monitoring (PPM) has been integrated into process mining tools as a value-adding task. PPM provides useful predictions on the further execution of the running business processes. To this end, machine learning-based techniques are widely employed in the context of PPM. In order to gain stakeholders trust and advocacy of PPM predictions, eXplainable Artificial Intelligence (XAI) methods are employed in order to compensate for the lack of transparency of most efficient predictive models. Even when employed under the same settings regarding data, preprocessing techniques, and ML models, explanations generated by multiple XAI methods differ profoundly. A comparison is missing to distinguish XAI characteristics or underlying conditions that are deterministic to an explanation. To address this gap, we provide a framework to enable studying the effect of different PPM-related settings and ML model-related choices on characteristics and expressiveness of resulting explanations. In addition, we compare how different explainability methods characteristics can shape resulting explanations and enable reflecting underlying model reasoning process  ( 2 min )
    Controlling Epidemic Spread using Probabilistic Diffusion Models on Networks. (arXiv:2202.08296v1 [cs.DS])
    The spread of an epidemic is often modeled by an SIR random process on a social network graph. The MinINF problem for optimal social distancing involves minimizing the expected number of infections, when we are allowed to break at most $B$ edges; similarly the MinINFNode problem involves removing at most $B$ vertices. These are fundamental problems in epidemiology and network science. While a number of heuristics have been considered, the complexity of these problems remains generally open. In this paper, we present two bicriteria approximation algorithms for MinINF, which give the first non-trivial approximations for this problem. The first is based on the cut sparsification result of Karger \cite{karger:mathor99}, and works when the transmission probabilities are not too small. The second is a Sample Average Approximation (SAA) based algorithm, which we analyze for the Chung-Lu random graph model. We also extend some of our results to tackle the MinINFNode problem.  ( 2 min )
  • Open

    An Equivalence Between Data Poisoning and Byzantine Gradient Attacks. (arXiv:2202.08578v1 [cs.LG])
    To study the resilience of distributed learning, the "Byzantine" literature considers a strong threat model where workers can report arbitrary gradients to the parameter server. Whereas this model helped obtain several fundamental results, it has sometimes been considered unrealistic, when the workers are mostly trustworthy machines. In this paper, we show a surprising equivalence between this model and data poisoning, a threat considered much more realistic. More specifically, we prove that every gradient attack can be reduced to data poisoning, in any personalized federated learning system with PAC guarantees (which we show are both desirable and realistic). This equivalence makes it possible to obtain new impossibility results on the resilience to data poisoning as corollaries of existing impossibility theorems on Byzantine machine learning. Moreover, using our equivalence, we derive a practical attack that we show (theoretically and empirically) can be very effective against classical personalized federated learning models.  ( 2 min )
    Limitations of Neural Collapse for Understanding Generalization in Deep Learning. (arXiv:2202.08384v1 [cs.LG])
    The recent work of Papyan, Han, & Donoho (2020) presented an intriguing "Neural Collapse" phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This opened a rich area of exploration studying this phenomenon. Our motivation is to study the upper limits of this research program: How far will understanding Neural Collapse take us in understanding deep learning? First, we investigate its role in generalization. We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property). We find that while Neural Collapse often occurs on the train set, it does not occur on the test set. We thus conclude that Neural Collapse is primarily an optimization phenomenon, with as-yet-unclear connections to generalization. Second, we investigate the role of Neural Collapse in feature learning. We show simple, realistic experiments where training longer leads to worse last-layer features, as measured by transfer-performance on a downstream task. This suggests that neural collapse is not always desirable for representation learning, as previously claimed. Finally, we give preliminary evidence of a "cascading collapse" phenomenon, wherein some form of Neural Collapse occurs not only for the last layer, but in earlier layers as well. We hope our work encourages the community to continue the rich line of Neural Collapse research, while also considering its inherent limitations.  ( 2 min )
    Robust SVM Optimization in Banach spaces. (arXiv:2202.08567v1 [stat.ML])
    We address the issue of binary classification in Banach spaces in presence of uncertainty. We show that a number of results from classical support vector machines theory can be appropriately generalised to their robust counterpart in Banach spaces. These include the Representer Theorem, strong duality for the associated Optimization problem as well as their geometric interpretation. Furthermore, we propose a game theoretic interpretation by expressing a Nash equilibrium problem formulation for the more general problem of finding the closest points in two closed convex sets when the underlying space is reflexive and smooth.  ( 2 min )
    Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks. (arXiv:1909.11957v5 [cs.LG] UPDATED)
    (Frankle & Carbin, 2019) shows that there exist winning tickets (small but critical subnetworks) for dense, randomly initialized networks, that can be trained alone to achieve comparable accuracies to the latter in a similar number of iterations. However, the identification of these winning tickets still requires the costly train-prune-retrain process, limiting their practical benefits. In this paper, we discover for the first time that the winning tickets can be identified at the very early training stage, which we term as early-bird (EB) tickets, via low-cost training schemes (e.g., early stopping and low-precision training) at large learning rates. Our finding of EB tickets is consistent with recently reported observations that the key connectivity patterns of neural networks emerge early. Furthermore, we propose a mask distance metric that can be used to identify EB tickets with low computational overhead, without needing to know the true winning tickets that emerge after the full training. Finally, we leverage the existence of EB tickets and the proposed mask distance to develop efficient training methods, which are achieved by first identifying EB tickets via low-cost schemes, and then continuing to train merely the EB tickets towards the target accuracy. Experiments based on various deep networks and datasets validate: 1) the existence of EB tickets, and the effectiveness of mask distance in efficiently identifying them; and 2) that the proposed efficient training via EB tickets can achieve up to 4.7x energy savings while maintaining comparable or even better accuracy, demonstrating a promising and easily adopted method for tackling cost-prohibitive deep network training. Code available at https://github.com/RICE-EIC/Early-Bird-Tickets.  ( 3 min )
    Deep Time Series Forecasting with Shape and Temporal Criteria. (arXiv:2104.04610v2 [stat.ML] UPDATED)
    This paper addresses the problem of multi-step time series forecasting for non-stationary signals that can present sudden changes. Current state-of-the-art deep learning forecasting methods, often trained with variants of the MSE, lack the ability to provide sharp predictions in deterministic and probabilistic contexts. To handle these challenges, we propose to incorporate shape and temporal criteria in the training objective of deep models. We define shape and temporal similarities and dissimilarities, based on a smooth relaxation of Dynamic Time Warping (DTW) and Temporal Distortion Index (TDI), that enable to build differentiable loss functions and positive semi-definite (PSD) kernels. With these tools, we introduce DILATE (DIstortion Loss including shApe and TimE), a new objective for deterministic forecasting, that explicitly incorporates two terms supporting precise shape and temporal change detection. For probabilistic forecasting, we introduce STRIPE++ (Shape and Time diverRsIty in Probabilistic forEcasting), a framework for providing a set of sharp and diverse forecasts, where the structured shape and time diversity is enforced with a determinantal point process (DPP) diversity loss. Extensive experiments and ablations studies on synthetic and real-world datasets confirm the benefits of leveraging shape and time features in time series forecasting.  ( 2 min )
    Learning Game-Theoretic Models of Multiagent Trajectories Using Implicit Layers. (arXiv:2008.07303v6 [cs.GT] UPDATED)
    For prediction of interacting agents' trajectories, we propose an end-to-end trainable architecture that hybridizes neural nets with game-theoretic reasoning, has interpretable intermediate representations, and transfers to downstream decision making. It uses a net that reveals preferences from the agents' past joint trajectory, and a differentiable implicit layer that maps these preferences to local Nash equilibria, forming the modes of the predicted future trajectory. Additionally, it learns an equilibrium refinement concept. For tractability, we introduce a new class of continuous potential games and an equilibrium-separating partition of the action space. We provide theoretical results for explicit gradients and soundness. In experiments, we evaluate our approach on two real-world data sets, where we predict highway driver merging trajectories, and on a simple decision-making transfer task.  ( 2 min )
    Identifiable Deep Generative Models via Sparse Decoding. (arXiv:2110.10804v2 [stat.ML] UPDATED)
    We develop the sparse VAE for unsupervised representation learning on high-dimensional data. The sparse VAE learns a set of latent factors (representations) which summarize the associations in the observed data features. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. As examples, in ratings data each movie is only described by a few genres; in text data each word is only applicable to a few topics; in genomics, each gene is active in only a few biological processes. We prove such sparse deep generative models are identifiable: with infinite data, the true model parameters can be learned. (In contrast, most deep generative models are not identifiable.) We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller heldout reconstruction error than related methods.  ( 2 min )
    From Geometry to Topology: Inverse Theorems for Distributed Persistence. (arXiv:2101.12288v3 [math.AT] UPDATED)
    What is the "right" topological invariant of a large point cloud X? Prior research has focused on estimating the full persistence diagram of X, a quantity that is very expensive to compute, unstable to outliers, and far from a sufficient statistic. We therefore propose that the correct invariant is not the persistence diagram of X, but rather the collection of persistence diagrams of many small subsets. This invariant, which we call "distributed persistence," is perfectly parallelizable, more stable to outliers, and has a rich inverse theory. The map from the space of point clouds (with the quasi-isometry metric) to the space of distributed persistence invariants (with the Hausdorff-Bottleneck distance) is a global quasi-isometry. This is a much stronger property than simply being injective, as it implies that the inverse of a small neighborhood is a small neighborhood, and is to our knowledge the only result of its kind in the TDA literature. Moreover, the quasi-isometry bounds depend on the size of the subsets taken, so that as the size of these subsets goes from small to large, the invariant interpolates between a purely geometric one and a topological one. Lastly, we note that our inverse results do not actually require considering all subsets of a fixed size (an enormous collection), but a relatively small collection satisfying certain covering properties that arise with high probability when randomly sampling subsets. These theoretical results are complemented by two synthetic experiments demonstrating the use of distributed persistence in practice.  ( 2 min )
    Efficient Distributed Machine Learning via Combinatorial Multi-Armed Bandits. (arXiv:2202.08302v1 [cs.IT])
    We consider the distributed stochastic gradient descent problem, where a main node distributes gradient calculations among $n$ workers from which at most $b \leq n$ can be utilized in parallel. By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade-off the error of the algorithm with its runtime by gradually increasing $k$ as the algorithm evolves. However, this strategy, referred to as adaptive k sync, can incur additional costs since it ignores the computational efforts of slow workers. We propose a cost-efficient scheme that assigns tasks only to $k$ workers and gradually increases $k$. As the response times of the available workers are unknown to the main node a priori, we utilize a combinatorial multi-armed bandit model to learn which workers are the fastest while assigning gradient calculations, and to minimize the effect of slow workers. Assuming that the mean response times of the workers are independent and exponentially distributed with different means, we give empirical and theoretical guarantees on the regret of our strategy, i.e., the extra time spent to learn the mean response times of the workers. Compared to adaptive k sync, our scheme achieves significantly lower errors with the same computational efforts while being inferior in terms of speed.  ( 2 min )
    FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things. (arXiv:1911.03314v3 [cs.LG] UPDATED)
    The growing number of low-power smart devices in the Internet of Things is coupled with the concept of "Edge Computing", that is moving some of the intelligence, especially machine learning, towards the edge of the network. Enabling machine learning algorithms to run on resource-constrained hardware, typically on low-power smart devices, is challenging in terms of hardware (optimized and energy-efficient integrated circuits), algorithmic and firmware implementations. This paper presents FANN-on-MCU, an open-source toolkit built upon the Fast Artificial Neural Network (FANN) library to run lightweight and energy-efficient neural networks on microcontrollers based on both the ARM Cortex-M series and the novel RISC-V-based Parallel Ultra-Low-Power (PULP) platform. The toolkit takes multi-layer perceptrons trained with FANN and generates code targeted at execution on low-power microcontrollers either with a floating-point unit (i.e., ARM Cortex-M4F and M7F) or without (i.e., ARM Cortex M0-M3 or PULP-based processors). This paper also provides an architectural performance evaluation of neural networks on the most popular ARM Cortex-M family and the parallel RISC-V processor called Mr. Wolf. The evaluation includes experimental results for three different applications using a self-sustainable wearable multi-sensor bracelet. Experimental results show a measured latency in the order of only a few microseconds and a power consumption of few milliwatts while keeping the memory requirements below the limitations of the targeted microcontrollers. In particular, the parallel implementation on the octa-core RISC-V platform reaches a speedup of 22x and a 69% reduction in energy consumption with respect to a single-core implementation on Cortex-M4 for continuous real-time classification.  ( 2 min )
    Infant Crying Detection in Real-World Environments. (arXiv:2005.07036v6 [eess.AS] UPDATED)
    Most existing cry detection models have been tested with data collected in controlled settings. Thus, the extent to which they generalize to noisy and lived environments is unclear. In this paper, we evaluate several established machine learning approaches including a model leveraging both deep spectrum and acoustic features. This model was able to recognize crying events with F1 score 0.613 (Precision: 0.672, Recall: 0.552), showing improved external validity over existing methods at cry detection in everyday real-world settings. As part of our evaluation, we collect and annotate a novel dataset of infant crying compiled from over 780 hours of labeled real-world audio data, captured via recorders worn by infants in their homes, which we make publicly available. Our findings confirm that a cry detection model trained on in-lab data underperforms when presented with real-world data (in-lab test F1: 0.656, real-world test F1: 0.236), highlighting the value of our new dataset and model.  ( 2 min )
    Linear Convergence of Entropy-Regularized Natural Policy Gradient with Linear Function Approximation. (arXiv:2106.04096v3 [cs.LG] UPDATED)
    Natural policy gradient (NPG) methods with entropy regularization achieve impressive empirical success in reinforcement learning problems with large state-action spaces. However, their convergence properties and the impact of entropy regularization remain elusive in the function approximation regime. In this paper, we establish finite-time convergence analyses of entropy-regularized NPG with linear function approximation under softmax parameterization. In particular, we prove that entropy-regularized NPG with averaging satisfies the \emph{persistence of excitation} condition, and achieves a fast convergence rate of $\tilde{O}(1/T)$ up to a function approximation error in regularized Markov decision processes. This convergence result does not require any a priori assumptions on the policies. Furthermore, under mild regularity conditions on the concentrability coefficient and basis vectors, we prove that entropy-regularized NPG exhibits \emph{linear convergence} up to a function approximation error.  ( 2 min )
    Transfer Learning with Multi-source Data: High-dimensional Inference for Group Distributionally Robust Models. (arXiv:2011.07568v3 [stat.ME] UPDATED)
    The construction of generalizable and transferable models is a fundamental goal of statistical learning. Learning with the multi-source data helps improve model generalizability and is integral to many important statistical problems, including group distributionally robust optimization, minimax group fairness, and maximin projection. This paper considers multiple high-dimensional regression models for the multi-source data. We introduce the covariate shift maximin effect as a group distributionally robust model. This robust model helps transfer the information from the multi-source data to the unlabelled target population. Statistical inference for the covariate shift maximin effect is challenging since its point estimator may have a non-standard limiting distribution. We devise a novel {\it DenseNet} sampling method to construct valid confidence intervals for the high-dimensional maximin effect. We show that our proposed confidence interval achieves the desired coverage level and attains a parametric length. Our proposed DenseNet sampling method and the related theoretical analysis are of independent interest in addressing other non-regular or non-standard inference problems. We demonstrate the proposed method over a large-scale simulation and genetic data on yeast colony growth under multiple environments.  ( 2 min )
    Characterizing Attacks on Deep Reinforcement Learning. (arXiv:1907.09470v3 [cs.LG] UPDATED)
    Recent studies show that Deep Reinforcement Learning (DRL) models are vulnerable to adversarial attacks, which attack DRL models by adding small perturbations to the observations. However, some attacks assume full availability of the victim model, and some require a huge amount of computation, making them less feasible for real world applications. In this work, we make further explorations of the vulnerabilities of DRL by studying other aspects of attacks on DRL using realistic and efficient attacks. First, we adapt and propose efficient black-box attacks when we do not have access to DRL model parameters. Second, to address the high computational demands of existing attacks, we introduce efficient online sequential attacks that exploit temporal consistency across consecutive steps. Third, we explore the possibility of an attacker perturbing other aspects in the DRL setting, such as the environment dynamics. Finally, to account for imperfections in how an attacker would inject perturbations in the physical world, we devise a method for generating a robust physical perturbations to be printed. The attack is evaluated on a real-world robot under various conditions. We conduct extensive experiments both in simulation such as Atari games, robotics and autonomous driving, and on real-world robotics, to compare the effectiveness of the proposed attacks with baseline approaches. To the best of our knowledge, we are the first to apply adversarial attacks on DRL systems to physical robots.  ( 2 min )
    Gradients without Backpropagation. (arXiv:2202.08587v1 [cs.LG])
    Using backpropagation to compute gradients of objective functions for optimization has remained a mainstay of machine learning. Backpropagation, or reverse-mode differentiation, is a special case within the general family of automatic differentiation algorithms that also includes the forward mode. We present a method to compute gradients based solely on the directional derivative that one can compute exactly and efficiently via the forward mode. We call this formulation the forward gradient, an unbiased estimate of the gradient that can be evaluated in a single forward run of the function, entirely eliminating the need for backpropagation in gradient descent. We demonstrate forward gradient descent in a range of problems, showing substantial savings in computation and enabling training up to twice as fast in some cases.  ( 2 min )
    Refined Convergence Rates for Maximum Likelihood Estimation under Finite Mixture Models. (arXiv:2202.08786v1 [math.ST])
    We revisit convergence rates for maximum likelihood estimation (MLE) under finite mixture models. The Wasserstein distance has become a standard loss function for the analysis of parameter estimation in these models, due in part to its ability to circumvent label switching and to accurately characterize the behaviour of fitted mixture components with vanishing weights. However, the Wasserstein metric is only able to capture the worst-case convergence rate among the remaining fitted mixture components. We demonstrate that when the log-likelihood function is penalized to discourage vanishing mixing weights, stronger loss functions can be derived to resolve this shortcoming of the Wasserstein distance. These new loss functions accurately capture the heterogeneity in convergence rates of fitted mixture components, and we use them to sharpen existing pointwise and uniform convergence rates in various classes of mixture models. In particular, these results imply that a subset of the components of the penalized MLE typically converge significantly faster than could have been anticipated from past work. We further show that some of these conclusions extend to the traditional MLE. Our theoretical findings are supported by a simulation study to illustrate these improved convergence rates.  ( 2 min )
    Hybridizing Physical and Data-driven Prediction Methods for Physicochemical Properties. (arXiv:2202.08804v1 [physics.chem-ph])
    We present a generic way to hybridize physical and data-driven methods for predicting physicochemical properties. The approach `distills' the physical method's predictions into a prior model and combines it with sparse experimental data using Bayesian inference. We apply the new approach to predict activity coefficients at infinite dilution and obtain significant improvements compared to the data-driven and physical baselines and established ensemble methods from the machine learning literature.  ( 2 min )
    Generative Models as Distributions of Functions. (arXiv:2102.04776v4 [cs.LG] UPDATED)
    Generative models are typically trained on grid-like data such as images. As a result, the size of these models usually scales directly with the underlying grid resolution. In this paper, we abandon discretized grids and instead parameterize individual data points by continuous functions. We then build generative models by learning distributions over such functions. By treating data points as functions, we can abstract away from the specific type of data we train on and construct models that are agnostic to discretization. To train our model, we use an adversarial approach with a discriminator that acts on continuous signals. Through experiments on a wide variety of data modalities including images, 3D shapes and climate data, we demonstrate that our model can learn rich distributions of functions independently of data type and resolution.  ( 2 min )
    Oracle-Efficient Online Learning for Beyond Worst-Case Adversaries. (arXiv:2202.08549v1 [cs.LG])
    In this paper, we study oracle-efficient algorithms for beyond worst-case analysis of online learning. We focus on two settings. First, the smoothed analysis setting of [RST11, HRS12] where an adversary is constrained to generating samples from distributions whose density is upper bounded by $1/\sigma$ times the uniform density. Second, the setting of $K$-hint transductive learning, where the learner is given access to $K$ hints per time step that are guaranteed to include the true instance. We give the first known oracle-efficient algorithms for both settings that depend only on the VC dimension of the class and parameters $\sigma$ and $K$ that capture the power of the adversary. In particular, we achieve oracle-efficient regret bounds of $ O ( \sqrt{T (d / \sigma )^{1/2} } ) $ and $ O ( \sqrt{T d K } )$ respectively for these setting. For the smoothed analysis setting, our results give the first oracle-efficient algorithm for online learning with smoothed adversaries [HRS21]. This contrasts the computational separation between online learning with worst-case adversaries and offline learning established by [HK16]. Our algorithms also imply improved bounds for worst-case setting with small domains. In particular, we give an oracle-efficient algorithm with regret of $O ( \sqrt{T(d \vert{\mathcal{X}}\vert ) ^{1/2} })$, which is a refinement of the earlier $O ( \sqrt{T\vert{\mathcal{X} } \vert })$ bound by [DS16].  ( 2 min )
    General Cyclical Training of Neural Networks. (arXiv:2202.08835v1 [cs.LG])
    This paper describes the principle of "General Cyclical Training" in machine learning, where training starts and ends with "easy training" and the "hard training" happens during the middle epochs. We propose several manifestations for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning. In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as three examples of this principle) are beneficial in the test accuracy performance of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology. In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks. In the spirit of reproducibility, the code used in our experiments is available at \url{https://github.com/lnsmith54/CFL}.  ( 2 min )
    Locally private nonparametric confidence intervals and sequences. (arXiv:2202.08728v1 [stat.ME])
    This work derives methods for performing nonparametric, nonasymptotic statistical inference for population parameters under the constraint of local differential privacy (LDP). Given observations $(X_1, \dots, X_n)$ with mean $\mu^\star$ that are privatized into $(Z_1, \dots, Z_n)$, we introduce confidence intervals (CI) and time-uniform confidence sequences (CS) for $\mu^\star \in \mathbb R$ when only given access to the privatized data. We introduce a nonparametric and sequentially interactive generalization of Warner's famous "randomized response" mechanism, satisfying LDP for arbitrary bounded random variables, and then provide CIs and CSs for their means given access to the resulting privatized observations. We extend these CSs to capture time-varying (non-stationary) means, and conclude by illustrating how these methods can be used to conduct private online A/B tests.  ( 2 min )
    Information Theory with Kernel Methods. (arXiv:2202.08545v1 [cs.IT])
    We consider the analysis of probability distributions through their associated covariance operators from reproducing kernel Hilbert spaces. We show that the von Neumann entropy and relative entropy of these operators are intimately related to the usual notions of Shannon entropy and relative entropy, and share many of their properties. They come together with efficient estimation algorithms from various oracles on the probability distributions. We also consider product spaces and show that for tensor product kernels, we can define notions of mutual information and joint entropies, which can then characterize independence perfectly, but only partially conditional independence. We finally show how these new notions of relative entropy lead to new upper-bounds on log partition functions, that can be used together with convex optimization within variational inference methods, providing a new family of probabilistic inference methods.  ( 2 min )
    Variational Marginal Particle Filters. (arXiv:2109.15134v2 [stat.ML] UPDATED)
    Variational inference for state space models (SSMs) is known to be hard in general. Recent works focus on deriving variational objectives for SSMs from unbiased sequential Monte Carlo estimators. We reveal that the marginal particle filter is obtained from sequential Monte Carlo by applying Rao-Blackwellization operations, which sacrifices the trajectory information for reduced variance and differentiability. We propose the variational marginal particle filter (VMPF), which is a differentiable and reparameterizable variational filtering objective for SSMs based on an unbiased estimator. We find that VMPF with biased gradients gives tighter bounds than previous objectives, and the unbiased reparameterization gradients are sometimes beneficial.  ( 2 min )
    Robust Learning in Heterogeneous Contexts. (arXiv:2105.08532v3 [stat.ML] UPDATED)
    We consider the problem of learning from training data obtained in different contexts, where the underlying context distribution is unknown and is estimated empirically. We develop a robust method that takes into account the uncertainty of the context distribution. Unlike the conventional and overly conservative minimax approach, we focus on excess risks and construct distribution sets with statistical coverage to achieve an appropriate trade-off between performance and robustness. The proposed method is computationally scalable and shown to interpolate between empirical risk minimization and minimax regret objectives. Using both real and synthetic data, we demonstrate its ability to provide robustness in worst-case scenarios without harming performance in the nominal scenario.  ( 2 min )
    Modeling High-Dimensional Data with Unknown Cut Points: A Fusion Penalized Logistic Threshold Regression. (arXiv:2202.08441v1 [stat.ME])
    In traditional logistic regression models, the link function is often assumed to be linear and continuous in predictors. Here, we consider a threshold model that all continuous features are discretized into ordinal levels, which further determine the binary responses. Both the threshold points and regression coefficients are unknown and to be estimated. For high dimensional data, we propose a fusion penalized logistic threshold regression (FILTER) model, where a fused lasso penalty is employed to control the total variation and shrink the coefficients to zero as a method of variable selection. Under mild conditions on the estimate of unknown threshold points, we establish the non-asymptotic error bound for coefficient estimation and the model selection consistency. With a careful characterization of the error propagation, we have also shown that the tree-based method, such as CART, fulfill the threshold estimation conditions. We find the FILTER model is well suited in the problem of early detection and prediction for chronic disease like diabetes, using physical examination data. The finite sample behavior of our proposed method are also explored and compared with extensive Monte Carlo studies, which supports our theoretical discoveries.  ( 2 min )
    Sequential Multivariate Change Detection with Calibrated and Memoryless False Detection Rates. (arXiv:2108.00883v2 [stat.ME] UPDATED)
    Responding appropriately to the detections of a sequential change detector requires knowledge of the rate at which false positives occur in the absence of change. Setting detection thresholds to achieve a desired false positive rate is challenging. Existing works resort to setting time-invariant thresholds that focus on the expected runtime of the detector in the absence of change, either bounding it loosely from below or targeting it directly but with asymptotic arguments that we show cause significant miscalibration in practice. We present a simulation-based approach to setting time-varying thresholds that allows a desired expected runtime to be accurately targeted whilst additionally keeping the false positive rate constant across time steps. Whilst the approach to threshold setting is metric agnostic, we show how the cost of using the popular quadratic time MMD estimator can be reduced from $O(N^2B)$ to $O(N^2+NB)$ during configuration and from $O(N^2)$ to $O(N)$ during operation, where $N$ and $B$ are the numbers of reference and bootstrap samples respectively.  ( 2 min )
    Universality of empirical risk minimization. (arXiv:2202.08832v1 [math.ST])
    Consider supervised learning from i.i.d. samples $\{{\boldsymbol x}_i,y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature vectors and ${y} \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol \theta}_1, . . . , {\boldsymbol \theta}_{\mathsf k} \in \mathbb{R}^p$ , and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n,p\to\infty$, with $n/p = \Theta(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed $-$to leading order$-$ under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).  ( 2 min )
    Solving Schr\"odinger Bridges via Maximum Likelihood. (arXiv:2106.02081v8 [stat.ML] UPDATED)
    The Schr\"odinger bridge problem (SBP) finds the most likely stochastic evolution between two probability distributions given a prior stochastic evolution. As well as applications in the natural sciences, problems of this kind have important applications in machine learning such as dataset alignment and hypothesis testing. Whilst the theory behind this problem is relatively mature, scalable numerical recipes to estimate the Schr\"odinger bridge remain an active area of research. We prove an equivalence between the SBP and maximum likelihood estimation enabling direct application of successful machine learning techniques. We propose a numerical procedure to estimate SBPs using Gaussian process and demonstrate the practical usage of our approach in numerical simulations and experiments.  ( 2 min )
    Deep Feature Fusion for Mitosis Counting. (arXiv:2002.03781v3 [cs.CV] UPDATED)
    Each woman living in the United States has about 1 in 8 chance of developing invasive breast cancer. The mitotic cell count is one of the most common tests to assess the aggressiveness or grade of breast cancer. In this prognosis, histopathology images must be examined by a pathologist using high-resolution microscopes to count the cells. Unfortunately, this can be an exhaustive task with poor reproducibility, especially for non-experts. Deep learning networks have recently been adapted to medical applications which are able to automatically localize these regions of interest. However, these region-based networks lack the ability to take advantage of the segmentation features produced by a full image CNN which are often used as a sole method of detection. Therefore, the proposed method leverages Faster RCNN for object detection while fusing segmentation features generated by a UNet with RGB image features to achieve an F-score of 0.508 on the MITOS-ATYPIA 2014 mitosis counting challenge dataset, outperforming state-of-the-art methods.  ( 2 min )
    Global Convergence of Sub-gradient Method for Robust Matrix Recovery: Small Initialization, Noisy Measurements, and Over-parameterization. (arXiv:2202.08788v1 [cs.LG])
    In this work, we study the performance of sub-gradient method (SubGM) on a natural nonconvex and nonsmooth formulation of low-rank matrix recovery with $\ell_1$-loss, where the goal is to recover a low-rank matrix from a limited number of measurements, a subset of which may be grossly corrupted with noise. We study a scenario where the rank of the true solution is unknown and over-estimated instead. The over-estimation of the rank gives rise to an over-parameterized model in which there are more degrees of freedom than needed. Such over-parameterization may lead to overfitting, or adversely affect the performance of the algorithm. We prove that a simple SubGM with small initialization is agnostic to both over-parameterization and noise in the measurements. In particular, we show that small initialization nullifies the effect of over-parameterization on the performance of SubGM, leading to an exponential improvement in its convergence rate. Moreover, we provide the first unifying framework for analyzing the behavior of SubGM under both outlier and Gaussian noise models, showing that SubGM converges to the true solution, even under arbitrarily large and arbitrarily dense noise values, and--perhaps surprisingly--even if the globally optimal solutions do not correspond to the ground truth. At the core of our results is a robust variant of restricted isometry property, called Sign-RIP, which controls the deviation of the sub-differential of the $\ell_1$-loss from that of an ideal, expected loss. As a byproduct of our results, we consider a subclass of robust low-rank matrix recovery with Gaussian measurements, and show that the number of required samples to guarantee the global convergence of SubGM is independent of the over-parameterized rank.  ( 2 min )
    Efficient and Reliable Probabilistic Interactive Learning with Structured Outputs. (arXiv:2202.08566v1 [cs.LG])
    In this position paper, we study interactive learning for structured output spaces, with a focus on active learning, in which labels are unknown and must be acquired, and on skeptical learning, in which the labels are noisy and may need relabeling. These scenarios require expressive models that guarantee reliable and efficient computation of probabilistic quantities to measure uncertainty. We identify conditions under which a class of probabilistic models -- which we denote CRISPs -- meet all of these conditions, thus delivering tractable computation of the above quantities while preserving expressiveness. Building on prior work on tractable probabilistic circuits, we illustrate how CRISPs enable robust and efficient active and skeptical learning in large structured output spaces.  ( 2 min )
    Nearest Neighbor Dirichlet Mixtures. (arXiv:2003.07953v3 [stat.ME] UPDATED)
    There is a rich literature on Bayesian methods for density estimation, which characterize the unknown density as a mixture of kernels. Such methods have advantages in terms of providing uncertainty quantification in estimation, while being adaptive to a rich variety of densities. However, relative to frequentist locally adaptive kernel methods, Bayesian approaches can be slow and unstable to implement in relying on Markov chain Monte Carlo algorithms. To maintain most of the strengths of Bayesian approaches without the computational disadvantages, we propose a class of nearest neighbor-Dirichlet mixtures. The approach starts by grouping the data into neighborhoods based on standard algorithms. Within each neighborhood, the density is characterized via a Bayesian parametric model, such as a Gaussian with unknown parameters. Assigning a Dirichlet prior to the weights on these local kernels, we obtain a pseudo-posterior for the weights and kernel parameters. A simple and embarrassingly parallel Monte Carlo algorithm is proposed to sample from the resulting pseudo-posterior for the unknown density. Desirable asymptotic properties are shown, and the methods are evaluated in simulation studies and applied to a motivating data set in the context of classification.  ( 2 min )
    Single Trajectory Nonparametric Learning of Nonlinear Dynamics. (arXiv:2202.08311v1 [cs.LG])
    Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE). More precisely, we give nonasymptotic expected $l^2$-distance bounds between the LSE and the true regression function, where expectation is evaluated on a fresh, counterfactual, trajectory. We leverage recently developed information-theoretic methods to establish the optimality of the LSE for nonparametric hypotheses classes in terms of supremum norm metric entropy and a subgaussian parameter. Next, we relate this subgaussian parameter to the stability of the underlying process using notions from dynamical systems theory. When combined, these developments lead to rate-optimal error bounds that scale as $T^{-1/(2+q)}$ for suitably stable processes and hypothesis classes with metric entropy growth of order $\delta^{-q}$. Here, $T$ is the length of the observed trajectory, $\delta \in \mathbb{R}_+$ is the packing granularity and $q\in (0,2)$ is a complexity term. Finally, we specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS).  ( 2 min )
    The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks. (arXiv:2202.08658v1 [cs.LG])
    It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parameterizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (non-linear but regular networks) no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension $d$. Our main results characterize a hierarchical property, the "merged-staircase property", that is both necessary and nearly sufficient for learning in this setting. We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low-dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost orthogonal functions.  ( 2 min )
    DeepHybrid: Deep Learning on Automotive Radar Spectra and Reflections for Object Classification. (arXiv:2202.08519v1 [cs.LG])
    Automated vehicles need to detect and classify objects and traffic participants accurately. Reliable object classification using automotive radar sensors has proved to be challenging. We propose a method that combines classical radar signal processing and Deep Learning algorithms. The range-azimuth information on the radar reflection level is used to extract a sparse region of interest from the range-Doppler spectrum. This is used as input to a neural network (NN) that classifies different types of stationary and moving objects. We present a hybrid model (DeepHybrid) that receives both radar spectra and reflection attributes as inputs, e.g. radar cross-section. Experiments show that this improves the classification performance compared to models using only spectra. Moreover, a neural architecture search (NAS) algorithm is applied to find a resource-efficient and high-performing NN. NAS yields an almost one order of magnitude smaller NN than the manually-designed one while preserving the accuracy. The proposed method can be used for example to improve automatic emergency braking or collision avoidance systems.  ( 2 min )
    The Quarks of Attention. (arXiv:2202.08371v1 [cs.LG])
    Attention plays a fundamental role in both natural and artificial intelligence systems. In deep learning, attention-based neural architectures, such as transformer architectures, are widely used to tackle problems in natural language processing and beyond. Here we investigate the fundamental building blocks of attention and their computational properties. Within the standard model of deep learning, we classify all possible fundamental building blocks of attention in terms of their source, target, and computational mechanism. We identify and study three most important mechanisms: additive activation attention, multiplicative output attention (output gating), and multiplicative synaptic attention (synaptic gating). The gating mechanisms correspond to multiplicative extensions of the standard model and are used across all current attention-based deep learning architectures. We study their functional properties and estimate the capacity of several attentional building blocks in the case of linear and polynomial threshold gates. Surprisingly, additive activation attention plays a central role in the proofs of the lower bounds. Attention mechanisms reduce the depth of certain basic circuits and leverage the power of quadratic activations without incurring their full cost.  ( 2 min )

  • Open

    [P] Computer Vision or AI/ML for assembly line defect detection?
    I am looking into using either machine learning or computer vision to perform quality control in an assembly line. Basically, I'd like to be able to detect defects on our product without the need for human intervention. Here are a few facts about our assembly line: We can present the product to an array of cameras in the same position and the same lighting conditions, every time. We produce only one product, and it must be exactly the same every time The product is not mechanically complex, but does have some nooks and crannies that may require multiple cameras and creative placement We need to process each product in roughly 10 seconds 5)The product is roughly 2' x 6" 6) The features we want to detect range from 1" to 12" in size. I.E. we aren't looking to detect tiny defects o…  ( 2 min )
    Dataset for face detection via mask-r-cnn [Project]
    Hello everyone, For a project I am doing, I need a network that is able to retrieve human faces (just the face, none of the background) from images of scenes. For this task, I am currently using Mask-R-CNN and a tiny custom dataset I built using the VIA image annotation tool, wherein I've outlined human faces in about 100 images. This gets some okay results when used as fine-tuning on a pre-trained mrcnn model, but doesn't produce good enough results. I am wondering if anyone here knows of any good mrcnn-suitable (coco annotation style) datasets for face detection in scenes? Thanks submitted by /u/machine_learning_mid [link] [comments]  ( 1 min )
    [D] Improved VQGAN explained: MaskGIT: Masked Generative Image Transformer, a 5-minute paper summary by Casual GAN Papers
    This is one of those papers with an idea that is so simple yet powerful that it really makes you wonder, how nobody has tried it yet! What I am talking about is of course changing the strange and completely unintuitive way that image transformers handle the token sequence to one that logically makes much more sense. First introduced in ViT, the left-to-right, line-by-line token processing and later generation in VQGAN (the second part of the training pipeline, the transformer prior that generates the latent code sequence from the codebook for the decoder to synthesize an image from) just worked and sort of became the norm. The authors of MaskGIT say that generating two–dimentional images in this way makes little to no sense, and I could not agree more with them. What they propose instead is to start with a sequence of MASK tokens and process the entire sequence with a bidirectional transfer by iteratively predicting, which MASK tokens should be replaced with which latent vector from the pretrained codebook. The proposed approach greatly speeds-up inference and improves performance on various image editing tasks. As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/264 Blog post: https://www.casualganpapers.com/improved-vqgan-inpainting-outpainting-conditional-editing/MaskGIT-explained.html MaskGIT arxiv / code (unofficial) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    [D] SOTA Numerical sequence classification
    What's the SOTA in numerical sequence classification? So far I've worked almost exclusively with convolutional models and I'm about to do some work on sequence classification, are RNNs still good enough or should I look into transformer models? And if I should look at transformers are there any well-known well-performing reasonably-sized architectures (similar to resnets in convolutional field)? Bonus points for any paper/literature recommendations. Thanks for advice :) submitted by /u/Darlans [link] [comments]  ( 1 min )
    [D] SGD for NNs vs SGD for svms (SGD = stochastic gradient descent)
    I was wondering about the following and maybe it s an easily answerable question for some. So for training NNs as far as I know one usually uses some variant of SGD. For SVMs SDG also exists and is called pegasos as far I know. One step foe calculatinf the gradient or doing backprop requires quite some computation. The same goes for SVMs in SGD as we need to evaluate the kernel at many samples. So my question is how come that through SGD training NNs become more feasible than for SVMs? if this wasnt the case then SVMs should surely be used much more . As NNs have so many parameters i can imagine that a backward pass must take a similar amount of time than the evaluation of the kernel at the current point SGD is applied to with all other points. Or am I completly off here complexity wise ?? submitted by /u/LeDamien19 [link] [comments]  ( 1 min )
    [N] Gym 0.22.0- the largest single release ever- is now released
    Release notes are here: https://github.com/openai/gym/releases/tag/0.22.0 submitted by /u/jkterry1 [link] [comments]
    [P] tbparse: Load tensorboard event logs as pandas DataFrames for scientific plotting; Supports both PyTorch and TensorFlow
    Hi all, Tensorboard is a nice tool to visualize experiment results. However, it is quite difficult to parse the event logs into raw data for scientific plotting. So, I've created a PyPI package to make the parsing process as simple as possible (2 lines of code) and can be installed from pip: pip install tbparse. It supports reading event files generated by PyTorch/TensorFlow/Keras/TensorboardX and can parse most of the event types supported by tensorboard. The package is open source (https://github.com/j3soon/tbparse) and the usages are documented in detail (https://tbparse.readthedocs.io/en/latest/). My friends and I have been using it for a while and find it very convenient, so I think some of you may benefit from it. I would be happy to hear your feedback and feature requests. submitted by /u/j3soon [link] [comments]  ( 1 min )
    [D] Machine Learning Categories
    Hi community, Apologies for a dumb question. Can anyone advise on a machine learning problem category/use case that operates with 'small' matrices (200x200 or smaller)? Machine Learning in finance would be one; WISDM activity recognition is the other one. Can anyone advise on more? Or maybe, there's a broader category for this type of problem? I'm not an expert in Machine Learning :) I'll briefly explain the reason for this question. I'm trying to help some of my geek friends with a pitch deck and need to size the market. They are too busy coding and this sort of question is hard to Google. The product they developed is an Algorithmic Adjoint Differentiation library for C++ that performs 6-8 times faster for this sort of problem, on a CPU - than Tensorflow on GPU. [Discussion] submitted by /u/Natashamanito [link] [comments]  ( 1 min )
    [P] Partial segmentation of certain pixel patches
    I run a supervised DL mask to finds shapes of objects on an image. However sometimes the shapes are so cluttered it can't make out the edges so it finds nothing on part of the image. See example sketch. https://imgur.com/a/uwbM5Wv I want to run split-screen segmentation and a separate classifier on sections of the image where no shapes were found. This is a backup plan so even when the samples are cluttered i have an alternate way to classify them. I didn't know if it was possible to enact multiple types of segmentation at the same time in different pixel locations on the image. Hoping someone can point me in the right direction submitted by /u/Charming_Top6003 [link] [comments]  ( 1 min )
    [N] Machine learning solutions for reducing exclusion of persons with hearing loss from cultural content (CFP)
    Hearing loss is a common health problem in today’s societies. For example there are around 50 millions of “hard of hearing” people in the EU (around 9% of the total population). One of the biggest challenges of mild or severe hearing impaired listeners is being able to understand speech, music and generally sounds that are contained in cultural works such as theatrical plays, movies and concerts. This obvious difficulty increases the level of exclusion of people with hearing loss from such activities and therefore significantly influences their quality of life. Advances in the technology of hearing aids has only partly improved the ability of the people with mild hearing loss to participate in such cultural events, but it has not helped in severe hearing loss cases. At the same time, mod…  ( 2 min )
    [D] How to heuristically check whether found min/max in DS/ML is global and not local?
    I was asked in an interview about how would one check that found min/max in DS/ML problems is global and not local? Now, that was not about Newton or Gradient-Descent or Grid-Search, but rather how one would be quite certain that found min/max of an arbitrary objective function, with whatever method was used to be found, is global? Mind you, this was neither math proof question nor coding nor domain specific, i.e. not for insurance or finance or etc, but rather general methodology/strategy inquiry - probably heuristic in nature and not 100% certain. After I presented some strategies, interviewer asked "do you know X?", which I did not, and am assuming that was the thing for checking global min/max this person had in mind. I did not have time to ask about it, and also I forgot the term used as I was tired post-interview. I tried to search online for this topic, but the results are either trivial advice for novices or discussing unrelated optimization. Thoughts? Thank you in advance. submitted by /u/a90501 [link] [comments]  ( 5 min )
    [R] Resources on multivariate time series analysis
    I am looking for books, papers or any other reliable sources of information on different multivariate time series analysis techniques, like M-SSA, ICA, Slow feature analysis, FPCA, and so on. I need to make an overview of what techniques best suit my appication (64-channel EMG signals with various contaminations that need to be cleaned for a regressive model using short time windows). Any help or suggestions are much appreciated. For reference, I am now looking into the following techniques: Principial component analysis (Multivariate) Functional PCA Dynamic mode decomposition Slow feature analysis Variational mode decomposition Independent component analysis Singular spectrum analysis Canonincal factor analysis (LSTM) Autoencoders If there are any techniques that you are familiar with that might fit the bill, please let me know! Thanks a lot! submitted by /u/jacksodus [link] [comments]  ( 1 min )
    [P] Topic models in contextual advertising
    I wrote a blog post about how my team and I at Schibsted, use probabilistic topic models, i.e. Latent Dirichlet Allocation (LDA), in the scope of contextual (cookie-less) ad targeting. We use the models to both suggest the most relevant keywords, i.e. keyword expansion, during ad segment creation, and also as a means of quickly getting the overview of our news media content, and its engagement. It's worked beautifully for us, in terms of helping our product specialists quickly construct highly relevant ad content. I hope there are enough details in the blog post to both introduce topic modelling to wider audiences, but also provide some techniques and lessons learned for those looking to deploy similar approaches into production. https://medium.com/p/2bc56d624c5f submitted by /u/newcomb_benford_law [link] [comments]  ( 1 min )
    [D] Metaheuristic vs machine learning?
    What's the difference between the terms "metaheuristic" vs machine learning? The wiki for metaheuristic articulates many similar ideas in machine learning yet I don't often see explicit connections between the two in literature. submitted by /u/leighscullyyang [link] [comments]  ( 1 min )
    [D] What are the biggest architectural innovations since Transformers?
    Title. Transformers, to my knowledge, seems to have been the last big innovation in architecture (following from CNNs, LSTMS, graphs) that has really propagated into every day use. What paper(s) from the last few years are the "next" Transformers, in your opinion? submitted by /u/MikeFent0n [link] [comments]  ( 2 min )
    [R] Gradients without Backpropagation
    submitted by /u/hardmaru [link] [comments]  ( 2 min )
    Question on iteratively updating feature weights [P]
    Hey all, I currently have an NLP problem I am grappling with and could use input on. It's not a traditional machine learning problem per se, but it does involve optimization so I thought I could get help here. Essentially, I am have a dataframe with two columns: Features and Scores, where each feature represents a predetermined key phrase that I am matching in a given text string. What I am then doing is taking those key phrases, finding all occurrences in document, and then adding up the weights to get a total score for a given text. The problem at hand is optimizing the weights, which are currently just a rough initialization. I was thinking of grid searching, but with 1000 features and weights I doubt that that's a good idea. Similarly, neural networks seem to be out given the nature of what I'm working with. Thoughts and suggestions are much appreciated. Thanks! submitted by /u/fractionesque [link] [comments]  ( 1 min )
    [R] Tensor Moments of Gaussian Mixture Models: Theory and Applications
    Hi fellow redditors, Just this week we put on arxiv a new paper on using higher order moments for learning Gaussian Mixture Models (GMMs). The main contributions are a relatively nice formula for moments of Gaussian distributions and GMMs, and maybe more importantly, a formulation of the method of moments, that attempts to match the empirical tensors with the model ones (using gradient descent) without explicitly forming the tensors. More details in the paper! The code is not available online yet, but we plan on making it available really soon. We noticed the paper was trending on deepai.org, so we thought why not post it here too? AMA. I'd love some feedback or to answer some questions you might have. submitted by /u/joao-m-pereira [link] [comments]  ( 1 min )
  • Open

    How do I run deep/reinforcement learning python/pytorch code online?
    Hey all, I'm a noob and poor student. I do not want to buy a computer to run the deep/reinforcement learning experiment. :( So what are my options online? People told me to things like Google AI lab and EC2 may work? ... I don't know. I need more flexibility than what Google Colab offers like working with .py and may be having access to a terminal. Again I'm a NOOB, any advices would be great. submitted by /u/move37th [link] [comments]  ( 1 min )
    Gym 0.22.0 is now released!
    submitted by /u/jkterry1 [link] [comments]
    Active RL
    Does anyone has ressources to learn about active RL? It is a subfield of RL where observation is made at a cost. Getting started on the topic, looks really cool submitted by /u/selim_amrouni [link] [comments]  ( 1 min )
    DIAMBRA Arena 🤖⚔️🤖, a software package featuring a collection of high-quality environments for ReinforcementLearning research and experimentation (links in comments)
    submitted by /u/DIAMBRA_AIArena [link] [comments]  ( 1 min )
    Help in identifying the algorithm used in RL code
    Hi, I'm fairly new to reinforcement learning. Can anyone help me to identify the reinforcement learning algorithm used in the following project? I've tried and couldn't identify it. Any help is appreciated TIA. This is the GitHub link for the project code This is the link for the agent in the same. submitted by /u/ytutshange [link] [comments]  ( 1 min )
    Are there scenarios where only DDPG is applicable instead of Actor critic methods like A2C/A3C/PPO etc and vice versa ?
    submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
    Actor Critic, Q-Learning & SARSA - Real life examples!
    Hi RL fans, I'm TA'ing an undergrad 300 level psychology class where they are learning about animal & human learning. I want to give them a swift tour through Actor Critic, Q-Learning and SARSA before they go to the main class on these (the TA'd class is showing the going through the content before the class for scheduling reasons). I would love to give them some intuitive real life examples where each of these algorithms might be best used (not even necessarily best, but intuitively we are making decisions using similar elements). I want to help them connect to the material beyond gridworld. Any ideas? Bonus points if you can connect it to dog behavior, as I've used this as an example throughout the semester. submitted by /u/Newt_Seldon [link] [comments]  ( 1 min )
  • Open

    Improved VQGAN explained: MaskGIT: Masked Generative Image Transformer, a 5-minute paper summary by Casual GAN Papers
    This is one of those papers with an idea that is so simple yet powerful that it really makes you wonder, how nobody has tried it yet! What I am talking about is of course changing the strange and completely unintuitive way that image transformers handle the token sequence to one that logically makes much more sense. First introduced in ViT, the left-to-right, line-by-line token processing and later generation in VQGAN (the second part of the training pipeline, the transformer prior that generates the latent code sequence from the codebook for the decoder to synthesize an image from) just worked and sort of became the norm. The authors of MaskGIT say that generating two–dimentional images in this way makes little to no sense, and I could not agree more with them. What they propose instead is to start with a sequence of MASK tokens and process the entire sequence with a bidirectional transfer by iteratively predicting, which MASK tokens should be replaced with which latent vector from the pretrained codebook. The proposed approach greatly speeds-up inference and improves performance on various image editing tasks. As for the details, let’s dive in, shall we? Full summary: https://t.me/casual_gan/264 Blog post: https://www.casualganpapers.com/improved-vqgan-inpainting-outpainting-conditional-editing/MaskGIT-explained.html MaskGIT arxiv / code (unofficial) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    A recipe to study Graph Neural Networks (GNNs)
    submitted by /u/pmz [link] [comments]
    Day 4 of posting motivational/inspiring quotes generated by ai.
    submitted by /u/ScottBakerJr [link] [comments]
    Scientists Warn That New AI-Generated Faces Are Seen as More Trustworthy Than Real Ones. THIS COULD BE A PROBLEM.
    submitted by /u/LisaMck041 [link] [comments]
    Scientists Warn That New AI-Generated Faces Are Seen as More Trustworthy Than Real Ones
    submitted by /u/LisaMck041 [link] [comments]  ( 1 min )
    Top 10 Language Detection APIs
    submitted by /u/tah_zem [link] [comments]
    Deepmind Open-Sources DM21: A Deep Learning Model For Quantum Chemistry
    Deepmind Open-sources ‘DM21’, a neural network model for mapping electron density to chemical interaction energy, a critical component of quantum mechanical modeling. DM21 outperforms standard models on various benchmarks, and it’s accessible as a PySCF simulation framework addition. In a paper published in Science, the model was detailed. The energy density functional component of Density Functional Theory (DFT), which describes the quantum mechanical behavior of molecules, is approximated by DM21 using a neural network. DM21 corrects systemic flaws in prior functional approximations, which failed to treat systems with “fractional electron character” appropriately. The model uses a multilayer perceptron (MLP) architecture with a grid of electron densities. It surpassed four of the “highest performing” current implementations on three benchmark datasets: Bond-breaking Benchmark (BBB), GMTKN55, and QM9. Continue Reading Paper: https://www.science.org/stoken/author-tokens/ST-218/full Github: https://github.com/deepmind/deepmind-research/tree/master/density_functional_approximation_dm21 ​ https://preview.redd.it/28feplmy5ki81.jpg?width=2763&format=pjpg&auto=webp&s=62bb00318315fef249e30d9d6f6c71fd4d207b55 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    EPFL and DeepMind use AI to control plasmas for nuclear fusion
    submitted by /u/qptbook [link] [comments]
    Last Week in AI: AI for Ancient Languages, MoviePass 2.0 has Eye Tracking for Ads, AI Makes Pulpy Sci Fi Covers
    submitted by /u/regalalgorithm [link] [comments]
    The D o l p h i n in you.
    submitted by /u/Bozzooo [link] [comments]
    Artificial Nightmares : Cybergold || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]  ( 1 min )
    A New Study from CMU and Bosch Center for AI Demonstrated a New Transformer Paradigm in Computer Vision
    After leveraging Convolutional Neural Network (CNN) for many years, since the advent of Transformers in Natural Language Processing (NLP), the computer vision community has also focused very carefully (and sometimes trying a little too hard) on how to apply Transformers in vision. This stems primarily from the enormous success of ViT, which outperformed CNN’s state-of-the-art when trained with insanely large datasets. The authors of ViT proposed a straightforward adaptation of the original Transformer, as ViT is composed simply of a “patchify” layer, which divides the image into patches treated as tokens, followed by a linear projection and the Transformer encoder module (composed of the Self Attention and the MLP module), which remained unchanged. Continue Reading Paper: https://arxiv.org/pdf/2201.09792v1.pdf Github: https://github.com/locuslab/convmixer https://preview.redd.it/16lx29lglii81.png?width=1286&format=png&auto=webp&s=2eb29bb5b10a5fde4e31a83857e60357df4c2c5e submitted by /u/ai-lover [link] [comments]  ( 1 min )
  • Open

    Algorithms at Work
    Brian Christian and Tom Griffiths have just published a new book, Algorithms at Work. You can hear me talking about queues in Episode 2. More on queueing theory What happens when you add a new teller? Queueing and economies of scale Upper bound on wait times Algorithms at Work first appeared on John D. Cook.  ( 1 min )
  • Open

    Robust Routing Using Electrical Flows
    Posted by Ali Kemal Sinop and Kostas Kollias, Research Scientists, Google Research In the world of networks, there are models that can explain observations across a diverse collection of applications. These include simple tasks such as computing the shortest path, which has obvious applications to routing networks but also applies in biology, e.g., where the slime mold Physarum is able to find shortest paths in mazes. Another example is Braess’s paradox — the observation that adding resources to a network can have an effect opposite to the one expected — which manifests not only in road networks but also in mechanical and electrical systems. For instance, constructing a new road can increase traffic congestion or adding a new link in an electrical circuit can increase voltage. Such connec…  ( 8 min )
  • Open

    Guinness World Record Awarded for Fastest DNA Sequencing — Just 5 Hours
    Guinness World Records this week presented a Stanford University-led research team with the first record for fastest DNA sequencing technique — a benchmark set using a workflow sped up by AI and accelerated computing. Achieved in five hours and two minutes, the DNA sequencing record can allow clinicians to take a blood draw from a Read article > The post Guinness World Record Awarded for Fastest DNA Sequencing — Just 5 Hours appeared first on The Official NVIDIA Blog.  ( 3 min )
    Take the Future for a Spin at GTC 2022
    Don’t miss the chance to experience the breakthroughs that are driving the future of autonomy. NVIDIA GTC will bring together the leaders, researchers and developers who are ushering in the era of autonomous vehicles. The virtual conference, running March 21-24, also features experts from industries transformed by AI, such as healthcare, robotics and finance. And Read article > The post Take the Future for a Spin at GTC 2022 appeared first on The Official NVIDIA Blog.  ( 3 min )
  • Open

    Generating landscapes without people
    I've been experimenting with GauGAN2, released in Nov 2021 as a follow-on to GauGAN. One new thing Nvidia added in GauGAN2 is the ability to generate a picture to match a phrase. "A rocky stream in an ancient mossy rainforest" It can do more than  ( 4 min )
    Bonus: it draws whatever you want, as long as it's what it already had in mind
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    Classification of star types using Neural Designer
    This model classifies the different types of stars by means of artificial intelligence using Neural Designer. https://www.neuraldesigner.com/learning/examples/star-type https://preview.redd.it/jnnjgstatli81.jpg?width=567&format=pjpg&auto=webp&s=bb46e606d9d9acc03481f30f480f7ac1a1386a6e submitted by /u/datapablo [link] [comments]
    A New Study from CMU and Bosch Center for AI Demonstrated a New Transformer Paradigm in Computer Vision
    submitted by /u/moetsi_op [link] [comments]  ( 1 min )
    Currently training my first neural network with a collection of 200 characters I have been developing over the past couple of years called Sonics. I’m only in hour 12 and it’s already learned the body shapes and some colors.
    submitted by /u/collinleary [link] [comments]  ( 1 min )
    Deepmind Open-Sources DM21: A Deep Learning Model For Quantum Chemistry
    Deepmind Open-sources ‘DM21’, a neural network model for mapping electron density to chemical interaction energy, a critical component of quantum mechanical modeling. DM21 outperforms standard models on various benchmarks, and it’s accessible as a PySCF simulation framework addition. In a paper published in Science, the model was detailed. The energy density functional component of Density Functional Theory (DFT), which describes the quantum mechanical behavior of molecules, is approximated by DM21 using a neural network. DM21 corrects systemic flaws in prior functional approximations, which failed to treat systems with “fractional electron character” appropriately. The model uses a multilayer perceptron (MLP) architecture with a grid of electron densities. It surpassed four of the “highest performing” current implementations on three benchmark datasets: Bond-breaking Benchmark (BBB), GMTKN55, and QM9. Continue Reading Paper: https://www.science.org/stoken/author-tokens/ST-218/full Github: https://github.com/deepmind/deepmind-research/tree/master/density_functional_approximation_dm21 ​ https://preview.redd.it/p412ok426ki81.jpg?width=2763&format=pjpg&auto=webp&s=bf045beb3ec515e9178f3ffd7a32f6787cbdab32 submitted by /u/ai-lover [link] [comments]  ( 1 min )
  • Open

    E(n) Equivariant Graph Neural Networks. (arXiv:2102.09844v3 [cs.LG] UPDATED)
    This paper introduces a new model to learn graph neural networks equivariant to rotations, translations, reflections and permutations called E(n)-Equivariant Graph Neural Networks (EGNNs). In contrast with existing methods, our work does not require computationally expensive higher-order representations in intermediate layers while it still achieves competitive or better performance. In addition, whereas existing methods are limited to equivariance on 3 dimensional spaces, our model is easily scaled to higher-dimensional spaces. We demonstrate the effectiveness of our method on dynamical systems modelling, representation learning in graph autoencoders and predicting molecular properties.  ( 2 min )
    Rethink the Evaluation for Attack Strength of Backdoor Attacks in Natural Language Processing. (arXiv:2201.02993v2 [cs.CL] UPDATED)
    It has been shown that natural language processing (NLP) models are vulnerable to a kind of security threat called the Backdoor Attack, which utilizes a `backdoor trigger' paradigm to mislead the models. The most threatening backdoor attack is the stealthy backdoor, which defines the triggers as text style or syntactic. Although they have achieved an incredible high attack success rate (ASR), we find that the principal factor contributing to their ASR is not the `backdoor trigger' paradigm. Thus the capacity of these stealthy backdoor attacks is overestimated when categorized as backdoor attacks. Therefore, to evaluate the real attack power of backdoor attacks, we propose a new metric called attack successful rate difference (ASRD), which measures the ASR difference between clean state and poison state models. Besides, since the defenses against stealthy backdoor attacks are absent, we propose Trigger Breaker, consisting of two too simple tricks that can defend against stealthy backdoor attacks effectively. Experiments show that our method achieves significantly better performance than state-of-the-art defense methods against stealthy backdoor attacks.  ( 2 min )
    Fair Wrapping for Black-box Predictions. (arXiv:2201.12947v2 [stat.ML] UPDATED)
    We introduce a new family of techniques to post-process ("wrap") a black-box classifier in order to reduce its bias. Our technique builds on the recent analysis of improper loss functions whose optimisation can correct any twist in prediction, unfairness being treated as a twist. In the post-processing, we learn a wrapper function which we define as an {\alpha}-tree, which modifies the prediction. We provide two generic boosting algorithms to learn {\alpha}-trees. We show that our modification has appealing properties in terms of composition of{\alpha}-trees, generalization, interpretability, and KL divergence between modified and original predictions. We exemplify the use of our technique in three fairness notions: conditional value at risk, equality of opportunity, and statistical parity; and provide experiments on several readily available datasets.  ( 2 min )
    DeepECMP: Predicting Extracellular Matrix Proteins using Deep Learning. (arXiv:2110.03689v2 [q-bio.QM] UPDATED)
    Introduction: The extracellular matrix (ECM) is a networkof proteins and carbohydrates that has a structural and bio-chemical function. The ECM plays an important role in dif-ferentiation, migration and signaling. Several studies havepredicted ECM proteins using machine learning algorithmssuch as Random Forests, K-nearest neighbours and supportvector machines but is yet to be explored using deep learn-ing. Method: DeepECMP was developed using several previ-ously used ECM datasets, asymmetric undersampling andan ensemble of 11 feed-forward neural networks. Results: The performance of DeepECMP was 83.6% bal-anced accuracy which outperformed several algorithms. Inaddition, the pipeline of DeepECMP has been shown to behighly efficient. Conclusion: This paper is the first to focus on utilizingdeep learning for ECM prediction. Several limitations areovercome by DeepECMP such as computational expense,availability to the public and usability outside of the humanspecies  ( 2 min )
    Deep Learning based Coverage and Rate Manifold Estimation in Cellular Networks. (arXiv:2202.06390v1 [cs.NI] CROSS LISTED)
    This article proposes Convolutional Neural Network-based Auto Encoder (CNN-AE) to predict location-dependent rate and coverage probability of a network from its topology. We train the CNN utilising BS location data of India, Brazil, Germany, and the USA and compare its performance with stochastic geometry (SG) based analytical models. In comparison to the best-fitted SG-based model, CNN-AE improves the coverage and rate prediction errors by a margin of as large as $40\%$ and $25\%$ respectively. As an application, we propose a low complexity, provably convergent algorithm that, using trained CNN-AE, can compute locations of new BSs that need to be deployed in a network in order to satisfy pre-defined spatially heterogeneous performance goals.  ( 2 min )
    Learning Reward Models for Cooperative Trajectory Planning with Inverse Reinforcement Learning and Monte Carlo Tree Search. (arXiv:2202.06443v2 [cs.LG] UPDATED)
    Cooperative trajectory planning methods for automated vehicles, are capable to solve traffic scenarios that require a high degree of cooperation between traffic participants. In order for cooperative systems to integrate in human-centered traffic, it is important that the automated systems behave human-like, so that humans can anticipate the system's decisions. While Reinforcement Learning has made remarkable progress in solving the decision making part, it is non-trivial to parameterize a reward model that yields predictable actions. This work employs feature-based Maximum Entropy Inverse Reinforcement Learning in combination with Monte Carlo Tree Search to learn reward models that maximizes the likelihood of recorded multi-agent cooperative expert trajectories. The evaluation demonstrates that the approach is capable of recovering a reasonable reward model that mimics the expert and performs similar to a manually tuned baseline reward model.  ( 2 min )
    Learning to summarize from human feedback. (arXiv:2009.01325v3 [cs.CL] UPDATED)
    As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.  ( 2 min )
    Regularizing Instabilities in Image Reconstruction Arising from Learned Denoisers. (arXiv:2108.13551v2 [cs.LG] UPDATED)
    It's well-known that inverse problems are ill-posed and to solve them meaningfully one has to employ regularization methods. Traditionally, popular regularization methods have been the penalized Variational approaches. In recent years, the classical regularized-reconstruction approaches have been outclassed by the (deep-learning-based) learned reconstruction algorithms. However, unlike the traditional regularization methods, the theoretical underpinnings, such as stability and regularization, have been insufficient for such learned reconstruction algorithms. Hence, the results obtained from such algorithms, though empirically outstanding, can't always be completely trusted, as they may contain certain instabilities or (hallucinated) features arising from the learned process. In fact, it has been shown that such learning algorithms are very susceptible to small (adversarial) noises in the data and can lead to severe instabilities in the recovered solution, which can be quite different than the inherent instabilities of the ill-posed (inverse) problem. Whereas, the classical regularization methods can handle such (adversarial) noises very well and can produce stable recovery. Here, we try to present certain regularization methods to stabilize such (unstable) learned reconstruction methods and recover a regularized solution, even in the presence of adversarial noises. For this, we need to extend the classical notion of regularization and incorporate it in the learned reconstruction algorithms. We also present some regularization techniques to regularize two of the most popular learning reconstruction algorithms, the Learned Post-Processing Reconstruction and the Learned Unrolling Reconstruction.  ( 2 min )
    GSN: A Universal Graph Neural Network Inspired by Spring Network. (arXiv:2201.12994v2 [cs.LG] UPDATED)
    The design of universal Graph Neural Networks (GNNs) that operate on both homophilous and heterophilous graphs has received increased research attention in recent years. Existing heterophilous GNNs, particularly those designed in the spatial domain, lack a convincing theoretical or physical motivation. In this paper, we propose the Graph Spring Network (GSN), a universal GNN model that works for both homophilous and heterophilous graphs, inspired by spring networks and metric learning. We show that the GSN framework interprets many existing GNN models from the perspective of spring potential energy minimization with various metrics, which gives these models strong physical motivations. We also conduct extensive experiments to demonstrate our GSN framework's superior performance on real-world homophilous and heterophilous data sets.  ( 2 min )
    Neural Capacity Estimators: How Reliable Are They?. (arXiv:2111.07401v3 [cs.IT] UPDATED)
    Recently, several methods have been proposed for estimating the mutual information from sample data using deep neural networks and without the knowing closed form distribution of the data. This class of estimators is referred to as neural mutual information estimators. Although very promising, such techniques have yet to be rigorously bench-marked so as to establish their efficacy, ease of implementation, and stability for capacity estimation which is joint maximization frame-work. In this paper, we compare the different techniques proposed in the literature for estimating capacity and provide a practitioner perspective on their effectiveness. In particular, we study the performance of mutual information neural estimator (MINE), smoothed mutual information lower-bound estimator (SMILE), and directed information neural estimator (DINE) and provide insights on InfoNCE. We evaluated these algorithms in terms of their ability to learn the input distributions that are capacity approaching for the AWGN channel, the optical intensity channel, and peak power-constrained AWGN channel. For both scenarios, we provide insightful comments on various aspects of the training process, such as stability, sensitivity to initialization.  ( 2 min )
    Bias and unfairness in machine learning models: a systematic literature review. (arXiv:2202.08176v1 [cs.LG])
    One of the difficulties of artificial intelligence is to ensure that model decisions are fair and free of bias. In research, datasets, metrics, techniques, and tools are applied to detect and mitigate algorithmic unfairness and bias. This study aims to examine existing knowledge on bias and unfairness in Machine Learning models, identifying mitigation methods, fairness metrics, and supporting tools. A Systematic Literature Review found 40 eligible articles published between 2017 and 2022 in the Scopus, IEEE Xplore, Web of Science, and Google Scholar knowledge bases. The results show numerous bias and unfairness detection and mitigation approaches for ML technologies, with clearly defined metrics in the literature, and varied metrics can be highlighted. We recommend further research to define the techniques and metrics that should be employed in each case to standardize and ensure the impartiality of the machine learning model, thus, allowing the most appropriate metric to detect bias and unfairness in a given context.  ( 2 min )
    Functional connectivity ensemble method to enhance BCI performance (FUCONE). (arXiv:2111.03122v2 [q-bio.NC] UPDATED)
    Functional connectivity is a key approach to investigate oscillatory activities of the brain that provides important insights on the underlying dynamic of neuronal interactions and that is mostly applied for brain activity analysis. Building on the advances in information geometry for brain-computer interface, we propose a novel framework that combines functional connectivity estimators and covariance-based pipelines to classify mental states, such as motor imagery. A Riemannian classifier is trained for each estimator and an ensemble classifier combines the decisions in each feature space. A thorough assessment of the functional connectivity estimators is provided and the best performing pipeline, called FUCONE, is evaluated on different conditions and datasets. Using a meta-analysis to aggregate results across datasets, FUCONE performed significantly better than all state-of-the-art methods. The performance gain is mostly imputable to the improved diversity of the feature spaces, increasing the robustness of the ensemble classifier with respect to the inter- and intra-subject variability.  ( 2 min )
    Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment. (arXiv:2109.11679v3 [stat.ML] UPDATED)
    Algorithmic recommendations and decisions have become ubiquitous in today's society. Many of these and other data-driven policies, especially in the realm of public policy, are based on known, deterministic rules to ensure their transparency and interpretability. For example, algorithmic pre-trial risk assessments, which serve as our motivating application, provide relatively simple, deterministic classification scores and recommendations to help judges make release decisions. How can we use the data based on existing deterministic policies to learn new and better policies? Unfortunately, prior methods for policy learning are not applicable because they require existing policies to be stochastic rather than deterministic. We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy by minimizing the worst-case regret. The resulting policy is conservative but has a statistical safety guarantee, allowing the policy-maker to limit the probability of producing a worse outcome than the existing policy. We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations. Lastly, we apply the proposed methodology to a unique field experiment on pre-trial risk assessment instruments. We derive new classification and recommendation rules that retain the transparency and interpretability of the existing instrument while potentially leading to better overall outcomes at a lower cost.  ( 2 min )
    CITRIS: Causal Identifiability from Temporal Intervened Sequences. (arXiv:2202.03169v2 [cs.LG] UPDATED)
    Understanding the latent causal factors of a dynamical system from visual observations is a crucial step towards agents reasoning in complex environments. In this paper, we propose CITRIS, a variational autoencoder framework that learns causal representations from temporal sequences of images in which underlying causal factors have possibly been intervened upon. In contrast to the recent literature, CITRIS exploits temporality and observing intervention targets to identify scalar and multidimensional causal factors, such as 3D rotation angles. Furthermore, by introducing a normalizing flow, CITRIS can be easily extended to leverage and disentangle representations obtained by already pretrained autoencoders. Extending previous results on scalar causal factors, we prove identifiability in a more general setting, in which only some components of a causal factor are affected by interventions. In experiments on 3D rendered image sequences, CITRIS outperforms previous methods on recovering the underlying causal variables. Moreover, using pretrained autoencoders, CITRIS can even generalize to unseen instantiations of causal factors, opening future research areas in sim-to-real generalization for causal representation learning.  ( 2 min )
    Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model. (arXiv:2202.08171v1 [cs.CL])
    Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.  ( 2 min )
    Physics in the Machine: Integrating Physical Knowledge in Autonomous Phase-Mapping. (arXiv:2111.07478v2 [cond-mat.mtrl-sci] UPDATED)
    Application of artificial intelligence (AI), and more specifically machine learning, to the physical sciences has expanded significantly over the past decades. In particular, science-informed AI, also known as scientific AI or inductive bias AI, has grown from a focus on data analysis to now controlling experiment design, simulation, execution and analysis in closed-loop autonomous systems. The CAMEO (closed-loop autonomous materials exploration and optimization) algorithm employs scientific AI to address two tasks: learning a material system's composition-structure relationship and identifying materials compositions with optimal functional properties. By integrating these, accelerated materials screening across compositional phase diagrams was demonstrated, resulting in the discovery of a best-in-class phase change memory material. Key to this success is the ability to guide subsequent measurements to maximize knowledge of the composition-structure relationship, or phase map. In this work we investigate the benefits of incorporating varying levels of prior physical knowledge into CAMEO's autonomous phase-mapping. This includes the use of ab-initio phase boundary data from the AFLOW repositories, which has been shown to optimize CAMEO's search when used as a prior.  ( 2 min )
    Deep Semi-Supervised Learning for Time Series Classification. (arXiv:2102.03622v2 [cs.LG] UPDATED)
    While Semi-supervised learning has gained much attention in computer vision on image data, yet limited research exists on its applicability in the time series domain. In this work, we investigate the transferability of state-of-the-art deep semi-supervised models from image to time series classification. We discuss the necessary model adaptations, in particular an appropriate model backbone architecture and the use of tailored data augmentation strategies. Based on these adaptations, we explore the potential of deep semi-supervised learning in the context of time series classification by evaluating our methods on large public time series classification problems with varying amounts of labelled samples. We perform extensive comparisons under a decidedly realistic and appropriate evaluation scheme with a unified reimplementation of all algorithms considered, which is yet lacking in the field. We find that these transferred semi-supervised models show significant performance gains over strong supervised, semi-supervised and self-supervised alternatives, especially for scenarios with very few labelled samples.  ( 2 min )
    Learning Across Bandits in High Dimension via Robust Statistics. (arXiv:2112.14233v2 [stat.ML] UPDATED)
    Decision-makers often face the "many bandits" problem, where one must simultaneously learn across related but heterogeneous contextual bandit instances. For instance, a large retailer may wish to dynamically learn product demand across many stores to solve pricing or inventory problems, making it desirable to learn jointly for stores serving similar customers; alternatively, a hospital network may wish to dynamically learn patient risk across many providers to allocate personalized interventions, making it desirable to learn jointly for hospitals serving similar patient populations. We study the setting where the unknown parameter in each bandit instance can be decomposed into a global parameter plus a sparse instance-specific term. Then, we propose a novel two-stage estimator that exploits this structure in a sample-efficient way by using a combination of robust statistics (to learn across similar instances) and LASSO regression (to debias the results). We embed this estimator within a bandit algorithm, and prove that it improves asymptotic regret bounds in the context dimension $d$; this improvement is exponential for data-poor instances. We further demonstrate how our results depend on the underlying network structure of bandit instances. Finally, we illustrate the value of our approach on synthetic and real datasets.  ( 2 min )
    Data Augmentation for Deep Graph Learning: A Survey. (arXiv:2202.08235v1 [cs.LG])
    Graph neural networks, as powerful deep learning tools to model graph-structured data, have demonstrated remarkable performance on numerous graph learning tasks. To counter the data noise and data scarcity issues in deep graph learning (DGL), increasing graph data augmentation research has been conducted lately. However, conventional data augmentation methods can hardly handle graph-structured data which is defined on non-Euclidean space with multi-modality. In this survey, we formally formulate the problem of graph data augmentation and further review the representative techniques in this field. Specifically, we first propose a taxonomy for graph data augmentation and then provide a structured review by categorizing the related work based on the augmented information modalities. Focusing on the two challenging problems in DGL (i.e., optimal graph learning and low-resource graph learning), we also discuss and review the existing learning paradigms which are based on graph data augmentation. Finally, we point out a few directions and challenges on promising future works.  ( 2 min )
    Failure and success of the spectral bias prediction for Kernel Ridge Regression: the case of low-dimensional data. (arXiv:2202.03348v2 [cs.LG] UPDATED)
    Recently, several theories including the replica method made predictions for the generalization error of Kernel Ridge Regression. In some regimes, they predict that the method has a `spectral bias': decomposing the true function $f^*$ on the eigenbasis of the kernel, it fits well the coefficients associated with the O(P) largest eigenvalues, where $P$ is the size of the training set. This prediction works very well on benchmark data sets such as images, yet the assumptions these approaches make on the data are never satisfied in practice. To clarify when the spectral bias prediction holds, we first focus on a one-dimensional model where rigorous results are obtained and then use scaling arguments to generalize and test our findings in higher dimensions. Our predictions include the classification case $f(x)=$sign$(x_1)$ with a data distribution that vanishes at the decision boundary $p(x)\sim x_1^{\chi}$. For $\chi>0$ and a Laplace kernel, we find that (i) there exists a cross-over ridge $\lambda^*_{d,\chi}(P)\sim P^{-\frac{1}{d+\chi}}$ such that for $\lambda\gg \lambda^*_{d,\chi}(P)$, the replica method applies, but not for $\lambda\ll\lambda^*_{d,\chi}(P)$, (ii) in the ridge-less case, spectral bias predicts the correct training curve exponent only in the limit $d\rightarrow\infty$.  ( 2 min )
    Direct design of biquad filter cascades with deep learning by sampling random polynomials. (arXiv:2110.03691v2 [eess.SP] UPDATED)
    Designing infinite impulse response filters to match an arbitrary magnitude response requires specialized techniques. Methods like modified Yule-Walker are relatively efficient, but may not be sufficiently accurate in matching high order responses. On the other hand, iterative optimization techniques often enable superior performance, but come at the cost of longer run-times and are sensitive to initial conditions, requiring manual tuning. In this work, we address some of these limitations by learning a direct mapping from the target magnitude response to the filter coefficient space with a neural network trained on millions of random filters. We demonstrate our approach enables both fast and accurate estimation of filter coefficients given a desired response. We investigate training with different families of random filters, and find training with a variety of filter families enables better generalization when estimating real-world filters, using head-related transfer functions and guitar cabinets as case studies. We compare our method against existing methods including modified Yule-Walker and gradient descent and show our approach is, on average, both faster and more accurate.  ( 2 min )
    Quantum Lazy Training. (arXiv:2202.08232v1 [quant-ph])
    In the training of over-parameterized model functions via gradient descent, sometimes the parameters do not change significantly and remain close to their initial values. This phenomenon is called lazy training, and motivates consideration of the linear approximation of the model function around the initial parameters. In the lazy regime, this linear approximation imitates the behavior of the parameterized function whose associated kernel, called the tangent kernel, specifies the training performance of the model. Lazy training is known to occur in the case of (classical) neural networks with large widths. In this paper, we show that the training of geometrically local parameterized quantum circuits enters the lazy regime for large numbers of qubits. More precisely, we prove bounds on the rate of changes of the parameters of such a geometrically local parameterized quantum circuit in the training process, and on the precision of the linear approximation of the associated quantum model function; both of these bounds tend to zero as the number of qubits grows. We support our analytic results with numerical simulations.  ( 2 min )
    On Learning and Enforcing Latent Assessment Models using Binary Feedback from Human Auditors Regarding Black-Box Classifiers. (arXiv:2202.08250v1 [cs.HC])
    Algorithmic fairness literature presents numerous mathematical notions and metrics, and also points to a tradeoff between them while satisficing some or all of them simultaneously. Furthermore, the contextual nature of fairness notions makes it difficult to automate bias evaluation in diverse algorithmic systems. Therefore, in this paper, we propose a novel model called latent assessment model (LAM) to characterize binary feedback provided by human auditors, by assuming that the auditor compares the classifier's output to his or her own intrinsic judgment for each input. We prove that individual and group fairness notions are guaranteed as long as the auditor's intrinsic judgments inherently satisfy the fairness notion at hand, and are relatively similar to the classifier's evaluations. We also demonstrate this relationship between LAM and traditional fairness notions on three well-known datasets, namely COMPAS, German credit and Adult Census Income datasets. Furthermore, we also derive the minimum number of feedback samples needed to obtain PAC learning guarantees to estimate LAM for black-box classifiers. These guarantees are also validated via training standard machine learning algorithms on real binary feedback elicited from 400 human auditors regarding COMPAS.  ( 2 min )
    Augmenting Molecular Deep Generative Models with Topological Data Analysis Representations. (arXiv:2106.04464v2 [physics.chem-ph] UPDATED)
    Deep generative models have emerged as a powerful tool for learning useful molecular representations and designing novel molecules with desired properties, with applications in drug discovery and material design. However, most existing deep generative models are restricted due to lack of spatial information. Here we propose augmentation of deep generative models with topological data analysis (TDA) representations, known as persistence images, for robust encoding of 3D molecular geometry. We show that the TDA augmentation of a character-based Variational Auto-Encoder (VAE) outperforms state-of-the-art generative neural nets in accurately modeling the structural composition of the QM9 benchmark. Generated molecules are valid, novel, and diverse, while exhibiting distinct electronic property distribution, namely higher sample population with small HOMO-LUMO gap. These results demonstrate that TDA features indeed provide crucial geometric signal for learning abstract structures, which is non-trivial for existing generative models operating on string, graph, or 3D point sets to capture.  ( 2 min )
    Multi-Window Data Augmentation Approach for Speech Emotion Recognition. (arXiv:2010.09895v4 [cs.SD] UPDATED)
    We present a Multi-Window Data Augmentation (MWA-SER) approach for speech emotion recognition. MWA-SER is a unimodal approach that focuses on two key concepts; designing the speech augmentation method and building the deep learning model to recognize the underlying emotion of an audio signal. Our proposed multi-window augmentation approach generates additional data samples from the speech signal by employing multiple window sizes in the audio feature extraction process. We show that our augmentation method, combined with a deep learning model, improves speech emotion recognition performance. We evaluate the performance of our approach on three benchmark datasets: IEMOCAP, SAVEE, and RAVDESS. We show that the multi-window model improves the SER performance and outperforms a single-window model. The notion of finding the best window size is an essential step in audio feature extraction. We perform extensive experimental evaluations to find the best window choice and explore the windowing effect for SER analysis.  ( 2 min )
    Deduplicating Training Data Mitigates Privacy Risks in Language Models. (arXiv:2202.06539v2 [cs.CR] UPDATED)
    Past work has shown that large language models are susceptible to privacy attacks, where adversaries generate sequences from a trained model and detect which sequences are memorized from the training set. In this work, we show that the success of these attacks is largely due to duplication in commonly used web-scraped training sets. We first show that the rate at which language models regenerate training sequences is superlinearly related to a sequence's count in the training set. For instance, a sequence that is present 10 times in the training data is on average generated ~1000 times more often than a sequence that is present only once. We next show that existing methods for detecting memorized sequences have near-chance accuracy on non-duplicated training sequences. Finally, we find that after applying methods to deduplicate training data, language models are considerably more secure against these types of privacy attacks. Taken together, our results motivate an increased focus on deduplication in privacy-sensitive applications and a reevaluation of the practicality of existing privacy attacks.  ( 2 min )
    An Intrusion Response System utilizing Deep Q-Networks and System Partitions. (arXiv:2202.08182v1 [cs.CR])
    Intrusion Response is a relatively new field of research. Recent approaches for the creation of Intrusion Response Systems (IRSs) use Reinforcement Learning (RL) as a primary technique for the optimal or near-optimal selection of the proper countermeasure to take in order to stop or mitigate an ongoing attack. However, most of them do not consider the fact that systems can change over time or, in other words, that systems exhibit a non-stationary behavior. Furthermore, stateful approaches, such as those based on RL, suffer the curse of dimensionality, due to a state space growing exponentially with the size of the protected system. In this paper, we introduce and develop an IRS software prototype, named irs-partition. It leverages the partitioning of the protected system and Deep Q-Networks to address the curse of dimensionality by supporting a multi-agent formulation. Furthermore, it exploits transfer learning to follow the evolution of non-stationary systems.  ( 2 min )
    Diversity and Generalization in Neural Network Ensembles. (arXiv:2110.13786v2 [cs.LG] UPDATED)
    Ensembles are widely used in machine learning and, usually, provide state-of-the-art performance in many prediction tasks. From the very beginning, the diversity of an ensemble has been identified as a key factor for the superior performance of these models. But the exact role that diversity plays in ensemble models is poorly understood, specially in the context of neural networks. In this work, we combine and expand previously published results in a theoretically sound framework that describes the relationship between diversity and ensemble performance for a wide range of ensemble methods. More precisely, we provide sound answers to the following questions: how to measure diversity, how diversity relates to the generalization error of an ensemble, and how diversity is promoted by neural network ensemble algorithms. This analysis covers three widely used loss functions, namely, the squared loss, the cross-entropy loss, and the 0-1 loss; and two widely used model combination strategies, namely, model averaging and weighted majority vote. We empirically validate this theoretical analysis with neural network ensembles.
    Efficient, Interpretable Graph Neural Network Representation for Angle-dependent Properties and its Application to Optical Spectroscopy. (arXiv:2109.11576v2 [cs.LG] UPDATED)
    Graph neural networks are attractive for learning properties of atomic structures thanks to their intuitive graph encoding of atoms and bonds. However, conventional encoding does not include angular information, which is critical for describing atomic arrangements in disordered systems. In this work, we extend the recently proposed ALIGNN encoding, which incorporates bond angles, to also include dihedral angles (ALIGNN-d). This simple extension leads to a memory-efficient graph representation that captures the complete geometry of atomic structures. ALIGNN-d is applied to predict the infrared optical response of dynamically disordered Cu(II) aqua complexes, leveraging the intrinsic interpretability to elucidate the relative contributions of individual structural components. Bond and dihedral angles are found to be critical contributors to the fine structure of the absorption response, with distortions representing transitions between more common geometries exhibiting the strongest absorption intensity. Future directions for further development of ALIGNN-d are discussed.
    Dynamically Stable Poincar\'e Embeddings for Neural Manifolds. (arXiv:2112.11172v2 [cs.LG] UPDATED)
    In a Riemannian manifold, the Ricci flow is a partial differential equation for evolving the metric to become more regular. We hope that topological structures from such metrics may be used to assist in the tasks of machine learning. However, this part of the work is still missing. In this paper, we propose Ricci flow assisted Eucl2Hyp2Eucl neural networks that bridge this gap between the Ricci flow and deep neural networks by mapping neural manifolds from the Euclidean space to the dynamically stable Poincar\'e ball and then back to the Euclidean space. As a result, we prove that, if initial metrics have an $L^2$-norm perturbation which deviates from the Hyperbolic metric on the Poincar\'e ball, the scaled Ricci-DeTurck flow of such metrics smoothly and exponentially converges to the Hyperbolic metric. Specifically, the role of the Ricci flow is to serve as naturally evolving to the stable Poincar\'e ball. For such dynamically stable neural manifolds under the Ricci flow, the convergence of neural networks embedded with such manifolds is not susceptible to perturbations. And we show that Ricci flow assisted Eucl2Hyp2Eucl neural networks outperform with their all Euclidean counterparts on image classification tasks.
    Vibration Fault Diagnosis in Wind Turbines based on Automated Feature Learning. (arXiv:2201.13403v2 [cs.LG] UPDATED)
    A growing number of wind turbines are equipped with vibration measurement systems to enable a close monitoring and early detection of developing fault conditions. The vibration measurements are analyzed to continuously assess the component health and prevent failures that can result in downtimes. This study focuses on gearbox monitoring but is applicable also to other subsystems. The current state-of-the-art gearbox fault diagnosis algorithms rely on statistical or machine learning methods based on fault signatures that have been defined by human analysts. This has multiple disadvantages. Defining the fault signatures by human analysts is a time-intensive process that requires highly detailed knowledge of the gearbox composition. This effort needs to be repeated for every new turbine, so it does not scale well with the increasing number of monitored turbines, especially in fast growing portfolios. Moreover, fault signatures defined by human analysts can result in biased and imprecise decision boundaries that lead to imprecise and uncertain fault diagnosis decisions. We present a novel accurate fault diagnosis method for vibration-monitored wind turbine components that overcomes these disadvantages. Our approach combines autonomous data-driven learning of fault signatures and health state classification based on convolutional neural networks and isolation forests. We demonstrate its performance with vibration measurements from two wind turbine gearboxes. Unlike the state-of-the-art methods, our approach does not require gearbox-type specific diagnosis expertise and is not restricted to predefined frequencies or spectral ranges but can monitor the full spectrum at once.
    Finding Dynamics Preserving Adversarial Winning Tickets. (arXiv:2202.06488v2 [cs.LG] UPDATED)
    Modern deep neural networks (DNNs) are vulnerable to adversarial attacks and adversarial training has been shown to be a promising method for improving the adversarial robustness of DNNs. Pruning methods have been considered in adversarial context to reduce model capacity and improve adversarial robustness simultaneously in training. Existing adversarial pruning methods generally mimic the classical pruning methods for natural training, which follow the three-stage 'training-pruning-fine-tuning' pipelines. We observe that such pruning methods do not necessarily preserve the dynamics of dense networks, making it potentially hard to be fine-tuned to compensate the accuracy degradation in pruning. Based on recent works of \textit{Neural Tangent Kernel} (NTK), we systematically study the dynamics of adversarial training and prove the existence of trainable sparse sub-network at initialization which can be trained to be adversarial robust from scratch. This theoretically verifies the \textit{lottery ticket hypothesis} in adversarial context and we refer such sub-network structure as \textit{Adversarial Winning Ticket} (AWT). We also show empirical evidences that AWT preserves the dynamics of adversarial training and achieve equal performance as dense adversarial training.
    Large-Scale Meta-Learning with Continual Trajectory Shifting. (arXiv:2102.07215v3 [cs.LG] UPDATED)
    Meta-learning of shared initialization parameters has shown to be highly effective in solving few-shot learning tasks. However, extending the framework to many-shot scenarios, which may further enhance its practicality, has been relatively overlooked due to the technical difficulties of meta-learning over long chains of inner-gradient steps. In this paper, we first show that allowing the meta-learners to take a larger number of inner gradient steps better captures the structure of heterogeneous and large-scale task distributions, thus results in obtaining better initialization points. Further, in order to increase the frequency of meta-updates even with the excessively long inner-optimization trajectories, we propose to estimate the required shift of the task-specific parameters with respect to the change of the initialization parameters. By doing so, we can arbitrarily increase the frequency of meta-updates and thus greatly improve the meta-level convergence as well as the quality of the learned initializations. We validate our method on a heterogeneous set of large-scale tasks and show that the algorithm largely outperforms the previous first-order meta-learning methods in terms of both generalization performance and convergence, as well as multi-task learning and fine-tuning baselines.
    RotoGrad: Gradient Homogenization in Multitask Learning. (arXiv:2103.02631v3 [cs.LG] UPDATED)
    Multitask learning is being increasingly adopted in applications domains like computer vision and reinforcement learning. However, optimally exploiting its advantages remains a major challenge due to the effect of negative transfer. Previous works have tracked down this issue to the disparities in gradient magnitudes and directions across tasks, when optimizing the shared network parameters. While recent work has acknowledged that negative transfer is a two-fold problem, existing approaches fall short as they only focus on either homogenizing the gradient magnitude across tasks; or greedily change the gradient directions, overlooking future conflicts. In this work, we introduce RotoGrad, an algorithm that tackles negative transfer as a whole: it jointly homogenizes gradient magnitudes and directions, while ensuring training convergence. We show that RotoGrad outperforms competing methods in complex problems, including multi-label classification in CelebA and computer vision tasks in the NYUv2 dataset. A Pytorch implementation can be found in https://github.com/adrianjav/rotograd.
    Unpaired Single-Image Depth Synthesis with cycle-consistent Wasserstein GANs. (arXiv:2103.16938v2 [cs.CV] UPDATED)
    Real-time estimation of actual environment depth is an essential module for various autonomous system tasks such as localization, obstacle detection and pose estimation. During the last decade of machine learning, extensive deployment of deep learning methods to computer vision tasks yielded successful approaches for realistic depth synthesis out of a simple RGB modality. While most of these models rest on paired depth data or availability of video sequences and stereo images, there is a lack of methods facing single-image depth synthesis in an unsupervised manner. Therefore, in this study, latest advancements in the field of generative neural networks are leveraged to fully unsupervised single-image depth synthesis. To be more exact, two cycle-consistent generators for RGB-to-depth and depth-to-RGB transfer are implemented and simultaneously optimized using the Wasserstein-1 distance. To ensure plausibility of the proposed method, we apply the models to a self acquised industrial data set as well as to the renown NYU Depth v2 data set, which allows comparison with existing approaches. The observed success in this study suggests high potential for unpaired single-image depth estimation in real world applications.
    Neural HMMs are all you need (for high-quality attention-free TTS). (arXiv:2108.13320v6 [eess.AS] UPDATED)
    Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.
    FedCG: Leverage Conditional GAN for Protecting Privacy and Maintaining Competitive Performance in Federated Learning. (arXiv:2111.08211v2 [cs.LG] UPDATED)
    Federated learning (FL) aims to protect data privacy by enabling clients to build machine learning models collaboratively without sharing their private data. Recent works demonstrate that information exchanged during FL is subject to gradient-based privacy attacks and, consequently, a variety of privacy-preserving methods have been adopted to thwart such attacks. However, these defensive methods either introduce orders of magnitudes more computational and communication overheads (e.g., with homomorphic encryption) or incur substantial model performance losses in terms of prediction accuracy (e.g., with differential privacy). In this work, we propose $\textsc{FedCG}$, a novel federated learning method that leverages conditional generative adversarial networks to achieve high-level privacy protection while still maintaining competitive model performance. $\textsc{FedCG}$ decomposes each client's local network into a private extractor and a public classifier and keeps the extractor local to protect privacy. Instead of exposing extractors, $\textsc{FedCG}$ shares clients' generators with the server for aggregating clients' shared knowledge aiming to enhance the performance of each client's local networks. Extensive experiments demonstrate that $\textsc{FedCG}$ can achieve competitive model performance compared with FL baselines, and privacy analysis shows that $\textsc{FedCG}$ has a high-level privacy-preserving capability.
    Unsupervised Behaviour Discovery with Quality-Diversity Optimisation. (arXiv:2106.05648v3 [cs.NE] UPDATED)
    Quality-Diversity algorithms refer to a class of evolutionary algorithms designed to find a collection of diverse and high-performing solutions to a given problem. In robotics, such algorithms can be used for generating a collection of controllers covering most of the possible behaviours of a robot. To do so, these algorithms associate a behavioural descriptor to each of these behaviours. Each behavioural descriptor is used for estimating the novelty of one behaviour compared to the others. In most existing algorithms, the behavioural descriptor needs to be hand-coded, thus requiring prior knowledge about the task to solve. In this paper, we introduce: Autonomous Robots Realising their Abilities, an algorithm that uses a dimensionality reduction technique to automatically learn behavioural descriptors based on raw sensory data. The performance of this algorithm is assessed on three robotic tasks in simulation. The experimental results show that it performs similarly to traditional hand-coded approaches without the requirement to provide any hand-coded behavioural descriptor. In the collection of diverse and high-performing solutions, it also manages to find behaviours that are novel with respect to more features than its hand-coded baselines. Finally, we introduce a variant of the algorithm which is robust to the dimensionality of the behavioural descriptor space.
    Xi-Learning: Successor Feature Transfer Learning for General Reward Functions. (arXiv:2110.15701v2 [cs.LG] UPDATED)
    Transfer in Reinforcement Learning aims to improve learning performance on target tasks using knowledge from experienced source tasks. Successor features (SF) are a prominent transfer mechanism in domains where the reward function changes between tasks. They reevaluate the expected return of previously learned policies in a new target task and to transfer their knowledge. A limiting factor of the SF framework is its assumption that rewards linearly decompose into successor features and a reward weight vector. We propose a novel SF mechanism, $\xi$-learning, based on learning the cumulative discounted probability of successor features. Crucially, $\xi$-learning allows to reevaluate the expected return of policies for general reward functions. We introduce two $\xi$-learning variations, prove its convergence, and provide a guarantee on its transfer performance. Experimental evaluations based on $\xi$-learning with function approximation demonstrate the prominent advantage of $\xi$-learning over available mechanisms not only for general reward functions, but also in the case of linearly decomposable reward functions.
    Scenario Adaptive Mixture-of-Experts for Promotion-Aware Click-Through Rate Prediction. (arXiv:2112.13747v2 [cs.LG] UPDATED)
    Promotions are becoming more important and prevalent in e-commerce platforms to attract customers and boost sales. However, Click-Through Rate (CTR) prediction methods in recommender systems are not able to handle such circumstances well since: 1) they can't generalize well to serving because the online data distribution is uncertain due to the potentially upcoming promotions; 2) without paying enough attention to scenario signals, they are incapable of learning different feature representation patterns which coexist in each scenario. In this work, we propose Scenario Adaptive Mixture-of-Experts (SAME), a simple yet effective model that serves both promotion and normal scenarios. Technically, it follows the idea of Mixture-of-Experts by adopting multiple experts to learn feature representations, which are modulated by a Feature Gated Network (FGN) via an attention mechanism. To obtain high-quality representations, we design a Stacked Parallel Attention Unit (SPAU) to help each expert better handle user behavior sequence. To tackle the distribution uncertainty, a set of scenario signals are elaborately devised from a perspective of time series prediction and fed into the FGN, whose output is concatenated with feature representation from each expert to learn the attention. Accordingly, a mixture of the feature representations is obtained scenario-adaptively and used for the final CTR prediction. In this way, each expert can learn a discriminative representation pattern. To the best of our knowledge, this is the first study for promotion-aware CTR prediction. Experimental results on real-world datasets validate the superiority of SAME. Online A/B test also shows SAME achieves significant gains of 3.58% on CTR and 5.94% on IPV during promotion periods as well as 3.93% and 6.57% in normal days, respectively.
    Leveraging Automated Unit Tests for Unsupervised Code Translation. (arXiv:2110.06773v2 [cs.SE] UPDATED)
    With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java $\to$ Python and Python $\to$ C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
    Stochastic Continuous Submodular Maximization: Boosting via Non-oblivious Function. (arXiv:2201.00703v2 [cs.LG] UPDATED)
    In this paper, we revisit Stochastic Continuous Submodular Maximization in both offline and online settings, which can benefit wide applications in machine learning and operations research areas. We present a boosting framework covering gradient ascent and online gradient ascent. The fundamental ingredient of our methods is a novel non-oblivious function $F$ derived from a factor-revealing optimization problem, whose any stationary point provides a $(1-e^{-\gamma})$-approximation to the global maximum of the $\gamma$-weakly DR-submodular objective function $f\in C^{1,1}_L(\mathcal{X})$. Under the offline scenario, we propose a boosting gradient ascent method achieving $(1-e^{-\gamma}-\epsilon^{2})$-approximation after $O(1/\epsilon^2)$ iterations, which improves the $(\frac{\gamma^2}{1+\gamma^2})$ approximation ratio of the classical gradient ascent algorithm. In the online setting, for the first time we consider the adversarial delays for stochastic gradient feedback, under which we propose a boosting online gradient algorithm with the same non-oblivious function $F$. Meanwhile, we verify that this boosting online algorithm achieves a regret of $O(\sqrt{D})$ against a $(1-e^{-\gamma})$-approximation to the best feasible solution in hindsight, where $D$ is the sum of delays of gradient feedback. To the best of our knowledge, this is the first result to obtain $O(\sqrt{T})$ regret against a $(1-e^{-\gamma})$-approximation with $O(1)$ gradient inquiry at each time step, when no delay exists, i.e., $D=T$. Finally, numerical experiments demonstrate the effectiveness of our boosting methods.
    Large-scale ASR Domain Adaptation using Self- and Semi-supervised Learning. (arXiv:2110.00165v5 [eess.AS] UPDATED)
    Self- and semi-supervised learning methods have been actively investigated to reduce labeled training data or enhance the model performance. However, the approach mostly focus on in-domain performance for public datasets. In this study, we utilize the combination of self- and semi-supervised learning methods to solve unseen domain adaptation problem in a large-scale production setting for online ASR model. This approach demonstrates that using the source domain data with a small fraction of the target domain data (3%) can recover the performance gap compared to a full data baseline: relative 13.5% WER improvement for target domain data.
    FairEdit: Preserving Fairness in Graph Neural Networks through Greedy Graph Editing. (arXiv:2201.03681v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have proven to excel in predictive modeling tasks where the underlying data is a graph. However, as GNNs are extensively used in human-centered applications, the issue of fairness has arisen. While edge deletion is a common method used to promote fairness in GNNs, it fails to consider when data is inherently missing fair connections. In this work we consider the unexplored method of edge addition, accompanied by deletion, to promote fairness. We propose two model-agnostic algorithms to perform edge editing: a brute force approach and a continuous approximation approach, FairEdit. FairEdit performs efficient edge editing by leveraging gradient information of a fairness loss to find edges that improve fairness. We find that FairEdit outperforms standard training for many data sets and GNN methods, while performing comparably to many state-of-the-art methods, demonstrating FairEdit's ability to improve fairness across many domains and models.
    Frustratingly Easy Transferability Estimation. (arXiv:2106.09362v2 [cs.LG] UPDATED)
    Transferability estimation has been an essential tool in selecting a pre-trained model and the layers in it for transfer learning, to transfer, so as to maximize the performance on a target task and prevent negative transfer. Existing estimation algorithms either require intensive training on target tasks or have difficulties in evaluating the transferability between layers. To this end, we propose a simple, efficient, and effective transferability measure named TransRate. Through a single pass over examples of a target task, TransRate measures the transferability as the mutual information between features of target examples extracted by a pre-trained model and labels of them. We overcome the challenge of efficient mutual information estimation by resorting to coding rate that serves as an effective alternative to entropy. From the perspective of feature representation, the resulting TransRate evaluates both completeness (whether features contain sufficient information of a target task) and compactness (whether features of each class are compact enough for good generalization) of pre-trained features. Theoretically, we have analyzed the close connection of TransRate to the performance after transfer learning. Despite its extraordinary simplicity in 10 lines of codes, TransRate performs remarkably well in extensive evaluations on 26 pre-trained models and 16 downstream tasks.
    SAUTE RL: Almost Surely Safe Reinforcement Learning Using State Augmentation. (arXiv:2202.06558v2 [cs.LG] UPDATED)
    Satisfying safety constraints almost surely (or with probability one) can be critical for deployment of Reinforcement Learning (RL) in real-life applications. For example, plane landing and take-off should ideally occur with probability one. We address the problem by introducing Safety Augmented (Saute) Markov Decision Processes (MDPs), where the safety constraints are eliminated by augmenting them into the state-space and reshaping the objective. We show that Saute MDP satisfies the Bellman equation and moves us closer to solving Safe RL with constraints satisfied almost surely. We argue that Saute MDP allows to view Safe RL problem from a different perspective enabling new features. For instance, our approach has a plug-and-play nature, i.e., any RL algorithm can be "sauteed". Additionally, state augmentation allows for policy generalization across safety constraints. We finally show that Saute RL algorithms can outperform their state-of-the-art counterparts when constraint satisfaction is of high importance.
    Differential Privacy and Fairness in Decisions and Learning Tasks: A Survey. (arXiv:2202.08187v1 [cs.LG])
    This paper surveys recent work in the intersection of differential privacy (DP) and fairness. It reviews the conditions under which privacy and fairness may have aligned or contrasting goals, analyzes how and why DP may exacerbate bias and unfairness in decision problems and learning tasks, and describes available mitigation measures for the fairness issues arising in DP systems. The survey provides a unified understanding of the main challenges and potential risks arising when deploying privacy-preserving machine-learning or decisions-making tasks under a fairness lens.
    Finite-Bit Quantization For Distributed Algorithms With Linear Convergence. (arXiv:2107.11304v2 [math.OC] UPDATED)
    This paper studies distributed algorithms for (strongly convex) composite optimization problems over mesh networks, subject to quantized communications. Instead of focusing on a specific algorithmic design, a black-box model is proposed, casting linearly-convergent distributed algorithms in the form of fixed-point iterates. While most existing quantization rules, such as the popular compression rule, rely on some form of communication of scalar signals (in practice quantized at the machine precision), this paper considers regimes operating under limited communication budgets, where communication at machine precision is not viable. To address this challenge, the algorithmic model is coupled with a novel random or deterministic Biased Compression (BC-)rule on the quantizer design as well as with a new Adaptive range Non-uniform Quantizer (ANQ) and communication-efficient encoding scheme, which implement the BC-rule using a finite number of bits (below machine precision). A unified communication complexity analysis is developed for the black-box model, determining the average number of bits required to reach a solution of the optimization problem within a target accuracy. In particular, it is shown that the proposed BC-rule preserves linear convergence of the unquantized algorithms, and a trade-off between convergence rate and communication cost under quantization is characterized. Numerical results validate our theoretical findings and show that distributed algorithms equipped with the proposed ANQ have more favorable communication complexity than algorithms using state-of-the-art quantization rules.
    Tracking Most Significant Arm Switches in Bandits. (arXiv:2112.13838v4 [cs.LG] UPDATED)
    In bandit with distribution shifts, one aims to automatically adapt to unknown changes in reward distribution, and restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also involving large aggregate differences in reward overtime. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences overtime). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.
    Achieving Zero Constraint Violation for Concave Utility Constrained Reinforcement Learning via Primal-Dual Approach. (arXiv:2109.06332v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is widely used in applications where one needs to perform sequential decision-making while interacting with the environment. The standard RL problem with safety constraints can be formulated via constrained Markov Decision Processes (CMDP), which is linear in objective and constraints in occupancy measure space, where the problem becomes challenging in the case where model is unknown apriori. The problem further becomes challenging when the decision requirement includes optimizing a concave utility while satisfying some nonlinear safety constraints (we call this problem concave CMDP (CCMDP)). In the literature, various algorithms are available to solve CMDP problems in a model-free manner to achieve $\epsilon$-optimal cumulative reward with $\epsilon$ feasible policies. An $\epsilon$-feasible policy implies that it suffers from constraint violation. An important question here is whether we can achieve $\epsilon$-optimal objective with zero constraint violations or not. To answer that in affirmative, we advocate the use of randomized primal-dual approach to solve the CCMDP problems and propose a conservative stochastic primal-dual algorithm (CSPDA). We prove that CSPDA exhibit $\tilde{\mathcal{O}}\left(1/\epsilon^2\right)$ sample complexity to achieve $\epsilon$-optimal concave utility with zero constraint violations for the CCMDP, and CMDP as well which is a special case. In the prior works, the best available sample complexity for the $\epsilon$-optimal policy for CMDP with zero constraint violation is $\tilde{\mathcal{O}}\left(1/\epsilon^5\right)$. Hence, the proposed algorithm provides a significant improvement as compared to the state of the art.
    Self-Supervised Transformer for Sparse and Irregularly Sampled Multivariate Clinical Time-Series. (arXiv:2107.14293v2 [cs.LG] UPDATED)
    Multivariate time-series data are frequently observed in critical care settings and are typically characterized by sparsity (missing information) and irregular time intervals. Existing approaches for learning representations in this domain handle these challenges by either aggregation or imputation of values, which in-turn suppresses the fine-grained information and adds undesirable noise/overhead into the machine learning model. To tackle this problem, we propose a Self-supervised Transformer for Time-Series (STraTS) model which overcomes these pitfalls by treating time-series as a set of observation triplets instead of using the standard dense matrix representation. It employs a novel Continuous Value Embedding technique to encode continuous time and variable values without the need for discretization. It is composed of a Transformer component with multi-head attention layers which enable it to learn contextual triplet embeddings while avoiding the problems of recurrence and vanishing gradients that occur in recurrent architectures. In addition, to tackle the problem of limited availability of labeled data (which is typically observed in many healthcare applications), STraTS utilizes self-supervision by leveraging unlabeled data to learn better representations by using time-series forecasting as an auxiliary proxy task. Experiments on real-world multivariate clinical time-series benchmark datasets demonstrate that STraTS has better prediction performance than state-of-the-art methods for mortality prediction, especially when labeled data is limited. Finally, we also present an interpretable version of STraTS which can identify important measurements in the time-series data. Our data preprocessing and model implementation codes are available at https://github.com/sindhura97/STraTS.
    Toward Development of Machine Learned Techniques for Production of Compact Kinetic Models. (arXiv:2202.08021v1 [physics.chem-ph])
    Chemical kinetic models are an essential component in the development and optimisation of combustion devices through their coupling to multi-dimensional simulations such as computational fluid dynamics (CFD). Low-dimensional kinetic models which retain good fidelity to the reality are needed, the production of which requires considerable human-time cost and expert knowledge. Here, we present a novel automated compute intensification methodology to produce overly-reduced and optimised (compact) chemical kinetic models. This algorithm, termed Machine Learned Optimisation of Chemical Kinetics (MLOCK), systematically perturbs each of the four sub-models of a chemical kinetic model to discover what combinations of terms results in a good model. A virtual reaction network comprised of n species is first obtained using conventional mechanism reduction. To counteract the imposed decrease in model performance, the weights (virtual reaction rate constants) of important connections (virtual reactions) between each node (species) of the virtual reaction network are numerically optimised to replicate selected calculations across four sequential phases. The first version of MLOCK, (MLOCK1.0) simultaneously perturbs all three virtual Arrhenius reaction rate constant parameters for important connections and assesses the suitability of the new parameters through objective error functions, which quantify the error in each compact model candidate's calculation of the optimisation targets, which may be comprised of detailed model calculations and/or experimental data. MLOCK1.0 is demonstrated by creating compact models for the archetypal case of methane air combustion. It is shown that the NUGMECH1.0 detailed model comprised of 2,789 species is reliably compacted to 15 species (nodes), whilst retaining an overall fidelity of ~87% to the detailed model calculations, outperforming the prior state-of-art.  ( 2 min )
    TUTOR: Training Neural Networks Using Decision Rules as Model Priors. (arXiv:2010.05429v3 [cs.NE] UPDATED)
    The human brain has the ability to carry out new tasks with limited experience. It utilizes prior learning experiences to adapt the solution strategy to new domains. On the other hand, deep neural networks (DNNs) generally need large amounts of data and computational resources for training. However, this requirement is not met in many settings. To address these challenges, we propose the TUTOR DNN synthesis framework. TUTOR targets tabular datasets. It synthesizes accurate DNN models with limited available data and reduced memory/computational requirements. It consists of three sequential steps. The first step involves generation, verification, and labeling of synthetic data. The synthetic data generation module targets both the categorical and continuous features. TUTOR generates the synthetic data from the same probability distribution as the real data. It then verifies the integrity of the generated synthetic data using a semantic integrity classifier module. It labels the synthetic data based on a set of rules extracted from the real dataset. Next, TUTOR uses two training schemes that combine synthetic and training data to learn the parameters of the DNN model. These two schemes focus on two different ways in which synthetic data can be used to derive a prior on the model parameters and, hence, provide a better DNN initialization for training with real data. In the third step, TUTOR employs a grow-and-prune synthesis paradigm to learn both the weights and the architecture of the DNN to reduce model size while ensuring its accuracy. We evaluate the performance of TUTOR on nine datasets of various sizes. We show that in comparison to fully connected DNNs, TUTOR, on an average, reduces the need for data by 5.9x, improves accuracy by 3.4%, and reduces the number of parameters (fFLOPs) by 4.7x (4.3x). Thus, TUTOR enables a less data-hungry, more accurate, and more compact DNN synthesis.  ( 3 min )
    Distribution-free binary classification: prediction sets, confidence intervals and calibration. (arXiv:2006.10564v4 [stat.ML] UPDATED)
    We study three notions of uncertainty quantification -- calibration, confidence intervals and prediction sets -- for binary classification in the distribution-free setting, that is without making any distributional assumptions on the data. With a focus towards calibration, we establish a 'tripod' of theorems that connect these three notions for score-based classifiers. A direct implication is that distribution-free calibration is only possible, even asymptotically, using a scoring function whose level sets partition the feature space into at most countably many sets. Parametric calibration schemes such as variants of Platt scaling do not satisfy this requirement, while nonparametric schemes based on binning do. To close the loop, we derive distribution-free confidence intervals for binned probabilities for both fixed-width and uniform-mass binning. As a consequence of our 'tripod' theorems, these confidence intervals for binned probabilities lead to distribution-free calibration. We also derive extensions to settings with streaming data and covariate shift.  ( 2 min )
    Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module. (arXiv:2202.08164v1 [eess.AS])
    State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker. It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual shift in the existing TTS paradigm, framing the few-shot TTS problem as a VC task. Furthermore, we propose to use a duration-controllable TTS system to create a parallel speech corpus to facilitate the VC task. Results show that the Voice Filter outperforms state-of-the-art few-shot speech synthesis techniques in terms of objective and subjective metrics on one minute of speech on a diverse set of voices, while being competitive against a TTS model built on 30 times more data.  ( 2 min )
    Latent Outlier Exposure for Anomaly Detection with Contaminated Data. (arXiv:2202.08088v1 [cs.LG])
    Anomaly detection aims at identifying data points that show systematic deviations from the majority of data in an unlabeled dataset. A common assumption is that clean training data (free of anomalies) is available, which is often violated in practice. We propose a strategy for training an anomaly detector in the presence of unlabeled anomalies that is compatible with a broad class of models. The idea is to jointly infer binary labels to each datum (normal vs. anomalous) while updating the model parameters. Inspired by outlier exposure (Hendrycks et al., 2018) that considers synthetically created, labeled anomalies, we thereby use a combination of two losses that share parameters: one for the normal and one for the anomalous data. We then iteratively proceed with block coordinate updates on the parameters and the most likely (latent) labels. Our experiments with several backbone models on three image datasets, 30 tabular data sets, and a video anomaly detection benchmark showed consistent and significant improvements over the baselines.  ( 2 min )
    Tetrad: Actively Secure 4PC for Secure Training and Inference. (arXiv:2106.02850v3 [cs.CR] UPDATED)
    Mixing arithmetic and boolean circuits to perform privacy-preserving machine learning has become increasingly popular. Towards this, we propose a framework for the case of four parties with at most one active corruption called Tetrad. Tetrad works over rings and supports two levels of security, fairness and robustness. The fair multiplication protocol costs 5 ring elements, improving over the state-of-the-art Trident (Chaudhari et al. NDSS'20). A key feature of Tetrad is that robustness comes for free over fair protocols. Other highlights across the two variants include (a) probabilistic truncation without overhead, (b) multi-input multiplication protocols, and (c) conversion protocols to switch between the computational domains, along with a tailor-made garbled circuit approach. Benchmarking of Tetrad for both training and inference is conducted over deep neural networks such as LeNet and VGG16. We found that Tetrad is up to 4 times faster in ML training and up to 5 times faster in ML inference. Tetrad is also lightweight in terms of deployment cost, costing up to 6 times less than Trident.  ( 2 min )
    SemiRetro: Semi-template framework boosts deep retrosynthesis prediction. (arXiv:2202.08205v1 [cs.LG])
    Recently, template-based (TB) and template-free (TF) molecule graph learning methods have shown promising results to retrosynthesis. TB methods are more accurate using pre-encoded reaction templates, and TF methods are more scalable by decomposing retrosynthesis into subproblems, i.e., center identification and synthon completion. To combine both advantages of TB and TF, we suggest breaking a full-template into several semi-templates and embedding them into the two-step TF framework. Since many semi-templates are reduplicative, the template redundancy can be reduced while the essential chemical knowledge is still preserved to facilitate synthon completion. We call our method SemiRetro, introduce a new GNN layer (DRGAT) to enhance center identification, and propose a novel self-correcting module to improve semi-template classification. Experimental results show that SemiRetro significantly outperforms both existing TB and TF methods. In scalability, SemiRetro covers 98.9\% data using 150 semi-templates, while previous template-based GLN requires 11,647 templates to cover 93.3\% data. In top-1 accuracy, SemiRetro exceeds template-free G2G 4.8\% (class known) and 6.0\% (class unknown). Besides, SemiRetro has better training efficiency than existing methods.  ( 2 min )
    Safe Active Dynamics Learning and Control: A Sequential Exploration-Exploitation Framework. (arXiv:2008.11700v4 [cs.RO] UPDATED)
    Safe deployment of autonomous robots in diverse scenarios requires agents that are capable of efficiently adapting to new environments while satisfying constraints. In this work, we propose a practical and theoretically-justified approach to maintaining safety in the presence of dynamics uncertainty. Our approach leverages Bayesian meta-learning with last-layer adaptation. The expressiveness of neural-network features trained offline, paired with efficient last-layer online adaptation, enables the derivation of tight confidence sets which contract around the true dynamics as the model adapts online. We exploit these confidence sets to plan trajectories that guarantee the safety of the system. Our approach handles problems with high dynamics uncertainty, where reaching the goal safely is potentially initially infeasible, by first \textit{exploring} to gather data and reduce uncertainty, before autonomously \textit{exploiting} the acquired information to safely perform the task. Under reasonable assumptions, we prove that our framework guarantees the high-probability satisfaction of all constraints at all times jointly, i.e. over the total task duration. This theoretical analysis also motivates two regularizers of last-layer meta-learning models that improve online adaptation capabilities as well as performance by reducing the size of the confidence sets. We extensively demonstrate our approach in simulation and on hardware.  ( 2 min )
    Wasserstein Graph Neural Networks for Graphs with Missing Attributes. (arXiv:2102.03450v2 [cs.LG] UPDATED)
    Missing node attributes is a common problem in real-world graphs. Graph neural networks have been demonstrated power in graph representation learning while their performance is affected by the completeness of graph information. Most of them are not specified for missing-attribute graphs and fail to leverage incomplete attribute information effectively. In this paper, we propose an innovative node representation learning framework, Wasserstein Graph Neural Network (WGNN), to mitigate the problem. To make the most of limited observed attribute information and capture the uncertainty caused by missing values, we express nodes as low-dimensional distributions derived from the decomposition of the attribute matrix. Furthermore, we strengthen the expressiveness of representations by developing a novel message passing schema that aggregates distributional information from neighbors in the Wasserstein space. We test WGNN in node classification tasks under two missing-attribute cases on both synthetic and real-world datasets. In addition, we find WGNN suitable to recover missing values and adapt them to tackle matrix completion problems with graphs of users and items. Experimental results on both tasks demonstrate the superiority of our method.  ( 2 min )
    The Adversarial Security Mitigations of mmWave Beamforming Prediction Models using Defensive Distillation and Adversarial Retraining. (arXiv:2202.08185v1 [cs.CR])
    The design of a security scheme for beamforming prediction is critical for next-generation wireless networks (5G, 6G, and beyond). However, there is no consensus about protecting the beamforming prediction using deep learning algorithms in these networks. This paper presents the security vulnerabilities in deep learning for beamforming prediction using deep neural networks (DNNs) in 6G wireless networks, which treats the beamforming prediction as a multi-output regression problem. It is indicated that the initial DNN model is vulnerable against adversarial attacks, such as Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and Momentum Iterative Method (MIM), because the initial DNN model is sensitive to the perturbations of the adversarial samples of the training data. This study also offers two mitigation methods, such as adversarial training and defensive distillation, for adversarial attacks against artificial intelligence (AI)-based models used in the millimeter-wave (mmWave) beamforming prediction. Furthermore, the proposed scheme can be used in situations where the data are corrupted due to the adversarial examples in the training data. Experimental results show that the proposed methods effectively defend the DNN models against adversarial attacks in next-generation wireless networks.  ( 2 min )
    Bayesian Learning with Wasserstein Barycenters. (arXiv:1805.10833v4 [stat.ML] UPDATED)
    Based on recent developments in optimal transport theory, we propose a novel model-selection strategy for Bayesian learning. More precisely, the goal of this paper is to introduce the Wasserstein barycenter of the posterior law on models, as a Bayesian predictive posterior, alternative to classical choices such as the maximum a posteriori and the model average Bayesian estimators. After formulating the general problem of Bayesian model selection in a common, parameter-free framework, we exhibit conditions granting the existence and statistical consistency of this estimator, discuss some of its general and specific properties, and provide insight into its theoretical advantages. Furthermore, we illustrate how it can be computed using the theoretical stochastic gradient descent (SGD) algorithm in Wasserstein space introduced in a companion paper arXiv:2201.04232v2 [math.OC] , and provide a numerical example for experimental validation of the proposed method.  ( 2 min )
    Group Fairness in Bandit Arm Selection. (arXiv:1912.03802v3 [cs.LG] UPDATED)
    We propose a novel formulation of group fairness with biased feedback in the contextual multi-armed bandit (CMAB) setting. In the CMAB setting, a sequential decision maker must, at each time step, choose an arm to pull from a finite set of arms after observing some context for each of the potential arm pulls. In our model, arms are partitioned into two or more sensitive groups based on some protected feature(s) (e.g., age, race, or socio-economic status). Initial rewards received from pulling an arm may be distorted due to some unknown societal or measurement bias. We assume that in reality these groups are equal despite the biased feedback received by the agent. To alleviate this, we learn a societal bias term which can be used to both find the source of bias and to potentially fix the problem outside of the algorithm. We provide a novel algorithm that can accommodate this notion of fairness for an arbitrary number of groups, and provide a theoretical bound on the regret for our algorithm. We validate our algorithm using synthetic data and two real-world datasets for intervention settings wherein we want to allocate resources fairly across groups.  ( 2 min )
    Improved analysis of randomized SVD for top-eigenvector approximation. (arXiv:2202.07992v1 [cs.LG])
    Computing the top eigenvectors of a matrix is a problem of fundamental interest to various fields. While the majority of the literature has focused on analyzing the reconstruction error of low-rank matrices associated with the retrieved eigenvectors, in many applications one is interested in finding one vector with high Rayleigh quotient. In this paper we study the problem of approximating the top-eigenvector. Given a symmetric matrix $\mathbf{A}$ with largest eigenvalue $\lambda_1$, our goal is to find a vector \hu that approximates the leading eigenvector $\mathbf{u}_1$ with high accuracy, as measured by the ratio $R(\hat{\mathbf{u}})=\lambda_1^{-1}{\hat{\mathbf{u}}^T\mathbf{A}\hat{\mathbf{u}}}/{\hat{\mathbf{u}}^T\hat{\mathbf{u}}}$. We present a novel analysis of the randomized SVD algorithm of \citet{halko2011finding} and derive tight bounds in many cases of interest. Notably, this is the first work that provides non-trivial bounds of $R(\hat{\mathbf{u}})$ for randomized SVD with any number of iterations. Our theoretical analysis is complemented with a thorough experimental study that confirms the efficiency and accuracy of the method.  ( 2 min )
    Planckian jitter: enhancing the color quality of self-supervised visual representations. (arXiv:2202.07993v1 [cs.CV])
    Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The set of used data augmentations is of crucial importance for the quality of the learned feature representation. We analyze how the traditionally used color jitter negatively impacts the quality of the color features in the learned feature representation. To address this problem, we replace this module with physics-based color augmentation, called Planckian jitter, which creates realistic variations in chromaticity, producing a model robust to llumination changes that can be commonly observed in real life, while maintaining the ability to discriminate the image content based on color information. We further improve the performance by introducing a latent space combination of color-sensitive and non-color-sensitive features. These are found to be complementary and the combination leads to large absolute performance gains over the default data augmentation on color classification tasks, including on Flowers-102 (+15%), Cub200 (+11%), VegFru (+15%), and T1K+ (+12%). Finally, we present a color sensitivity analysis to document the impact of different training methods on the model neurons and we show that the performance of the learned features is robust with respect to illuminant variations.  ( 2 min )
    Deep Similarity Learning for Sports Team Ranking. (arXiv:2103.13736v2 [cs.LG] UPDATED)
    Sports data is more readily available and consequently, there has been an increase in the amount of sports analysis, predictions and rankings in the literature. Sports are unique in their respective stochastic nature, making analysis, and accurate predictions valuable to those involved in the sport. In response, we focus on Siamese Neural Networks (SNN) in unison with LightGBM and XGBoost models, to predict the importance of matches and to rank teams in Rugby and Basketball. Six models were developed and compared, a LightGBM, a XGBoost, a LightGBM (Contrastive Loss), LightGBM (Triplet Loss), a XGBoost (Contrastive Loss), XGBoost (Triplet Loss). The models that utilise a Triplet loss function perform better than those using Contrastive loss. It is clear LightGBM (Triplet loss) is the most effective model in ranking the NBA, producing a state of the art (SOTA) mAP (0.867) and NDCG (0.98) respectively. The SNN (Triplet loss) most effectively predicted the Super 15 Rugby, yielding the SOTA mAP (0.921), NDCG (0.983), and $r_s$ (0.793). Triplet loss produces the best overall results displaying the value of learning representations/embeddings for prediction and ranking of sports. Overall there is not a single consistent best performing model across the two sports indicating that other Ranking models should be considered in the future.  ( 2 min )
    Federated Neural Collaborative Filtering. (arXiv:2106.04405v2 [cs.IR] UPDATED)
    In this work, we present a federated version of the state-of-the-art Neural Collaborative Filtering (NCF) approach for item recommendations. The system, named FedNCF, enables learning without requiring users to disclose or transmit their raw data. Data localization preserves data privacy and complies with regulations such as the GDPR. Although federated learning enables model training without local data dissemination, the transmission of raw clients' updates raises additional privacy issues. To address this challenge, we incorporate a privacy-preserving aggregation method that satisfies the security requirements against an honest but curious entity. We argue theoretically and experimentally that existing aggregation algorithms are inconsistent with latent factor model updates. We propose an enhancement by decomposing the aggregation step into matrix factorization and neural network-based averaging. Experimental validation shows that FedNCF achieves comparable recommendation quality to the original NCF system, while our proposed aggregation leads to faster convergence compared to existing methods. We investigate the effectiveness of the federated recommender system and evaluate the privacy-preserving mechanism in terms of computational cost.  ( 2 min )
    Distributed k-Means with Outliers in General Metrics. (arXiv:2202.08173v1 [cs.DC])
    Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is undoubtedly the k-means problem, which, given a set $P$ of points from a metric space and a parameter $k<|P|$, requires to determine a subset $S$ of $k$ centers minimizing the sum of all squared distances of points in $P$ from their closest center. A more general formulation, known as k-means with $z$ outliers, introduced to deal with noisy datasets, features a further parameter $z$ and allows up to $z$ points of $P$ (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with $z$ outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term $O(\gamma)$ away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where $\gamma$ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.  ( 2 min )
    Backdoor Learning: A Survey. (arXiv:2007.08745v5 [cs.CR] UPDATED)
    Backdoor attack intends to embed hidden backdoor into deep neural networks (DNNs), so that the attacked models perform well on benign samples, whereas their predictions will be maliciously changed if the hidden backdoor is activated by attacker-specified triggers. This threat could happen when the training process is not fully controlled, such as training on third-party datasets or adopting third-party models, which poses a new and realistic threat. Although backdoor learning is an emerging and rapidly growing research area, its systematic review, however, remains blank. In this paper, we present the first comprehensive survey of this realm. We summarize and categorize existing backdoor attacks and defenses based on their characteristics, and provide a unified framework for analyzing poisoning-based backdoor attacks. Besides, we also analyze the relation between backdoor attacks and relevant fields ($i.e.,$ adversarial attacks and data poisoning), and summarize widely adopted benchmark datasets. Finally, we briefly outline certain future research directions relying upon reviewed works. A curated list of backdoor-related resources is also available at \url{https://github.com/THUYimingLi/backdoor-learning-resources}.  ( 2 min )
    Should You Mask 15% in Masked Language Modeling?. (arXiv:2202.08005v1 [cs.CL])
    Masked language models conventionally use a masking rate of 15% due to the belief that more masking would provide insufficient context to learn good representations, and less masking would make training too expensive. Surprisingly, we find that masking up to 40% of input tokens can outperform the 15% baseline, and even masking 80% can preserve most of the performance, as measured by fine-tuning on downstream tasks. Increasing the masking rates has two distinct effects, which we investigate through careful ablations: (1) A larger proportion of input tokens are corrupted, reducing the context size and creating a harder task, and (2) models perform more predictions, which benefits training. We observe that larger models in particular favor higher masking rates, as they have more capacity to perform the harder task. We also connect our findings to sophisticated masking schemes such as span masking and PMI masking, as well as BERT's curious 80-10-10 corruption strategy, and find that simple uniform masking with [MASK] replacements can be competitive at higher masking rates. Our results contribute to a better understanding of masked language modeling and point to new avenues for efficient pre-training.  ( 2 min )
    The Fine-Grained Hardness of Sparse Linear Regression. (arXiv:2106.03131v2 [cs.LG] UPDATED)
    Sparse linear regression is the well-studied inference problem where one is given a design matrix $\mathbf{A} \in \mathbb{R}^{M\times N}$ and a response vector $\mathbf{b} \in \mathbb{R}^M$, and the goal is to find a solution $\mathbf{x} \in \mathbb{R}^{N}$ which is $k$-sparse (that is, it has at most $k$ non-zero coordinates) and minimizes the prediction error $\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2$. On the one hand, the problem is known to be $\mathcal{NP}$-hard which tells us that no polynomial-time algorithm exists unless $\mathcal{P} = \mathcal{NP}$. On the other hand, the best known algorithms for the problem do a brute-force search among $N^k$ possibilities. In this work, we show that there are no better-than-brute-force algorithms, assuming any one of a variety of popular conjectures including the weighted $k$-clique conjecture from the area of fine-grained complexity, or the hardness of the closest vector problem from the geometry of numbers. We also show the impossibility of better-than-brute-force algorithms when the prediction error is measured in other $\ell_p$ norms, assuming the strong exponential-time hypothesis.  ( 2 min )
    Using the left Gram matrix to cluster high dimensional data. (arXiv:2202.08236v1 [stat.ML])
    For high dimensional data, where P features for N objects (P >> N) are represented in an NxP matrix X, we describe a clustering algorithm based on the normalized left Gram matrix, G = XX'/P. Under certain regularity conditions, the rows in G that correspond to objects in the same cluster converge to the same mean vector. By clustering on the row means, the algorithm does not require preprocessing by dimension reduction or feature selection techniques and does not require specification of tuning or hyperparameter values. Because it is based on the NxN matrix G, it has a lower computational cost than many methods based on clustering the feature matrix X. When compared to 14 other clustering algorithms applied to 32 benchmarked microarray datasets, the proposed algorithm provided the most accurate estimate of the underlying cluster configuration more than twice as often as its closest competitors.  ( 2 min )
    Unsupervised Ground Metric Learning using Wasserstein Singular Vectors. (arXiv:2102.06278v2 [stat.ML] UPDATED)
    Defining meaningful distances between samples, which are columns in a data matrix, is a fundamental problem in machine learning. Optimal Transport (OT) defines geometrically meaningful "Wasserstein" distances between probability distributions. However, a key bottleneck is the design of a "ground" cost which should be adapted to the task under study. OT is parametrized by a distance between the features (the rows of the data matrix): the "ground cost". However, there is usually no straightforward choice of distance on the features, and supervised metric learning is not possible either, leaving only ad-hoc approaches. Unsupervised metric learning is thus a fundamental problem to enable data-driven applications of OT. In this paper, we propose for the first time a canonical answer by simultaneously computing an OT distance between the rows and between the columns of a data matrix. These distance matrices emerge naturally as positive singular vectors of the function mapping ground costs to pairwise OT distances. We provide criteria to ensure the existence and uniqueness of these singular vectors. We then introduce scalable computational methods to approximate them in high-dimensional settings, using entropic regularization and stochastic approximation. First, we extend the definition using entropic regularization, and show that in the large regularization limit it operates a principal component analysis dimensionality reduction. Next, we propose a stochastic approximation scheme and study its convergence. Finally, we showcase Wasserstein Singular Vectors in the context of computational biology on a high-dimensional single-cell RNA-sequencing dataset.  ( 2 min )
    Distributionally Robust Weighted $k$-Nearest Neighbors. (arXiv:2006.04004v5 [stat.ML] UPDATED)
    Learning a robust classifier from a few samples remains a key challenge in machine learning. A major thrust of research has been focused on developing $k$-nearest neighbor ($k$-NN) based algorithms combined with metric learning that captures similarities between samples. When the samples are limited, robustness is especially crucial to ensure the generalization capability of the classifier. In this paper, we study a minimax distributionally robust formulation of weighted $k$-nearest neighbors, which aims to find the optimal weighted $k$-NN classifiers that hedge against feature uncertainties. We develop an algorithm, \texttt{Dr.k-NN}, that efficiently solves this functional optimization problem and features in assigning minimax optimal weights to training samples when performing classification. These weights are class-dependent, and are determined by the similarities of sample features under the least favorable scenarios. When the size of the uncertainty set is properly tuned, the robust classifier has a smaller Lipschitz norm than the vanilla $k$-NN, and thus improves the generalization capability. We also couple our framework with neural-network-based feature embedding. We demonstrate the competitive performance of our algorithm compared to the state-of-the-art in the few-training-sample setting with various real-data experiments.  ( 2 min )
    When Does A Spectral Graph Neural Network Fail in Node Classification?. (arXiv:2202.07902v1 [cs.LG])
    Spectral Graph Neural Networks (GNNs) with various graph filters have received extensive affirmation due to their promising performance in graph learning problems. However, it is known that GNNs do not always perform well. Although graph filters provide theoretical foundations for model explanations, it is unclear when a spectral GNN will fail. In this paper, focusing on node classification problems, we conduct a theoretical analysis of spectral GNNs performance by investigating their prediction error. With the aid of graph indicators including homophily degree and response efficiency we proposed, we establish a comprehensive understanding of complex relationships between graph structure, node labels, and graph filters. We indicate that graph filters with low response efficiency on label difference are prone to fail. To enhance GNNs performance, we provide a provably better strategy for filter design from our theoretical analysis - using data-driven filter banks, and propose simple models for empirical validation. Experimental results show consistency with our theoretical results and support our strategy.  ( 2 min )
    Analysis of Random Sequential Message Passing Algorithms for Approximate Inference. (arXiv:2202.08198v1 [cs.LG])
    We analyze the dynamics of a random sequential message passing algorithm for approximate inference with large Gaussian latent variable models in a student-teacher scenario. To model nontrivial dependencies between the latent variables, we assume random covariance matrices drawn from rotation invariant ensembles. Moreover, we consider a model mismatching setting, where the teacher model and the one used by the student may be different. By means of dynamical functional approach, we obtain exact dynamical mean-field equations characterizing the dynamics of the inference algorithm. We also derive a range of model parameters for which the sequential algorithm does not converge. The boundary of this parameter range coincides with the de Almeida Thouless (AT) stability condition of the replica symmetric ansatz for the static probabilistic model.  ( 2 min )
    Heterogeneous Graph Learning for Explainable Recommendation over Academic Networks. (arXiv:2202.07832v1 [cs.SI])
    With the explosive growth of new graduates with research degrees every year, unprecedented challenges arise for early-career researchers to find a job at a suitable institution. This study aims to understand the behavior of academic job transition and hence recommend suitable institutions for PhD graduates. Specifically, we design a deep learning model to predict the career move of early-career researchers and provide suggestions. The design is built on top of scholarly/academic networks, which contains abundant information about scientific collaboration among scholars and institutions. We construct a heterogeneous scholarly network to facilitate the exploring of the behavior of career moves and the recommendation of institutions for scholars. We devise an unsupervised learning model called HAI (Heterogeneous graph Attention InfoMax) which aggregates attention mechanism and mutual information for institution recommendation. Moreover, we propose scholar attention and meta-path attention to discover the hidden relationships between several meta-paths. With these mechanisms, HAI provides ordered recommendations with explainability. We evaluate HAI upon a real-world dataset against baseline methods. Experimental results verify the effectiveness and efficiency of our approach.  ( 2 min )
    Diagnosing Batch Normalization in Class Incremental Learning. (arXiv:2202.08025v1 [cs.LG])
    Extensive researches have applied deep neural networks (DNNs) in class incremental learning (Class-IL). As building blocks of DNNs, batch normalization (BN) standardizes intermediate feature maps and has been widely validated to improve training stability and convergence. However, we claim that the direct use of standard BN in Class-IL models is harmful to both the representation learning and the classifier training, thus exacerbating catastrophic forgetting. In this paper we investigate the influence of BN on Class-IL models by illustrating such BN dilemma. We further propose BN Tricks to address the issue by training a better feature extractor while eliminating classification bias. Without inviting extra hyperparameters, we apply BN Tricks to three baseline rehearsal-based methods, ER, DER++ and iCaRL. Through comprehensive experiments conducted on benchmark datasets of Seq-CIFAR-10, Seq-CIFAR-100 and Seq-Tiny-ImageNet, we show that BN Tricks can bring significant performance gains to all adopted baselines, revealing its potential generality along this line of research.  ( 2 min )
    Out-Of-Distribution Generalization on Graphs: A Survey. (arXiv:2202.07987v1 [cs.LG])
    Graph machine learning has been extensively studied in both academia and industry. Although booming with a vast number of emerging methods and techniques, most of the literature is built on the I.I.D. hypothesis, i.e., testing and training graph data are independent and identically distributed. However, this I.I.D. hypothesis can hardly be satisfied in many real-world graph scenarios where the model performance substantially degrades when there exist distribution shifts between testing and training graph data. To solve this critical problem, out-of-distribution (OOD) generalization on graphs, which goes beyond the I.I.D. hypothesis, has made great progress and attracted ever-increasing attention from the research community. In this paper, we comprehensively survey OOD generalization on graphs and present a detailed review of recent advances in this area. First, we provide a formal problem definition of OOD generalization on graphs. Second, we categorize existing methods into three classes from conceptually different perspectives, i.e., data, model, and learning strategy, based on their positions in the graph machine learning pipeline, followed by detailed discussions for each category. We also review the theories related to OOD generalization on graphs and introduce the commonly used graph datasets for thorough evaluations. Last but not least, we share our insights on future research directions. This paper is the first systematic and comprehensive review of OOD generalization on graphs, to the best of our knowledge.  ( 2 min )
    Deep Koopman Operator with Control for Nonlinear Systems. (arXiv:2202.08004v1 [cs.RO])
    Recently Koopman operator has become a promising data-driven tool to facilitate real-time control for unknown nonlinear systems. It maps nonlinear systems into equivalent linear systems in embedding space, ready for real-time linear control methods. However, designing an appropriate Koopman embedding function remains a challenging task. Furthermore, most Koopman-based algorithms only consider nonlinear systems with linear control input, resulting in lousy prediction and control performance when the system is fully nonlinear with the control input. In this work, we propose an end-to-end deep learning framework to learn the Koopman embedding function and Koopman Operator together to alleviate such difficulties. We first parameterize the embedding function and Koopman Operator with the neural network and train them end-to-end with the K-steps loss function. We then design an auxiliary control network to encode the nonlinear state-dependent control term to model the nonlinearity in control input. For linear control, this encoded term is considered the new control variable instead, ensuring the linearity of the embedding space. Then we deploy Linear Quadratic Regulator (LQR) on the linear embedding space to derive the optimal control policy and decode the actual control input from the control net. Experimental results demonstrate that our approach outperforms other existing methods, reducing the prediction error by order-of-magnitude and achieving superior control performance in several nonlinear dynamic systems like damping pendulum, CartPole, and 7 Dof robotic manipulator.  ( 2 min )
    Can Deep Learning be Applied to Model-Based Multi-Object Tracking?. (arXiv:2202.07909v1 [cs.LG])
    Multi-object tracking (MOT) is the problem of tracking the state of an unknown and time-varying number of objects using noisy measurements, with important applications such as autonomous driving, tracking animal behavior, defense systems, and others. In recent years, deep learning (DL) has been increasingly used in MOT for improving tracking performance, but mostly in settings where the measurements are high-dimensional and there are no available models of the measurement likelihood and the object dynamics. The model-based setting instead has not attracted as much attention, and it is still unclear if DL methods can outperform traditional model-based Bayesian methods, which are the state of the art (SOTA) in this context. In this paper, we propose a Transformer-based DL tracker and evaluate its performance in the model-based setting, comparing it to SOTA model-based Bayesian methods in a variety of different tasks. Our results show that the proposed DL method can match the performance of the model-based methods in simple tasks, while outperforming them when the task gets more complicated, either due to an increase in the data association complexity, or to stronger nonlinearities of the models of the environment.  ( 2 min )
    A multi-reconstruction study of breast density estimation using Deep Learning. (arXiv:2202.08238v1 [eess.IV])
    Breast density estimation is one of the key tasks in recognizing individuals predisposed to breast cancer. It is often challenging because of low contrast and fluctuations in mammograms' fatty tissue background. Most of the time, the breast density is estimated manually where a radiologist assigns one of the four density categories decided by the Breast Imaging and Reporting Data Systems (BI-RADS). There have been efforts in the direction of automating a breast density classification pipeline. Breast density estimation is one of the key tasks performed during a screening exam. Dense breasts are more susceptible to breast cancer. The density estimation is challenging because of low contrast and fluctuations in mammograms' fatty tissue background. Traditional mammograms are being replaced by tomosynthesis and its other low radiation dose variants (for example Hologic' Intelligent 2D and C-View). Because of the low-dose requirement, increasingly more screening centers are favoring the Intelligent 2D view and C-View. Deep-learning studies for breast density estimation use only a single modality for training a neural network. However, doing so restricts the number of images in the dataset. In this paper, we show that a neural network trained on all the modalities at once performs better than a neural network trained on any single modality. We discuss these results using the area under the receiver operator characteristics curves.
    Learning to predict synchronization of coupled oscillators on heterogeneous graphs. (arXiv:2012.14048v2 [math.DS] UPDATED)
    Suppose we are given a system of coupled oscillators on an unknown graph along with the trajectory of the system during some short period. Can we predict whether the system will eventually synchronize? Even with known underlying graph structure, this is an important but analytically intractable question in general. In this work, we take a novel approach that we call ``learning to predict synchronization'' (L2PSync), by viewing it as a classification problem for sets of initial dynamics into two classes: `synchronizing' or `non-synchronizing'. While a baseline predictor using concentration principle misses a large proportion of synchronizing examples, standard binary classification algorithms trained on large enough datasets of initial dynamics can successfully predict the unseen future of a system on highly heterogeneous sets of unknown graphs with surprising accuracy. In addition, we find that the full graph information gives only marginal improvements over what we can achieve by only using the initial dynamics. We demonstrate our method on three models of continuous and discrete coupled oscillators -- The Kuramoto model, the Firefly Cellular Automata, and the Greenberg-Hastings model. Finally, we show our method applied on larger systems is robust under using initial dynamics partially observed only on some small subgraphs.
    Expand Globally, Shrink Locally: Discriminant Multi-label Learning with Missing Labels. (arXiv:2004.03951v2 [cs.LG] UPDATED)
    In multi-label learning, the issue of missing labels brings a major challenge. Many methods attempt to recovery missing labels by exploiting low-rank structure of label matrix. However, these methods just utilize global low-rank label structure, ignore both local low-rank label structures and label discriminant information to some extent, leaving room for further performance improvement. In this paper, we develop a simple yet effective discriminant multi-label learning (DM2L) method for multi-label learning with missing labels. Specifically, we impose the low-rank structures on all the predictions of instances from the same labels (local shrinking of rank), and a maximally separated structure (high-rank structure) on the predictions of instances from different labels (global expanding of rank). In this way, these imposed low-rank structures can help modeling both local and global low-rank label structures, while the imposed high-rank structure can help providing more underlying discriminability. Our subsequent theoretical analysis also supports these intuitions. In addition, we provide a nonlinear extension via using kernel trick to enhance DM2L and establish a concave-convex objective to learn these models. Compared to the other methods, our method involves the fewest assumptions and only one hyper-parameter. Even so, extensive experiments show that our method still outperforms the state-of-the-art methods.
    An Adversarial Approach to Structural Estimation. (arXiv:2007.06169v2 [econ.EM] UPDATED)
    We propose a new simulation-based estimation method, adversarial estimation, for structural models. The estimator is formulated as the solution to a minimax problem between a generator (which generates synthetic observations using the structural model) and a discriminator (which classifies if an observation is synthetic). The discriminator maximizes the accuracy of its classification while the generator minimizes it. We show that, with a sufficiently rich discriminator, the adversarial estimator attains parametric efficiency under correct specification and the parametric rate under misspecification. We advocate the use of a neural network as a discriminator that can exploit adaptivity properties and attain fast rates of convergence. We apply our method to the elderly's saving decision model and show that our estimator uncovers the bequest motive as an important source of saving across the wealth distribution, not only for the rich.
    BiGCN: A Bi-directional Low-Pass Filtering Graph Neural Network. (arXiv:2101.05519v2 [cs.LG] UPDATED)
    Graph convolutional networks (GCNs) have achieved great success on graph-structured data. Many graph convolutional networks can be thought of as low-pass filters for graph signals. In this paper, we propose a more powerful graph convolutional network, named BiGCN, that extends to bidirectional filtering. Specifically, we not only consider the original graph structure information but also the latent correlation between features, thus BiGCN can filter the signals along with both the original graph and a latent feature-connection graph. Compared with most existing GCNs, BiGCN is more robust and has powerful capacities for feature denoising. We perform node classification and link prediction in citation networks and co-purchase networks with three settings: noise-rate, noise-level and structure-mistakes. Extensive experimental results demonstrate that our model outperforms the state-of-the-art graph neural networks in both clean and artificially noisy data.
    Correlation detection in trees for planted graph alignment. (arXiv:2107.07623v3 [cs.DS] UPDATED)
    Motivated by alignment of correlated sparse random graphs, we introduce a hypothesis testing problem of deciding whether or not two random trees are correlated. We obtain sufficient conditions under which this testing is impossible or feasible. We propose MPAlign, a message-passing algorithm for graph alignment inspired by the tree correlation detection problem. We prove MPAlign to succeed in polynomial time at partial alignment whenever tree detection is feasible. As a result our analysis of tree detection reveals new ranges of parameters for which partial alignment of sparse random graphs is feasible in polynomial time. We then conjecture that graph alignment is not feasible in polynomial time when the associated tree detection problem is impossible. If true, this conjecture together with our sufficient conditions on tree detection impossibility would imply the existence of a hard phase for graph alignment, i.e. a parameter range where alignment cannot be done in polynomial time even though it is known to be feasible in non-polynomial time.
    Regularised Least-Squares Regression with Infinite-Dimensional Output Space. (arXiv:2010.10973v7 [stat.ML] UPDATED)
    This short technical report presents some learning theory results on vector-valued reproducing kernel Hilbert space (RKHS) regression, where the input space is allowed to be non-compact and the output space is a (possibly infinite-dimensional) Hilbert space. Our approach is based on the integral operator technique using spectral theory for non-compact operators. We place a particular emphasis on obtaining results with as few assumptions as possible; as such we only use Chebyshev's inequality, and no effort is made to obtain the best rates or constants.
    Fanoos: Multi-Resolution, Multi-Strength, Interactive Explanations for Learned Systems. (arXiv:2006.12453v8 [cs.AI] UPDATED)
    Machine learning is becoming increasingly important to control the behavior of safety and financially critical components in sophisticated environments, where the inability to understand learned components in general, and neural nets in particular, poses serious obstacles to their adoption. Explainability and interpretability methods for learned systems have gained considerable academic attention, but the focus of current approaches on only one aspect of explanation, at a fixed level of abstraction, and limited if any formal guarantees, prevents those explanations from being digestible by the relevant stakeholders (e.g., end users, certification authorities, engineers) with their diverse backgrounds and situation-specific needs. We introduce Fanoos, a framework for combining formal verification techniques, heuristic search, and user interaction to explore explanations at the desired level of granularity and fidelity. We demonstrate the ability of Fanoos to produce and adjust the abstractness of explanations in response to user requests on a learned controller for an inverted double pendulum and on a learned CPU usage model.
    Bayesian subset selection and variable importance for interpretable prediction and classification. (arXiv:2104.10150v2 [stat.ML] UPDATED)
    Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model $\mathcal{M}$, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single "best" subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via $\mathcal{M}$. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.
    Joint Frequency and Image Space Learning for MRI Reconstruction and Analysis. (arXiv:2007.01441v3 [cs.CV] UPDATED)
    We propose neural network layers that explicitly combine frequency and image feature representations and show that they can be used as a versatile building block for reconstruction from frequency space data. Our work is motivated by the challenges arising in MRI acquisition where the signal is a corrupted Fourier transform of the desired image. The proposed joint learning schemes enable both correction of artifacts native to the frequency space and manipulation of image space representations to reconstruct coherent image structures at every layer of the network. This is in contrast to most current deep learning approaches for image reconstruction that treat frequency and image space features separately and often operate exclusively in one of the two spaces. We demonstrate the advantages of joint convolutional learning for a variety of tasks, including motion correction, denoising, reconstruction from undersampled acquisitions, and combined undersampling and motion correction on simulated and real world multicoil MRI data. The joint models produce consistently high quality output images across all tasks and datasets. When integrated into a state of the art unrolled optimization network with physics-inspired data consistency constraints for undersampled reconstruction, the proposed architectures significantly improve the optimization landscape, which yields an order of magnitude reduction of training time. This result suggests that joint representations are particularly well suited for MRI signals in deep learning networks. Our code and pretrained models are publicly available at https://github.com/nalinimsingh/interlacer.
    Reinforcement Learning for on-line Sequence Transformation. (arXiv:2105.14097v2 [cs.LG] UPDATED)
    A number of problems in the processing of sound and natural language, as well as in other areas, can be reduced to simultaneously reading an input sequence and writing an output sequence of generally different length. There are well developed methods that produce the output sequence based on the entirely known input. However, efficient methods that enable such transformations on-line do not exist. In this paper we introduce an architecture that learns with reinforcement to make decisions about whether to read a token or write another token. This architecture is able to transform potentially infinite sequences on-line. In an experimental study we compare it with state-of-the-art methods for neural machine translation. While it produces slightly worse translations than Transformer, it outperforms the autoencoder with attention, even though our architecture translates texts on-line thereby solving a more difficult problem than both reference methods.
    Doubly robust Thompson sampling for linear payoffs. (arXiv:2102.01229v2 [stat.ML] UPDATED)
    A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm and the rewards of other arms remain missing. The dependence of the arm choice on the past context and reward pairs compounds the complexity of regret analysis. We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling employing the doubly-robust estimator used in missing data literature to Thompson Sampling with contexts (\texttt{LinTS}). Different from previous works relying on missing data techniques (\citet{dimakopoulou2019balanced}, \citet{kim2019doubly}), the proposed algorithm is designed to allow a novel additive regret decomposition leading to an improved regret bound with the order of $\tilde{O}(\phi^{-2}\sqrt{T})$, where $\phi^2$ is the minimum eigenvalue of the covariance matrix of contexts. This is the first regret bound of \texttt{LinTS} using $\phi^2$ without the dimension of the context, $d$. Applying the relationship between $\phi^2$ and $d$, the regret bound of the proposed algorithm is $\tilde{O}(d\sqrt{T})$ in many practical scenarios, improving the bound of \texttt{LinTS} by a factor of $\sqrt{d}$. A benefit of the proposed method is that it utilizes all the context data, chosen or not chosen, thus allowing to circumvent the technical definition of unsaturated arms used in theoretical analysis of \texttt{LinTS}. Empirical studies show the advantage of the proposed algorithm over \texttt{LinTS}.
    Modular multi-source prediction of drug side-effects with DruGNN. (arXiv:2202.08147v1 [q-bio.QM])
    Drug Side-Effects (DSEs) have a high impact on public health, care system costs, and drug discovery processes. Predicting the probability of side-effects, before their occurrence, is fundamental to reduce this impact, in particular on drug discovery. Candidate molecules could be screened before undergoing clinical trials, reducing the costs in time, money, and health of the participants. Drug side-effects are triggered by complex biological processes involving many different entities, from drug structures to protein-protein interactions. To predict their occurrence, it is necessary to integrate data from heterogeneous sources. In this work, such heterogeneous data is integrated into a graph dataset, expressively representing the relational information between different entities, such as drug molecules and genes. The relational nature of the dataset represents an important novelty for drug side-effect predictors. Graph Neural Networks (GNNs) are exploited to predict DSEs on our dataset with very promising results. GNNs are deep learning models that can process graph-structured data, with minimal information loss, and have been applied on a wide variety of biological tasks. Our experimental results confirm the advantage of using relationships between data entities, suggesting interesting future developments in this scope. The experimentation also shows the importance of specific subsets of data in determining associations between drugs and side-effects.
    Self-Supervised Learning of Graph Neural Networks: A Unified Review. (arXiv:2102.10757v4 [cs.LG] UPDATED)
    Deep models trained in supervised mode have achieved remarkable success on a variety of tasks. When labeled samples are limited, self-supervised learning (SSL) is emerging as a new paradigm for making use of large amounts of unlabeled samples. SSL has achieved promising performance on natural language and image learning tasks. Recently, there is a trend to extend such success to graph data using graph neural networks (GNNs). In this survey, we provide a unified review of different ways of training GNNs using SSL. Specifically, we categorize SSL methods into contrastive and predictive models. In either category, we provide a unified framework for methods as well as how these methods differ in each component under the framework. Our unified treatment of SSL methods for GNNs sheds light on the similarities and differences of various methods, setting the stage for developing new methods and algorithms. We also summarize different SSL settings and the corresponding datasets used in each setting. To facilitate methodological development and empirical comparison, we develop a standardized testbed for SSL in GNNs, including implementations of common baseline methods, datasets, and evaluation metrics.
    GLiDE: Generalizable Quadrupedal Locomotion in Diverse Environments with a Centroidal Model. (arXiv:2104.09771v3 [cs.RO] UPDATED)
    Model-free reinforcement learning (RL) for legged locomotion commonly relies on a physics simulator that can accurately predict the behaviors of every degree of freedom of the robot. In contrast, approximate reduced-order models are commonly used for many model predictive control strategies. In this work we abandon the conventional use of high-fidelity dynamics models in RL and we instead seek to understand what can be achieved when using RL with a much simpler centroidal model when applied to quadrupedal locomotion. We show that RL-based control of the accelerations of a centroidal model is surprisingly effective, when combined with a quadratic program to realize the commanded actions via ground contact forces. It allows for a simple reward structure, reduced computational costs, and robust sim-to-real transfer. We show the generality of the method by demonstrating flat-terrain gaits, stepping-stone locomotion, two-legged in-place balance, balance beam locomotion, and direct sim-to-real transfer.
    Exploring the Training Robustness of Distributional Reinforcement Learning against Noisy State Observations. (arXiv:2109.08776v3 [cs.LG] UPDATED)
    In real scenarios, state observations that an agent observes may contain measurement errors or adversarial noises, misleading the agent to take suboptimal actions or even collapse while training. In this paper, we study the training robustness of distributional Reinforcement Learning~(RL), a class of state-of-the-art methods that estimate the whole distribution, as opposed to only the expectation, of the total return. Firstly, we validate the contraction of both expectation-based and distributional Bellman operators in the State-Noisy Markov Decision Process~(SN-MDP), a typical tabular case that incorporates both random and adversarial state observation noises. Beyond SN-MDP, we then analyze the vulnerability of least squared loss in expectation-based RL with either linear or nonlinear function approximation. By contrast, we theoretically characterize the bounded gradient norm of distributional RL loss based on the histogram density estimation. The resulting stable gradients while the optimization in distributional RL accounts for its better training robustness against state observation noises. Finally, extensive experiments on the suite of games verified the convergence of both expectation-based and distributional RL in the SN-MDP-like setting under different strengths of state observation noises. More importantly, in noisy settings beyond SN-MDP, distributional RL is less vulnerable against noisy state observations compared with its expectation-based counterpart.
    Preservation of the Global Knowledge by Not-True Self Knowledge Distillation in Federated Learning. (arXiv:2106.03097v2 [cs.LG] UPDATED)
    In federated learning, a strong global model is collaboratively learned by aggregating clients' locally trained models. Although this precludes the need to access clients' data directly, the global model's convergence often suffers from data heterogeneity. This study starts from an analogy to continual learning and suggests that forgetting could be the bottleneck of federated learning. We observe that the global model forgets the knowledge from previous rounds, and the local training induces forgetting the knowledge outside of the local distribution. Based on our findings, we hypothesize that tackling down forgetting will relieve the data heterogeneity problem. To this end, we propose a novel and effective algorithm, Federated Not-True Distillation (FedNTD), which preserves the global perspective on locally available data only for the not-true classes. In the experiments, FedNTD shows state-of-the-art performance on various setups without compromising data privacy or incurring additional communication costs.
    DeepTx: Deep Learning Beamforming with Channel Prediction. (arXiv:2202.07998v1 [eess.SP])
    Machine learning algorithms have recently been considered for many tasks in the field of wireless communications. Previously, we have proposed the use of a deep fully convolutional neural network (CNN) for receiver processing and shown it to provide considerable performance gains. In this study, we focus on machine learning algorithms for the transmitter. In particular, we consider beamforming and propose a CNN which, for a given uplink channel estimate as input, outputs downlink channel information to be used for beamforming. The CNN is trained in a supervised manner considering both uplink and downlink transmissions with a loss function that is based on UE receiver performance. The main task of the neural network is to predict the channel evolution between uplink and downlink slots, but it can also learn to handle inefficiencies and errors in the whole chain, including the actual beamforming phase. The provided numerical experiments demonstrate the improved beamforming performance.
    Aryl: An Elastic Cluster Scheduler for Deep Learning. (arXiv:2202.07896v1 [cs.DC])
    Companies build separate training and inference GPU clusters for deep learning, and use separate schedulers to manage them. This leads to problems for both training and inference: inference clusters have low GPU utilization when the traffic load is low; training jobs often experience long queueing time due to lack of resources. We introduce Aryl, a new cluster scheduler to address these problems. Aryl introduces capacity loaning to loan idle inference GPU servers for training jobs. It further exploits elastic scaling that scales a training job's GPU allocation to better utilize loaned resources. Capacity loaning and elastic scaling create new challenges to cluster management. When the loaned servers need to be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs and minimize the job completion time (JCT). Aryl addresses these combinatorial problems using principled heuristics. It introduces the notion of server preemption cost which it greedily reduces during server reclaiming. It further relies on the JCT reduction value defined for each additional worker for an elastic job to solve the scheduling problem as a multiple-choice knapsack problem. Prototype implementation on a 64-GPU testbed and large-scale simulation with 15-day traces of over 50,000 production jobs show that Aryl brings 1.53x and 1.50x reductions in average queuing time and JCT, and improves cluster usage by up to 26.9% over the cluster scheduler without capacity loaning or elastic scaling.  ( 2 min )
    Measuring Unintended Memorisation of Unique Private Features in Neural Networks. (arXiv:2202.08099v1 [cs.LG])
    Neural networks pose a privacy risk to training data due to their propensity to memorise and leak information. Focusing on image classification, we show that neural networks also unintentionally memorise unique features even when they occur only once in training data. An example of a unique feature is a person's name that is accidentally present on a training image. Assuming access to the inputs and outputs of a trained model, the domain of the training data, and knowledge of unique features, we develop a score estimating the model's sensitivity to a unique feature by comparing the KL divergences of the model's output distributions given modified out-of-distribution images. Our results suggest that unique features are memorised by multi-layer perceptrons and convolutional neural networks trained on benchmark datasets, such as MNIST, Fashion-MNIST and CIFAR-10. We find that strategies to prevent overfitting (e.g.\ early stopping, regularisation, batch normalisation) do not prevent memorisation of unique features. These results imply that neural networks pose a privacy risk to rarely occurring private information. These risks can be more pronounced in healthcare applications if patient information is present in the training data.  ( 2 min )
    On loss functions and evaluation metrics for music source separation. (arXiv:2202.07968v1 [cs.SD])
    We investigate which loss functions provide better separations via benchmarking an extensive set of those for music source separation. To that end, we first survey the most representative audio source separation losses we identified, to later consistently benchmark them in a controlled experimental setup. We also explore using such losses as evaluation metrics, via cross-correlating them with the results of a subjective test. Based on the observation that the standard signal-to-distortion ratio metric can be misleading in some scenarios, we study alternative evaluation metrics based on the considered losses.  ( 2 min )
    Self-Supervised Class-Cognizant Few-Shot Classification. (arXiv:2202.08149v1 [cs.CV])
    Unsupervised learning is argued to be the dark matter of human intelligence. To build in this direction, this paper focuses on unsupervised learning from an abundance of unlabeled data followed by few-shot fine-tuning on a downstream classification task. To this aim, we extend a recent study on adopting contrastive learning for self-supervised pre-training by incorporating class-level cognizance through iterative clustering and re-ranking and by expanding the contrastive optimization loss to account for it. To our knowledge, our experimentation both in standard and cross-domain scenarios demonstrate that we set a new state-of-the-art (SoTA) in (5-way, 1 and 5-shot) settings of standard mini-ImageNet benchmark as well as the (5-way, 5 and 20-shot) settings of cross-domain CDFSL benchmark. Our code and experimentation can be found in our GitHub repository: https://github.com/ojss/c3lr.  ( 2 min )
    Learning to Generalize across Domains on Single Test Samples. (arXiv:2202.08045v1 [cs.LG])
    We strive to learn a model from a set of source domains that generalizes well to unseen target domains. The main challenge in such a domain generalization scenario is the unavailability of any target domain data during training, resulting in the learned model not being explicitly adapted to the unseen target domains. We propose learning to generalize across domains on single test samples. We leverage a meta-learning paradigm to learn our model to acquire the ability of adaptation with single samples at training time so as to further adapt itself to each single test sample at test time. We formulate the adaptation to the single test sample as a variational Bayesian inference problem, which incorporates the test sample as a conditional into the generation of model parameters. The adaptation to each test sample requires only one feed-forward computation at test time without any fine-tuning or self-supervised training on additional data from the unseen domains. Extensive ablation studies demonstrate that our model learns the ability to adapt models to each single sample by mimicking domain shifts during training. Further, our model achieves at least comparable -- and often better -- performance than state-of-the-art methods on multiple benchmarks for domain generalization.  ( 2 min )
    CenGCN: Centralized Convolutional Networks with Vertex Imbalance for Scale-Free Graphs. (arXiv:2202.07826v1 [cs.LG])
    Graph Convolutional Networks (GCNs) have achieved impressive performance in a wide variety of areas, attracting considerable attention. The core step of GCNs is the information-passing framework that considers all information from neighbors to the central vertex to be equally important. Such equal importance, however, is inadequate for scale-free networks, where hub vertices propagate more dominant information due to vertex imbalance. In this paper, we propose a novel centrality-based framework named CenGCN to address the inequality of information. This framework first quantifies the similarity between hub vertices and their neighbors by label propagation with hub vertices. Based on this similarity and centrality indices, the framework transforms the graph by increasing or decreasing the weights of edges connecting hub vertices and adding self-connections to vertices. In each non-output layer of the GCN, this framework uses a hub attention mechanism to assign new weights to connected non-hub vertices based on their common information with hub vertices. We present two variants CenGCN\_D and CenGCN\_E, based on degree centrality and eigenvector centrality, respectively. We also conduct comprehensive experiments, including vertex classification, link prediction, vertex clustering, and network visualization. The results demonstrate that the two variants significantly outperform state-of-the-art baselines.  ( 2 min )
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v1 [cs.LG])
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class -- in our case, Rademacher complexity -- how much can we (a-priori) constrain this class while maintaining an empirical error comparable to the unconstrained setting. To assess excess capacity in modern architectures, we first extend an existing generalization bound to accommodate function composition and addition, as well as the specific structure of convolutions. This then facilitates studying residual networks through the lens of the accompanying capacity measure. The key quantities driving this measure are the Lipschitz constants of the layers and the (2,1) group norm distance to the initializations of the convolution weights. We show that these quantities (1) can be kept surprisingly small and, (2) since excess capacity unexpectedly increases with task difficulty, this points towards an unnecessarily large capacity of unconstrained models.  ( 2 min )
    Branching Reinforcement Learning. (arXiv:2202.07995v1 [cs.LG])
    In this paper, we propose a novel Branching Reinforcement Learning (Branching RL) model, and investigate both Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model. Unlike standard RL where the trajectory of each episode is a single $H$-step path, branching RL allows an agent to take multiple base actions in a state such that transitions branch out to multiple successor states correspondingly, and thus it generates a tree-structured trajectory. This model finds important applications in hierarchical recommendation systems and online advertising. For branching RL, we establish new Bellman equations and key lemmas, i.e., branching value difference lemma and branching law of total variance, and also bound the total variance by only $O(H^2)$ under an exponentially-large trajectory. For RM and RFE metrics, we propose computationally efficient algorithms BranchVI and BranchRFE, respectively, and derive nearly matching upper and lower bounds. Our results are only polynomial in problem parameters despite exponentially-large trajectories.  ( 2 min )
    GraphNLI: A Graph-based Natural Language Inference Model for Polarity Prediction in Online Debates. (arXiv:2202.08175v1 [cs.CL])
    Online forums that allow participatory engagement between users have been transformative for public discussion of important issues. However, debates on such forums can sometimes escalate into full blown exchanges of hate or misinformation. An important tool in understanding and tackling such problems is to be able to infer the argumentative relation of whether a reply is supporting or attacking the post it is replying to. This so called polarity prediction task is difficult because replies may be based on external context beyond a post and the reply whose polarity is being predicted. We propose GraphNLI, a novel graph-based deep learning architecture that uses graph walk techniques to capture the wider context of a discussion thread in a principled fashion. Specifically, we propose methods to perform root-seeking graph walks that start from a post and captures its surrounding context to generate additional embeddings for the post. We then use these embeddings to predict the polarity relation between a reply and the post it is replying to. We evaluate the performance of our models on a curated debate dataset from Kialo, an online debating platform. Our model outperforms relevant baselines, including S-BERT, with an overall accuracy of 83%.  ( 2 min )
    Towards Battery-Free Machine Learning and Inference in Underwater Environments. (arXiv:2202.08174v1 [cs.LG])
    This paper is motivated by a simple question: Can we design and build battery-free devices capable of machine learning and inference in underwater environments? An affirmative answer to this question would have significant implications for a new generation of underwater sensing and monitoring applications for environmental monitoring, scientific exploration, and climate/weather prediction. To answer this question, we explore the feasibility of bridging advances from the past decade in two fields: battery-free networking and low-power machine learning. Our exploration demonstrates that it is indeed possible to enable battery-free inference in underwater environments. We designed a device that can harvest energy from underwater sound, power up an ultra-low-power microcontroller and on-board sensor, perform local inference on sensed measurements using a lightweight Deep Neural Network, and communicate the inference result via backscatter to a receiver. We tested our prototype in an emulated marine bioacoustics application, demonstrating the potential to recognize underwater animal sounds without batteries. Through this exploration, we highlight the challenges and opportunities for making underwater battery-free inference and machine learning ubiquitous.  ( 2 min )
    Understanding and Improving Graph Injection Attack by Promoting Unnoticeability. (arXiv:2202.08057v1 [cs.LG])
    Recently Graph Injection Attack (GIA) emerges as a practical attack scenario on Graph Neural Networks (GNNs), where the adversary can merely inject few malicious nodes instead of modifying existing nodes or edges, i.e., Graph Modification Attack (GMA). Although GIA has achieved promising results, little is known about why it is successful and whether there is any pitfall behind the success. To understand the power of GIA, we compare it with GMA and find that GIA can be provably more harmful than GMA due to its relatively high flexibility. However, the high flexibility will also lead to great damage to the homophily distribution of the original graph, i.e., similarity among neighbors. Consequently, the threats of GIA can be easily alleviated or even prevented by homophily-based defenses designed to recover the original homophily. To mitigate the issue, we introduce a novel constraint -- homophily unnoticeability that enforces GIA to preserve the homophily, and propose Harmonious Adversarial Objective (HAO) to instantiate it. Extensive experiments verify that GIA with HAO can break homophily-based defenses and outperform previous GIA attacks by a significant margin. We believe our methods can serve for a more reliable evaluation of the robustness of GNNs.  ( 2 min )
    Using Navigational Information to Learn Visual Representations. (arXiv:2202.08114v1 [cs.CV])
    Children learn to build a visual representation of the world from unsupervised exploration and we hypothesize that a key part of this learning ability is the use of self-generated navigational information as a similarity label to drive a learning objective for self-supervised learning. The goal of this work is to exploit navigational information in a visual environment to provide performance in training that exceeds the state-of-the-art self-supervised training. Here, we show that using spatial and temporal information in the pretraining stage of contrastive learning can improve the performance of downstream classification relative to conventional contrastive learning approaches that use instance discrimination to discriminate between two alterations of the same image or two different images. We designed a pipeline to generate egocentric-vision images from a photorealistic ray-tracing environment (ThreeDWorld) and record relevant navigational information for each image. Modifying the Momentum Contrast (MoCo) model, we introduced spatial and temporal information to evaluate the similarity of two views in the pretraining stage instead of instance discrimination. This work reveals the effectiveness and efficiency of contextual information for improving representation learning. The work informs our understanding of the means by which children might learn to see the world without external supervision.  ( 2 min )
    Online Control of Unknown Time-Varying Dynamical Systems. (arXiv:2202.07890v1 [cs.LG])
    We study online control of time-varying linear systems with unknown dynamics in the nonstochastic control model. At a high level, we demonstrate that this setting is \emph{qualitatively harder} than that of either unknown time-invariant or known time-varying dynamics, and complement our negative results with algorithmic upper bounds in regimes where sublinear regret is possible. More specifically, we study regret bounds with respect to common classes of policies: Disturbance Action (SLS), Disturbance Response (Youla), and linear feedback policies. While these three classes are essentially equivalent for LTI systems, we demonstrate that these equivalences break down for time-varying systems. We prove a lower bound that no algorithm can obtain sublinear regret with respect to the first two classes unless a certain measure of system variability also scales sublinearly in the horizon. Furthermore, we show that offline planning over the state linear feedback policies is NP-hard, suggesting hardness of the online learning problem. On the positive side, we give an efficient algorithm that attains a sublinear regret bound against the class of Disturbance Response policies up to the aforementioned system variability term. In fact, our algorithm enjoys sublinear \emph{adaptive} regret bounds, which is a strictly stronger metric than standard regret and is more appropriate for time-varying systems. We sketch extensions to Disturbance Action policies and partial observation, and propose an inefficient algorithm for regret against linear state feedback policies.  ( 2 min )
    A data-driven approach for learning to control computers. (arXiv:2202.08137v1 [cs.LG])
    It would be useful for machines to use computers as humans do so that they can aid us in everyday tasks. This is a setting in which there is also the potential to leverage large-scale expert demonstrations and human judgements of interactive behaviour, which are two ingredients that have driven much recent success in AI. Here we investigate the setting of computer control using keyboard and mouse, with goals specified via natural language. Instead of focusing on hand-designed curricula and specialized action spaces, we focus on developing a scalable method centered on reinforcement learning combined with behavioural priors informed by actual human-computer interactions. We achieve state-of-the-art and human-level mean performance across all tasks within the MiniWob++ benchmark, a challenging suite of computer control problems, and find strong evidence of cross-task transfer. These results demonstrate the usefulness of a unified human-agent interface when training machines to use computers. Altogether our results suggest a formula for achieving competency beyond MiniWob++ and towards controlling computers, in general, as a human would.
    No One Left Behind: Inclusive Federated Learning over Heterogeneous Devices. (arXiv:2202.08036v1 [cs.LG])
    Federated learning (FL) is an important paradigm for training global models from decentralized data in a privacy-preserving way. Existing FL methods usually assume the global model can be trained on any participating client. However, in real applications, the devices of clients are usually heterogeneous, and have different computing power. Although big models like BERT have achieved huge success in AI, it is difficult to apply them to heterogeneous FL with weak clients. The straightforward solutions like removing the weak clients or using a small model to fit all clients will lead to some problems, such as under-representation of dropped clients and inferior accuracy due to data loss or limited model representation ability. In this work, we propose InclusiveFL, a client-inclusive federated learning method to handle this problem. The core idea of InclusiveFL is to assign models of different sizes to clients with different computing capabilities, bigger models for powerful clients and smaller ones for weak clients. We also propose an effective method to share the knowledge among multiple local models with different sizes. In this way, all the clients can participate in the model learning in FL, and the final model can be big and powerful enough. Besides, we propose a momentum knowledge distillation method to better transfer knowledge in big models on powerful clients to the small models on weak clients. Extensive experiments on many real-world benchmark datasets demonstrate the effectiveness of the proposed method in learning accurate models from clients with heterogeneous devices under the FL framework.
    Meta Knowledge Distillation. (arXiv:2202.07940v1 [cs.LG])
    Recent studies pointed out that knowledge distillation (KD) suffers from two degradation problems, the teacher-student gap and the incompatibility with strong data augmentations, making it not applicable to training state-of-the-art models, which are trained with advanced augmentations. However, we observe that a key factor, i.e., the temperatures in the softmax functions for generating probabilities of both the teacher and student models, was mostly overlooked in previous methods. With properly tuned temperatures, such degradation problems of KD can be much mitigated. However, instead of relying on a naive grid search, which shows poor transferability, we propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters. The meta parameters are adaptively adjusted during training according to the gradients of the learning objective. We validate that MKD is robust to different dataset scales, different teacher/student architectures, and different types of data augmentation. With MKD, we achieve the best performance with popular ViT architectures among compared methods that use only ImageNet-1K as training data, ranging from tiny to large models. With ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE that trains for 1,650 epochs.
    Robust Nonparametric Distribution Forecast with Backtest-based Bootstrap and Adaptive Residual Selection. (arXiv:2202.07955v1 [stat.ML])
    Distribution forecast can quantify forecast uncertainty and provide various forecast scenarios with their corresponding estimated probabilities. Accurate distribution forecast is crucial for planning - for example when making production capacity or inventory allocation decisions. We propose a practical and robust distribution forecast framework that relies on backtest-based bootstrap and adaptive residual selection. The proposed approach is robust to the choice of the underlying forecasting model, accounts for uncertainty around the input covariates, and relaxes the independence between residuals and covariates assumption. It reduces the Absolute Coverage Error by more than 63% compared to the classic bootstrap approaches and by 2% - 32% compared to a variety of State-of-the-Art deep learning approaches on in-house product sales data and M4-hourly competition data.
    Domain Adaptive Fake News Detection via Reinforcement Learning. (arXiv:2202.08159v1 [cs.SI])
    With social media being a major force in information consumption, accelerated propagation of fake news has presented new challenges for platforms to distinguish between legitimate and fake news. Effective fake news detection is a non-trivial task due to the diverse nature of news domains and expensive annotation costs. In this work, we address the limitations of existing automated fake news detection models by incorporating auxiliary information (e.g., user comments and user-news interactions) into a novel reinforcement learning-based model called \textbf{RE}inforced \textbf{A}daptive \textbf{L}earning \textbf{F}ake \textbf{N}ews \textbf{D}etection (REAL-FND). REAL-FND exploits cross-domain and within-domain knowledge that makes it robust in a target domain, despite being trained in a different source domain. Extensive experiments on real-world datasets illustrate the effectiveness of the proposed model, especially when limited labeled data is available in the target domain.
    A Polyhedral Study of Lifted Multicuts. (arXiv:2202.08068v1 [cs.DM])
    Fundamental to many applications in data analysis are the decompositions of a graph, i.e. partitions of the node set into component-inducing subsets. One way of encoding decompositions is by multicuts, the subsets of those edges that straddle distinct components. Recently, a lifting of multicuts from a graph $G = (V, E)$ to an augmented graph $\hat G = (V, E \cup F)$ has been proposed in the field of image analysis, with the goal of obtaining a more expressive characterization of graph decompositions in which it is made explicit also for pairs $F \subseteq \tbinom{V}{2} \setminus E$ of non-neighboring nodes whether these are in the same or distinct components. In this work, we study in detail the polytope in $\mathbb{R}^{E \cup F}$ whose vertices are precisely the characteristic vectors of multicuts of $\hat G$ lifted from $G$, connecting it, in particular, to the rich body of prior work on the clique partitioning and multilinear polytope.  ( 2 min )
    A Survey of Pretraining on Graphs: Taxonomy, Methods, and Applications. (arXiv:2202.07893v1 [cs.LG])
    Pretrained Language Models (PLMs) such as BERT have revolutionized the landscape of Natural Language Processing (NLP). Inspired by their proliferation, tremendous efforts have been devoted to Pretrained Graph Models (PGMs). Owing to the powerful model architectures of PGMs, abundant knowledge from massive labeled and unlabeled graph data can be captured. The knowledge implicitly encoded in model parameters can benefit various downstream tasks and help to alleviate several fundamental issues of learning on graphs. In this paper, we provide the first comprehensive survey for PGMs. We firstly present the limitations of graph representation learning and thus introduce the motivation for graph pre-training. Then, we systematically categorize existing PGMs based on a taxonomy from four different perspectives. Next, we present the applications of PGMs in social recommendation and drug discovery. Finally, we outline several promising research directions that can serve as a guideline for future research.  ( 2 min )
    Explainability of Predictive Process Monitoring Results: Can You See My Data Issues?. (arXiv:2202.08041v1 [cs.AI])
    Predictive business process monitoring (PPM) has been around for several years as a use case of process mining. PPM enables foreseeing the future of a business process through predicting relevant information about how a running process instance might end, related performance indicators, and other predictable aspects. A big share of PPM approaches adopts a Machine Learning (ML) technique to address a prediction task, especially non-process-aware PPM approaches. Consequently, PPM inherits the challenges faced by ML approaches. One of these challenges concerns the need to gain user trust in the predictions generated. The field of explainable artificial intelligence (XAI) addresses this issue. However, the choices made, and the techniques employed in a PPM task, in addition to ML model characteristics, influence resulting explanations. A comparison of the influence of different settings on the generated explanations is missing. To address this gap, we investigate the effect of different PPM settings on resulting data fed into an ML model and consequently to a XAI method. We study how differences in resulting explanations may indicate several issues in underlying data. We construct a framework for our experiments including different settings at each stage of PPM with XAI integrated as a fundamental part. Our experiments reveal several inconsistencies, as well as agreements, between data characteristics (and hence expectations about these data), important data used by the ML model as a result of querying it, and explanations of predictions of the investigated ML model.  ( 2 min )
    Generative modeling with projected entangled-pair states. (arXiv:2202.08177v1 [quant-ph])
    We argue and demonstrate that projected entangled-pair states (PEPS) outperform matrix product states significantly for the task of generative modeling of datasets with an intrinsic two-dimensional structure such as images. Our approach builds on a recently introduced algorithm for sampling PEPS, which allows for the efficient optimization and sampling of the distributions.  ( 2 min )
    Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis. (arXiv:2202.07907v1 [cs.SD])
    End-to-end singing voice synthesis (SVS) is attractive due to the avoidance of pre-aligned data. However, the auto learned alignment of singing voice with lyrics is difficult to match the duration information in musical score, which will lead to the model instability or even failure to synthesize voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS framework, named Singing-Tacotron. The main difference between the proposed framework and Tacotron is that the speech can be controlled significantly by the musical score's duration information. Firstly, we propose a global duration control attention mechanism for the SVS model. The attention mechanism can control each phoneme's duration. Secondly, a duration encoder is proposed to learn a set of global transition tokens from the musical score. These transition tokens can help the attention mechanism decide whether moving to the next phoneme or staying at each decoding step. Thirdly, to further improve the model's stability, a dynamic filter is designed to help the model overcome noise interference and pay more attention to local context information. Subjective and objective evaluation verify the effectiveness of the method. Furthermore, the role of global transition tokens and the effect of duration control are explored. Examples of experiments can be found at https://hairuo55.github.io/SingingTacotron.  ( 2 min )
    Geometry of the Minimum Volume Confidence Sets. (arXiv:2202.08180v1 [stat.ML])
    Computation of confidence sets is central to data science and machine learning, serving as the workhorse of A/B testing and underpinning the operation and analysis of reinforcement learning algorithms. This paper studies the geometry of the minimum-volume confidence sets for the multinomial parameter. When used in place of more standard confidence sets and intervals based on bounds and asymptotic approximation, learning algorithms can exhibit improved sample complexity. Prior work showed the minimum-volume confidence sets are the level-sets of a discontinuous function defined by an exact p-value. While the confidence sets are optimal in that they have minimum average volume, computation of membership of a single point in the set is challenging for problems of modest size. Since the confidence sets are level-sets of discontinuous functions, little is apparent about their geometry. This paper studies the geometry of the minimum volume confidence sets by enumerating and covering the continuous regions of the exact p-value function. This addresses a fundamental question in A/B testing: given two multinomial outcomes, how can one determine if their corresponding minimum volume confidence sets are disjoint? We answer this question in a restricted setting.  ( 2 min )
    Extended Unconstrained Features Model for Exploring Deep Neural Collapse. (arXiv:2202.08087v1 [cs.LG])
    The modern strategy for training deep neural networks for classification tasks includes optimizing the network's weights even after the training error vanishes to further push the training loss toward zero. Recently, a phenomenon termed "neural collapse" (NC) has been empirically observed in this training procedure. Specifically, it has been shown that the learned features (the output of the penultimate layer) of within-class samples converge to their mean, and the means of different classes exhibit a certain tight frame structure, which is also aligned with the last layer's weights. Recent papers have shown that minimizers with this structure emerge when optimizing a simplified "unconstrained features model" (UFM) with a regularized cross-entropy loss. In this paper, we further analyze and extend the UFM. First, we study the UFM for the regularized MSE loss, and show that the minimizers' features can be more structured than in the cross-entropy case. This affects also the structure of the weights. Then, we extend the UFM by adding another layer of weights as well as ReLU nonlinearity to the model and generalize our previous results. Finally, we empirically demonstrate the usefulness of our nonlinear extended UFM in modeling the NC phenomenon that occurs with practical networks.  ( 2 min )
    Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients. (arXiv:2202.08132v1 [cs.LG])
    Pruning neural networks at initialization would enable us to find sparse models that retain the accuracy of the original network while consuming fewer computational resources for training and inference. However, current methods are insufficient to enable this optimization and lead to a large degradation in model performance. In this paper, we identify a fundamental limitation in the formulation of current methods, namely that their saliency criteria look at a single step at the start of training without taking into account the trainability of the network. While pruning iteratively and gradually has been shown to improve pruning performance, explicit consideration of the training stage that will immediately follow pruning has so far been absent from the computation of the saliency criterion. To overcome the short-sightedness of existing methods, we propose Prospect Pruning (ProsPr), which uses meta-gradients through the first few steps of optimization to determine which weights to prune. ProsPr combines an estimate of the higher-order effects of pruning on the loss and the optimization trajectory to identify the trainable sub-network. Our method achieves state-of-the-art pruning performance on a variety of vision classification tasks, with less data and in a single shot compared to existing pruning-at-initialization methods.  ( 2 min )
    TimeREISE: Time-series Randomized Evolving Input Sample Explanation. (arXiv:2202.07952v1 [cs.LG])
    Deep neural networks are one of the most successful classifiers across different domains. However, due to their limitations concerning interpretability their use is limited in safety critical context. The research field of explainable artificial intelligence addresses this problem. However, most of the interpretability methods are aligned to the image modality by design. The paper introduces TimeREISE a model agnostic attribution method specifically aligned to success in the context of time series classification. The method shows superior performance compared to existing approaches concerning different well-established measurements. TimeREISE is applicable to any time series classification network, its runtime does not scale in a linear manner concerning the input shape and it does not rely on prior data knowledge.  ( 2 min )
    A Prospective Approach for Human-to-Human Interaction Recognition from Wi-Fi Channel Data using Attention Bidirectional Gated Recurrent Neural Network with GUI Application Implementation. (arXiv:2202.08146v1 [cs.LG])
    With the recent advances in multi-disciplinary human activity recognition techniques, it has become inevitable to find an efficient, economical & privacy-friendly approach for human-to-human mutual interaction recognition in order to breakthrough the modern artificial intelligence centric indoor monitoring & surveillance system. This study initially attempted to set its sights on the already proposed human activity recognition mechanisms and found a void in mutual interaction recognition from Wi-Fi channel information which is convenient & affordable to be utilized. Then it elucidated on the corresponding components of wireless local area network gadgets along with the channel properties, and notable underlying causes of signal & channel perturbation. Thenceforth the study conducted three experiments on human-to-human mutual interaction recognition using the proposed Self-Attention furnished Bidirectional Gated Recurrent Neural Network deep learning model which is perceived to become emergent nowadays for time-series data classification through automated temporal feature extraction. Single pair mutual interaction recognition experiment achieved a maximum of 94% test benchmark while the experiment involving ten subject-pairs secured 88% benchmark with improved classification around interaction-transition region. Demonstration of a graphical user interface executable software designed using PyQt5 python module subsequently portrayed the overall mutual human-interaction recognition procedure, and finally the study concluded with a brief discourse regarding the possible solutions to the handicaps that resulted in curtailments observed in the case of cross-test experiment.  ( 2 min )
    On a Variance Reduction Correction of the Temporal Difference for Policy Evaluation in the Stochastic Continuous Setting. (arXiv:2202.07960v1 [cs.LG])
    This paper deals with solving continuous time, state and action optimization problems in stochastic settings, using reinforcement learning algorithms, and considers the policy evaluation process. We prove that standard learning algorithms based on the discretized temporal difference are doomed to fail when the time discretization tends to zero, because of the stochastic part. We propose a variance-reduction correction of the temporal difference, leading to new learning algorithms that are stable with respect to vanishing time steps. This allows us to give theoretical guarantees of convergence of our algorithms to the solutions of continuous stochastic optimization problems.  ( 2 min )
    HDC-MiniROCKET: Explicit Time Encoding in Time Series Classification with Hyperdimensional Computing. (arXiv:2202.08055v1 [cs.LG])
    Classification of time series data is an important task for many application domains. One of the best existing methods for this task, in terms of accuracy and computation time, is MiniROCKET. In this work, we extend this approach to provide better global temporal encodings using hyperdimensional computing (HDC) mechanisms. HDC (also known as Vector Symbolic Architectures, VSA) is a general method to explicitly represent and process information in high-dimensional vectors. It has previously been used successfully in combination with deep neural networks and other signal processing algorithms. We argue that the internal high-dimensional representation of MiniROCKET is well suited to be complemented by the algebra of HDC. This leads to a more general formulation, HDC-MiniROCKET, where the original algorithm is only a special case. We will discuss and demonstrate that HDC-MiniROCKET can systematically overcome catastrophic failures of MiniROCKET on simple synthetic datasets. These results are confirmed by experiments on the 128 datasets from the UCR time series classification benchmark. The extension with HDC can achieve considerably better results on datasets with high temporal dependence without increasing the computational effort for inference.  ( 2 min )
    Clustering Enabled Few-Shot Load Forecasting. (arXiv:2202.07939v1 [cs.LG])
    While the advanced machine learning algorithms are effective in load forecasting, they often suffer from low data utilization, and hence their superior performance relies on massive datasets. This motivates us to design machine learning algorithms with improved data utilization. Specifically, we consider the load forecasting for a new user in the system by observing only few shots (data points) of its energy consumption. This task is challenging since the limited samples are insufficient to exploit the temporal characteristics, essential for load forecasting. Nonetheless, we notice that there are not too many temporal characteristics for residential loads due to the limited kinds of human lifestyle. Hence, we propose to utilize the historical load profile data from existing users to conduct effective clustering, which mitigates the challenges brought by the limited samples. Specifically, we first design a feature extraction clustering method for categorizing historical data. Then, inheriting the prior knowledge from the clustering results, we propose a two-phase Long Short Term Memory (LSTM) model to conduct load forecasting for new users. The proposed method outperforms the traditional LSTM model, especially when the training sample size fails to cover a whole period (i.e., 24 hours in our task). Extensive case studies on two real-world datasets and one synthetic dataset verify the effectiveness and efficiency of our method.  ( 2 min )
    Learning a Single Neuron for Non-monotonic Activation Functions. (arXiv:2202.08064v1 [stat.ML])
    We study the problem of learning a single neuron $\mathbf{x}\mapsto \sigma(\mathbf{w}^T\mathbf{x})$ with gradient descent (GD). All the existing positive results are limited to the case where $\sigma$ is monotonic. However, it is recently observed that non-monotonic activation functions outperform the traditional monotonic ones in many applications. To fill this gap, we establish learnability without assuming monotonicity. Specifically, when the input distribution is the standard Gaussian, we show that mild conditions on $\sigma$ (e.g., $\sigma$ has a dominating linear part) are sufficient to guarantee the learnability in polynomial time and polynomial samples. Moreover, with a stronger assumption on the activation function, the condition of input distribution can be relaxed to a non-degeneracy of the marginal distribution. We remark that our conditions on $\sigma$ are satisfied by practical non-monotonic activation functions, such as SiLU/Swish and GELU. We also discuss how our positive results are related to existing negative results on training two-layer neural networks.  ( 2 min )
    GAN Estimation of Lipschitz Optimal Transport Maps. (arXiv:2202.07965v1 [stat.ML])
    This paper introduces the first statistically consistent estimator of the optimal transport map between two probability distributions, based on neural networks. Building on theoretical and practical advances in the field of Lipschitz neural networks, we define a Lipschitz-constrained generative adversarial network penalized by the quadratic transportation cost. Then, we demonstrate that, under regularity assumptions, the obtained generator converges uniformly to the optimal transport map as the sample size increases to infinity. Furthermore, we show through a number of numerical experiments that the learnt mapping has promising performances. In contrast to previous work tackling either statistical guarantees or practicality, we provide an expressive and feasible estimator which paves way for optimal transport applications where the asymptotic behaviour must be certified.  ( 2 min )
    Discriminative Adversarial Domain Generalization with Meta-learning based Cross-domain Validation. (arXiv:2011.00444v2 [cs.LG] UPDATED)
    The generalization capability of machine learning models, which refers to generalizing the knowledge for an "unseen" domain via learning from one or multiple seen domain(s), is of great importance to develop and deploy machine learning applications in the real-world conditions. Domain Generalization (DG) techniques aim to enhance such generalization capability of machine learning models, where the learnt feature representation and the classifier are two crucial factors to improve generalization and make decisions. In this paper, we propose Discriminative Adversarial Domain Generalization (DADG) with meta-learning-based cross-domain validation. Our proposed framework contains two main components that work synergistically to build a domain-generalized DNN model: (i) discriminative adversarial learning, which proactively learns a generalized feature representation on multiple "seen" domains, and (ii) meta-learning based cross-domain validation, which simulates train/test domain shift via applying meta-learning techniques in the training process. In the experimental evaluation, a comprehensive comparison has been made among our proposed approach and other existing approaches on three benchmark datasets. The results shown that DADG consistently outperforms a strong baseline DeepAll, and outperforms the other existing DG algorithms in most of the evaluation cases.
    An $L^2$ Analysis of Reinforcement Learning in High Dimensions with Kernel and Neural Network Approximation. (arXiv:2104.07794v3 [cs.LG] UPDATED)
    Reinforcement learning (RL) algorithms based on high-dimensional function approximation have achieved tremendous empirical success in large-scale problems with an enormous number of states. However, most analysis of such algorithms gives rise to error bounds that involve either the number of states or the number of features. This paper considers the situation where the function approximation is made either using the kernel method or the two-layer neural network model, in the context of a fitted Q-iteration algorithm with explicit regularization. We establish an $\tilde{O}(H^3|\mathcal {A}|^{\frac14}n^{-\frac14})$ bound for the optimal policy with $Hn$ samples, where $H$ is the length of each episode and $|\mathcal {A}|$ is the size of action space. Our analysis hinges on analyzing the $L^2$ error of the approximated Q-function using $n$ data points. Even though this result still requires a finite-sized action space, the error bound is independent of the dimensionality of the state space.
    Fair Disaster Containment via Graph-Cut Problems. (arXiv:2106.05424v4 [cs.DS] UPDATED)
    Graph cut problems are fundamental in Combinatorial Optimization, and are a central object of study in both theory and practice. Furthermore, the study of \emph{fairness} in Algorithmic Design and Machine Learning has recently received significant attention, with many different notions proposed and analyzed for a variety of contexts. In this paper we initiate the study of fairness for graph cut problems by giving the first fair definitions for them, and subsequently we demonstrate appropriate algorithmic techniques that yield a rigorous theoretical analysis. Specifically, we incorporate two different notions of fairness, namely \emph{demographic} and \emph{probabilistic individual} fairness, in a particular cut problem that models disaster containment scenarios. Our results include a variety of approximation algorithms with provable theoretical guarantees.
    Cross-Modal Common Representation Learning with Triplet Loss Functions. (arXiv:2202.07901v1 [cs.LG])
    Common representation learning (CRL) learns a shared embedding between two or more modalities to improve in a given task over using only one of the modalities. CRL from different data types such as images and time-series data (e.g., audio or text data) requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the triplet loss, which uses positive and negative identities to create sample pairs with different labels, for CRL between image and time-series modalities. By adapting the triplet loss for CRL, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. Our experiments on synthetic data and handwriting recognition data from sensor-enhanced pens show an improved classification accuracy, faster convergence, and a better generalizability.
    On the Role of Channel Capacity in Learning Gaussian Mixture Models. (arXiv:2202.07707v1 [cs.IT])
    This paper studies the sample complexity of learning the $k$ unknown centers of a balanced Gaussian mixture model (GMM) in $\mathbb{R}^d$ with spherical covariance matrix $\sigma^2\mathbf{I}$. In particular, we are interested in the following question: what is the maximal noise level $\sigma^2$, for which the sample complexity is essentially the same as when estimating the centers from labeled measurements? To that end, we restrict attention to a Bayesian formulation of the problem, where the centers are uniformly distributed on the sphere $\sqrt{d}\mathcal{S}^{d-1}$. Our main results characterize the exact noise threshold $\sigma^2$ below which the GMM learning problem, in the large system limit $d,k\to\infty$, is as easy as learning from labeled observations, and above which it is substantially harder. The threshold occurs at $\frac{\log k}{d} = \frac12\log\left( 1+\frac{1}{\sigma^2} \right)$, which is the capacity of the additive white Gaussian noise (AWGN) channel. Thinking of the set of $k$ centers as a code, this noise threshold can be interpreted as the largest noise level for which the error probability of the code over the AWGN channel is small. Previous works on the GMM learning problem have identified the minimum distance between the centers as a key parameter in determining the statistical difficulty of learning the corresponding GMM. While our results are only proved for GMMs whose centers are uniformly distributed over the sphere, they hint that perhaps it is the decoding error probability associated with the center constellation as a channel code that determines the statistical difficulty of learning the corresponding GMM, rather than just the minimum distance.  ( 2 min )
    Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations. (arXiv:2202.07800v1 [cs.CV])
    Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Complete leverage of these image tokens brings redundant computations since not all the tokens are attentive in MHSA. Examples include that tokens containing semantically meaningless or distractive image backgrounds do not positively contribute to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between MHSA and FFN (i.e., feed-forward network) modules, which is guided by the corresponding class token attention. Then, we reorganize image tokens by preserving attentive image tokens and fusing inattentive ones to expedite subsequent MHSA and FFN computations. To this end, our method EViT improves ViTs from two perspectives. First, under the same amount of input image tokens, our method reduces MHSA and FFN computation for efficient inference. For instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy is decreased by only 0.3% for ImageNet classification. Second, by maintaining the same computational cost, our method empowers ViTs to take more image tokens as input for recognition accuracy improvement, where the image tokens are from higher resolution images. An example is that we improve the recognition accuracy of DeiT-S by 1% for ImageNet classification at the same computational cost of a vanilla DeiT-S. Meanwhile, our method does not introduce more parameters to ViTs. Experiments on the standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit  ( 2 min )
    Deeply-Supervised Knowledge Distillation. (arXiv:2202.07846v1 [cs.LG])
    Knowledge distillation aims to enhance the performance of a lightweight student model by exploiting the knowledge from a pre-trained cumbersome teacher model. However, in the traditional knowledge distillation, teacher predictions are only used to provide the supervisory signal for the last layer of the student model, which may result in those shallow student layers lacking accurate training guidance in the layer-by-layer back propagation and thus hinders effective knowledge transfer. To address this issue, we propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes class predictions and feature maps of the teacher model to supervise the training of shallow student layers. A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer, so as to further improve the student performance. Extensive experiments show that the performance of DSKD consistently exceeds state-of-the-art methods on various teacher-student models, confirming the effectiveness of our proposed method.  ( 2 min )
    Active Uncertainty Learning for Human-Robot Interaction: An Implicit Dual Control Approach. (arXiv:2202.07720v1 [cs.RO])
    Predictive models are effective in reasoning about human motion, a crucial part that affects safety and efficiency in human-robot interaction. However, robots often lack access to certain key parameters of such models, for example, human's objectives, their level of distraction, and willingness to cooperate. Dual control theory addresses this challenge by treating unknown parameters as stochastic hidden states and identifying their values using information gathered during control of the robot. Despite its ability to optimally and automatically trade off exploration and exploitation, dual control is computationally intractable for general human-in-the-loop motion planning, mainly due to nested trajectory optimization and human intent prediction. In this paper, we present a novel algorithmic approach to enable active uncertainty learning for human-in-the-loop motion planning based on the implicit dual control paradigm. Our approach relies on sampling-based approximation of stochastic dynamic programming, leading to a model predictive control problem that can be readily solved by real-time gradient-based optimization methods. The resulting policy is shown to preserve the dual control effect for generic human predictive models with both continuous and categorical uncertainty. The efficacy of our approach is demonstrated with simulated driving examples.  ( 2 min )
    Policy Learning and Evaluation with Randomized Quasi-Monte Carlo. (arXiv:2202.07808v1 [cs.LG])
    Reinforcement learning constantly deals with hard integrals, for example when computing expectations in policy evaluation and policy iteration. These integrals are rarely analytically solvable and typically esimated with the Monte Carlo method, which induces high variance in policy values and gradients. In this work, we propose to replace Monte Carlo samples with low-discrepancy point sets. We combine policy gradient methods with Randomized Quasi-Monte Carlo, yielding variance-reduced formulations of policy gradient and actor-critic algorithms. These formulations are effective for policy evaluation and policy improvement, as they outperform state-of-the-art algorithms on standardized continuous control benchmarks. Our empirical analyses validate the intuition that replacing Monte Carlo with Quasi-Monte Carlo yields significantly more accurate gradient estimates.  ( 2 min )
    Privacy-Preserving Graph Neural Network Training and Inference as a Cloud Service. (arXiv:2202.07835v1 [cs.CR])
    Graphs are widely used to model the complex relationships among entities. As a powerful tool for graph analytics, graph neural networks (GNNs) have recently gained wide attention due to its end-to-end processing capabilities. With the proliferation of cloud computing, it is increasingly popular to deploy the services of complex and resource-intensive model training and inference in the cloud due to its prominent benefits. However, GNN training and inference services, if deployed in the cloud, will raise critical privacy concerns about the information-rich and proprietary graph data (and the resulting model). While there has been some work on secure neural network training and inference, they all focus on convolutional neural networks handling images and text rather than complex graph data with rich structural information. In this paper, we design, implement, and evaluate SecGNN, the first system supporting privacy-preserving GNN training and inference services in the cloud. SecGNN is built from a synergy of insights on lightweight cryptography and machine learning techniques. We deeply examine the procedure of GNN training and inference, and devise a series of corresponding secure customized protocols to support the holistic computation. Extensive experiments demonstrate that SecGNN achieves comparable plaintext training and inference accuracy, with practically affordable performance.  ( 2 min )
    Low Latency Real-Time Seizure Detection Using Transfer Deep Learning. (arXiv:2202.07796v1 [eess.SP])
    Scalp electroencephalogram (EEG) signals inherently have a low signal-to-noise ratio due to the way the signal is electrically transduced. Temporal and spatial information must be exploited to achieve accurate detection of seizure events. Most popular approaches to seizure detection using deep learning do not jointly model this information or require multiple passes over the signal, which makes the systems inherently non-causal. In this paper, we exploit both simultaneously by converting the multichannel signal to a grayscale image and using transfer learning to achieve high performance. The proposed system is trained end-to-end with only very simple pre- and postprocessing operations which are computationally lightweight and have low latency, making them conducive to clinical applications that require real-time processing. We have achieved a performance of 42.05% sensitivity with 5.78 false alarms per 24 hours on the development dataset of v1.5.2 of the Temple University Hospital Seizure Detection Corpus. On a single-core CPU operating at 1.7 GHz, the system runs faster than real-time (0.58 xRT), uses 16 Gbytes of memory, and has a latency of 300 msec.  ( 2 min )
    Reducing Overconfidence Predictions for Autonomous Driving Perception. (arXiv:2202.07825v1 [cs.CV])
    In state-of-the-art deep learning for object recognition, SoftMax and Sigmoid functions are most commonly employed as the predictor outputs. Such layers often produce overconfident predictions rather than proper probabilistic scores, which can thus harm the decision-making of `critical' perception systems applied in autonomous driving and robotics. Given this, the experiments in this work propose a probabilistic approach based on distributions calculated out of the Logit layer scores of pre-trained networks. We demonstrate that Maximum Likelihood (ML) and Maximum a-Posteriori (MAP) functions are more suitable for probabilistic interpretations than SoftMax and Sigmoid-based predictions for object recognition. We explore distinct sensor modalities via RGB images and LiDARs (RV: range-view) data from the KITTI and Lyft Level-5 datasets, where our approach shows promising performance compared to the usual SoftMax and Sigmoid layers, with the benefit of enabling interpretable probabilistic predictions. Another advantage of the approach introduced in this paper is that the ML and MAP functions can be implemented in existing trained networks, that is, the approach benefits from the output of the Logit layer of pre-trained networks. Thus, there is no need to carry out a new training phase since the ML and MAP functions are used in the test/prediction phase.  ( 2 min )
    LIMREF: Local Interpretable Model Agnostic Rule-based Explanations for Forecasting, with an Application to Electricity Smart Meter Data. (arXiv:2202.07766v1 [cs.LG])
    Accurate electricity demand forecasts play a crucial role in sustainable power systems. To enable better decision-making especially for demand flexibility of the end-user, it is necessary to provide not only accurate but also understandable and actionable forecasts. To provide accurate forecasts Global Forecasting Models (GFM) trained across time series have shown superior results in many demand forecasting competitions and real-world applications recently, compared with univariate forecasting approaches. We aim to fill the gap between the accuracy and the interpretability in global forecasting approaches. In order to explain the global model forecasts, we propose Local Interpretable Model-agnostic Rule-based Explanations for Forecasting (LIMREF), a local explainer framework that produces k-optimal impact rules for a particular forecast, considering the global forecasting model as a black-box model, in a model-agnostic way. It provides different types of rules that explain the forecast of the global model and the counterfactual rules, which provide actionable insights for potential changes to obtain different outputs for given instances. We conduct experiments using a large-scale electricity demand dataset with exogenous features such as temperature and calendar effects. Here, we evaluate the quality of the explanations produced by the LIMREF framework in terms of both qualitative and quantitative aspects such as accuracy, fidelity, and comprehensibility and benchmark those against other local explainers.  ( 2 min )
    Graph-Augmented Normalizing Flows for Anomaly Detection of Multiple Time Series. (arXiv:2202.07857v1 [cs.LG])
    Anomaly detection is a widely studied task for a broad variety of data types; among them, multiple time series appear frequently in applications, including for example, power grids and traffic networks. Detecting anomalies for multiple time series, however, is a challenging subject, owing to the intricate interdependencies among the constituent series. We hypothesize that anomalies occur in low density regions of a distribution and explore the use of normalizing flows for unsupervised anomaly detection, because of their superior quality in density estimation. Moreover, we propose a novel flow model by imposing a Bayesian network among constituent series. A Bayesian network is a directed acyclic graph (DAG) that models causal relationships; it factorizes the joint probability of the series into the product of easy-to-evaluate conditional probabilities. We call such a graph-augmented normalizing flow approach GANF and propose joint estimation of the DAG with flow parameters. We conduct extensive experiments on real-world datasets and demonstrate the effectiveness of GANF for density estimation, anomaly detection, and identification of time series distribution drift.  ( 2 min )
    CycleGAN for Undamaged-to-Damaged Domain Translation for Structural Health Monitoring and Damage Detection. (arXiv:2202.07831v1 [cs.LG])
    The accelerated advancements in the data science field in the last few decades has benefitted many other fields including Structural Health Monitoring (SHM). Particularly, the employment of Artificial Intelligence (AI) such as Machine Learning (ML) and Deep Learning (DL) methods towards vibration-based damage diagnostics of civil structures have seen a great interest due to their nature of supreme performance in learning from data. Along with diagnostics, damage prognostics also hold a vital prominence, such as estimating the remaining useful life of civil structures. Currently used AI-based data-driven methods for damage diagnostics and prognostics are centered on historical data of the structures and require a substantial amount of data to directly form the prediction models. Although some of these methods are generative-based models, after learning the distribution of the data, they are used to perform ML or DL tasks such as classification, regression, clustering, etc. In this study, a variant of Generative Adversarial Networks (GAN), Cycle-Consistent Wasserstein Deep Convolutional GAN with Gradient Penalty (CycleWDCGAN-GP) model is used to answer some of the most important questions in SHM: "How does the dynamic signature of a structure transition from undamaged to damaged conditions?" and "What is the nature of such transition?". The outcomes of this study demonstrate that the proposed model can accurately generate the possible future responses of a structure for potential future damaged conditions. In other words, with the proposed methodology, the stakeholders will be able to understand the damaged condition of structures while the structures are still in healthy (undamaged) conditions. This tool will enable them to be more proactive in overseeing the life cycle performance of structures as well as assist in remaining useful life predictions.  ( 2 min )
    Data-Driven Minimax Optimization with Expectation Constraints. (arXiv:2202.07868v1 [cs.LG])
    Attention to data-driven optimization approaches, including the well-known stochastic gradient descent method, has grown significantly over recent decades, but data-driven constraints have rarely been studied, because of the computational challenges of projections onto the feasible set defined by these hard constraints. In this paper, we focus on the non-smooth convex-concave stochastic minimax regime and formulate the data-driven constraints as expectation constraints. The minimax expectation constrained problem subsumes a broad class of real-world applications, including two-player zero-sum game and data-driven robust optimization. We propose a class of efficient primal-dual algorithms to tackle the minimax expectation-constrained problem, and show that our algorithms converge at the optimal rate of $\mathcal{O}(\frac{1}{\sqrt{N}})$. We demonstrate the practical efficiency of our algorithms by conducting numerical experiments on large-scale real-world applications.  ( 2 min )
    BB-ML: Basic Block Performance Prediction using Machine Learning Techniques. (arXiv:2202.07798v1 [cs.LG])
    Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at much finer granularity, namely at the levels of Basic Block (BB), which are the single entry-single exit code blocks that are used as analysis tools by all compilers to break down a large code into manageable pieces. Utilizing ML and BB analysis together can enable scalable hardware-software co-design beyond the current state of the art. In this work, we extrapolate the basic block execution counts of GPU applications for large inputs sizes from the counts of smaller input sizes of the same application. We employ two ML models, a Poisson Neural Network (PNN) and a Bayesian Regularization Backpropagation Neural Network (BR-BPNN). We train both models using the lowest input values of the application and random input values to predict basic block counts. Results show that our models accurately predict the basic block execution counts of 16 benchmark applications. For PNN and BR-BPNN models, we achieve an average accuracy of 93.5% and 95.6%, respectively, while extrapolating the basic block counts for large input sets when the model is trained using smaller input sets. Additionally, the models show an average accuracy of 97.7% and 98.1%, respectively, while predicting basic block counts on random instances.  ( 2 min )
    General-purpose, long-context autoregressive modeling with Perceiver AR. (arXiv:2202.07765v1 [cs.LG])
    Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books.  ( 2 min )
    Generative Adversarial Network-Driven Detection of Adversarial Tasks in Mobile Crowdsensing. (arXiv:2202.07802v1 [cs.CR])
    Mobile Crowdsensing systems are vulnerable to various attacks as they build on non-dedicated and ubiquitous properties. Machine learning (ML)-based approaches are widely investigated to build attack detection systems and ensure MCS systems security. However, adversaries that aim to clog the sensing front-end and MCS back-end leverage intelligent techniques, which are challenging for MCS platform and service providers to develop appropriate detection frameworks against these attacks. Generative Adversarial Networks (GANs) have been applied to generate synthetic samples, that are extremely similar to the real ones, deceiving classifiers such that the synthetic samples are indistinguishable from the originals. Previous works suggest that GAN-based attacks exhibit more crucial devastation than empirically designed attack samples, and result in low detection rate at the MCS platform. With this in mind, this paper aims to detect intelligently designed illegitimate sensing service requests by integrating a GAN-based model. To this end, we propose a two-level cascading classifier that combines the GAN discriminator with a binary classifier to prevent adversarial fake tasks. Through simulations, we compare our results to a single-level binary classifier, and the numeric results show that proposed approach raises Adversarial Attack Detection Rate (AADR), from $0\%$ to $97.5\%$ by KNN/NB, from $45.9\%$ to $100\%$ by Decision Tree. Meanwhile, with two-levels classifiers, Original Attack Detection Rate (OADR) improves for the three binary classifiers, with comparison, such as NB from $26.1\%$ to $61.5\%$.  ( 2 min )
    Simulating Malicious Attacks on VANETs for Connected and Autonomous Vehicle Cybersecurity: A Machine Learning Dataset. (arXiv:2202.07704v1 [cs.CR])
    Connected and Autonomous Vehicles (CAVs) rely on Vehicular Adhoc Networks with wireless communication between vehicles and roadside infrastructure to support safe operation. However, cybersecurity attacks pose a threat to VANETs and the safe operation of CAVs. This study proposes the use of simulation for modelling typical communication scenarios which may be subject to malicious attacks. The Eclipse MOSAIC simulation framework is used to model two typical road scenarios, including messaging between the vehicles and infrastructure - and both replay and bogus information cybersecurity attacks are introduced. The model demonstrates the impact of these attacks, and provides an open dataset to inform the development of machine learning algorithms to provide anomaly detection and mitigation solutions for enhancing secure communications and safe deployment of CAVs on the road.  ( 2 min )
    Applying adversarial networks to increase the data efficiency and reliability of Self-Driving Cars. (arXiv:2202.07815v1 [cs.CV])
    Convolutional Neural Networks (CNNs) are vulnerable to misclassifying images when small perturbations are present. With the increasing prevalence of CNNs in self-driving cars, it is vital to ensure these algorithms are robust to prevent collisions from occurring due to failure in recognizing a situation. In the Adversarial Self-Driving framework, a Generative Adversarial Network (GAN) is implemented to generate realistic perturbations in an image that cause a classifier CNN to misclassify data. This perturbed data is then used to train the classifier CNN further. The Adversarial Self-driving framework is applied to an image classification algorithm to improve the classification accuracy on perturbed images and is later applied to train a self-driving car to drive in a simulation. A small-scale self-driving car is also built to drive around a track and classify signs. The Adversarial Self-driving framework produces perturbed images through learning a dataset, as a result removing the need to train on significant amounts of data. Experiments demonstrate that the Adversarial Self-driving framework identifies situations where CNNs are vulnerable to perturbations and generates new examples of these situations for the CNN to train on. The additional data generated by the Adversarial Self-driving framework provides sufficient data for the CNN to generalize to the environment. Therefore, it is a viable tool to increase the resilience of CNNs to perturbations. Particularly, in the real-world self-driving car, the application of the Adversarial Self-Driving framework resulted in an 18 % increase in accuracy, and the simulated self-driving model had no collisions in 30 minutes of driving.  ( 2 min )
    Trustworthy Anomaly Detection: A Survey. (arXiv:2202.07787v1 [cs.LG])
    Anomaly detection has a wide range of real-world applications, such as bank fraud detection and cyber intrusion detection. In the past decade, a variety of anomaly detection models have been developed, which lead to big progress towards accurately detecting various anomalies. Despite the successes, anomaly detection models still face many limitations. The most significant one is whether we can trust the detection results from the models. In recent years, the research community has spent a great effort to design trustworthy machine learning models, such as developing trustworthy classification models. However, the attention to anomaly detection tasks is far from sufficient. Considering that many anomaly detection tasks are life-changing tasks involving human beings, labeling someone as anomalies or fraudsters should be extremely cautious. Hence, ensuring the anomaly detection models conducted in a trustworthy fashion is an essential requirement to deploy the models to conduct automatic decisions in the real world. In this brief survey, we summarize the existing efforts and discuss open problems towards trustworthy anomaly detection from the perspectives of interpretability, fairness, robustness, and privacy-preservation.  ( 2 min )
    Beyond Deterministic Translation for Unsupervised Domain Adaptation. (arXiv:2202.07778v1 [cs.CV])
    In this work we challenge the common approach of using a one-to-one mapping ('translation') between the source and target domains in unsupervised domain adaptation (UDA). Instead, we rely on stochastic translation to capture inherent translation ambiguities. This allows us to (i) train more accurate target networks by generating multiple outputs conditioned on the same source image, leveraging both accurate translation and data augmentation for appearance variability, (ii) impute robust pseudo-labels for the target data by averaging the predictions of a source network on multiple translated versions of a single target image and (iii) train and ensemble diverse networks in the target domain by modulating the degree of stochasticity in the translations. We report improvements over strong recent baselines, leading to state-of-the-art UDA results on two challenging semantic segmentation benchmarks.  ( 2 min )
    Speech Denoising in the Waveform Domain with Self-Attention. (arXiv:2202.07790v1 [cs.SD])
    In this work, we present CleanUNet, a causal speech denoising model on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results. The model is optimized through a set of losses defined over both waveform and multi-resolution spectrograms. The proposed method outperforms the state-of-the-art models in terms of denoised speech quality from various objective and subjective evaluation metrics.  ( 2 min )
    Binary Classification for High Dimensional Data using Supervised Non-Parametric Ensemble Method. (arXiv:2202.07779v1 [cs.LG])
    Medical Research data used for prognostication deals with binary classification problems in most of the cases. The endocrinological disorders have data available and it can be leveraged using Machine Learning. The dataset for Polycystic Ovary Syndrome is available, which is termed as an endocrinological disorder in women. Non-Parametric Supervised Ensemble machine learning methods can be used for prediction of the disorder in early stages. In this paper we present the Bootstrap Aggregation Supervised Ensemble Non-parametric method for prognostication that competes state-of-the-art performance with accuracy of over 92% along with in depth analysis of the data.  ( 2 min )
    Safe Reinforcement Learning by Imagining the Near Future. (arXiv:2202.07789v1 [cs.LG])
    Safe reinforcement learning is a promising path toward applying reinforcement learning algorithms to real-world problems, where suboptimal behaviors may lead to actual negative consequences. In this work, we focus on the setting where unsafe states can be avoided by planning ahead a short time into the future. In this setting, a model-based agent with a sufficiently accurate model can avoid unsafe states. We devise a model-based algorithm that heavily penalizes unsafe trajectories, and derive guarantees that our algorithm can avoid unsafe states under certain assumptions. Experiments demonstrate that our algorithm can achieve competitive rewards with fewer safety violations in several continuous control tasks.  ( 2 min )
    Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks. (arXiv:2202.07679v1 [stat.ML])
    Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs or reduce the classification accuracy in the process. This paper proposes a new Kernel-based calibration method called KCal. Unlike other calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, it uses the penultimate-layer latent embedding to train a metric space in a supervised manner. In effect, KCal amounts to a supervised dimensionality reduction of the neural network embedding, and generates a prediction using kernel density estimation on a holdout calibration set. We first analyze KCal theoretically, showing that it enjoys a provable asymptotic calibration guarantee. Then, through extensive experiments, we confirm that KCal consistently outperforms existing calibration methods in terms of both the classification accuracy and the (confidence and class-wise) calibration error.  ( 2 min )
    Spatial Transformer K-Means. (arXiv:2202.07829v1 [cs.LG])
    K-means defines one of the most employed centroid-based clustering algorithms with performances tied to the data's embedding. Intricate data embeddings have been designed to push $K$-means performances at the cost of reduced theoretical guarantees and interpretability of the results. Instead, we propose preserving the intrinsic data space and augment K-means with a similarity measure invariant to non-rigid transformations. This enables (i) the reduction of intrinsic nuisances associated with the data, reducing the complexity of the clustering task and increasing performances and producing state-of-the-art results, (ii) clustering in the input space of the data, leading to a fully interpretable clustering algorithm, and (iii) the benefit of convergence guarantees.  ( 2 min )
    The NLP Task Effectiveness of Long-Range Transformers. (arXiv:2202.07856v1 [cs.CL])
    Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lessen computational complexity, such as Longformer and Performer. While such models have theoretically greater efficiency, their effectiveness on real NLP tasks has not been well studied. We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the effect of pretraining and hyperparameter settings, to focus on their capacity for long-range attention. Moreover, we present various methods to investigate attention behaviors, to illuminate model details beyond metric scores. We find that attention of long-range transformers has advantages on content selection and query-guided decoding, but they come with previously unrecognized drawbacks such as insufficient attention to distant tokens.  ( 2 min )
    A Light-Weight Multi-Objective Asynchronous Hyper-Parameter Optimizer. (arXiv:2202.07735v1 [cs.LG])
    We describe a light-weight yet performant system for hyper-parameter optimization that approximately minimizes an overall scalar cost function that is obtained by combining multiple performance objectives using a target-priority-limit scalarizer. It also supports a trade-off mode, where the goal is to find an appropriate trade-off among objectives by interacting with the user. We focus on the common scenario where there are on the order of tens of hyper-parameters, each with various attributes such as a range of continuous values, or a finite list of values, and whether it should be treated on a linear or logarithmic scale. The system supports multiple asynchronous simulations and is robust to simulation stragglers and failures.  ( 2 min )
    Application of Long Short-Term Memory Recurrent Neural Networks Based on the BAT-MCS for Binary-State Network Approximated Time-Dependent Reliability Problems. (arXiv:2202.07837v1 [eess.SY])
    Reliability is an important tool for evaluating the performance of modern networks. Currently, it is NP-hard and #P-hard to calculate the exact reliability of a binary-state network when the reliability of each component is assumed to be fixed. However, this assumption is unrealistic because the reliability of each component always varies with time. To meet this practical requirement, we propose a new algorithm called the LSTM-BAT-MCS, based on long short-term memory (LSTM), the Monte Carlo simulation (MCS), and the binary-adaption-tree algorithm (BAT). The superiority of the proposed LSTM-BAT-MCS was demonstrated by experimental results of three benchmark networks with at most 10-4 mean square error.  ( 2 min )
    Architecture Agnostic Federated Learning for Neural Networks. (arXiv:2202.07757v1 [cs.LG])
    With growing concerns regarding data privacy and rapid increase in data volume, Federated Learning(FL) has become an important learning paradigm. However, jointly learning a deep neural network model in a FL setting proves to be a non-trivial task because of the complexities associated with the neural networks, such as varied architectures across clients, permutation invariance of the neurons, and presence of non-linear transformations in each layer. This work introduces a novel Federated Heterogeneous Neural Networks (FedHeNN) framework that allows each client to build a personalised model without enforcing a common architecture across clients. This allows each client to optimize with respect to local data and compute constraints, while still benefiting from the learnings of other (potentially more powerful) clients. The key idea of FedHeNN is to use the instance-level representations obtained from peer clients to guide the simultaneous training on each client. The extensive experimental results demonstrate that the FedHeNN framework is capable of learning better performing models on clients in both the settings of homogeneous and heterogeneous architectures across clients.  ( 2 min )
    The efficacy and generalizability of conditional GANs for posterior inference in physics-based inverse problems. (arXiv:2202.07773v1 [stat.ML])
    In this work, we train conditional Wasserstein generative adversarial networks to effectively sample from the posterior of physics-based Bayesian inference problems. The generator is constructed using a U-Net architecture, with the latent information injected using conditional instance normalization. The former facilitates a multiscale inverse map, while the latter enables the decoupling of the latent space dimension from the dimension of the measurement, and introduces stochasticity at all scales of the U-Net. We solve PDE-based inverse problems to demonstrate the performance of our approach in quantifying the uncertainty in the inferred field. Further, we show the generator can learn inverse maps which are local in nature, which in turn promotes generalizability when testing with out-of-distribution samples.  ( 2 min )
    IPD:An Incremental Prototype based DBSCAN for large-scale data with cluster representatives. (arXiv:2202.07870v1 [cs.LG])
    DBSCAN is a fundamental density-based clustering technique that identifies any arbitrary shape of the clusters. However, it becomes infeasible while handling big data. On the other hand, centroid-based clustering is important for detecting patterns in a dataset since unprocessed data points can be labeled to their nearest centroid. However, it can not detect non-spherical clusters. For a large data, it is not feasible to store and compute labels of every samples. These can be done as and when the information is required. The purpose can be accomplished when clustering act as a tool to identify cluster representatives and query is served by assigning cluster labels of nearest representative. In this paper, we propose an Incremental Prototype-based DBSCAN (IPD) algorithm which is designed to identify arbitrary-shaped clusters for large-scale data. Additionally, it chooses a set of representatives for each cluster.  ( 2 min )
  • Open

    Toward Development of Machine Learned Techniques for Production of Compact Kinetic Models. (arXiv:2202.08021v1 [physics.chem-ph])
    Chemical kinetic models are an essential component in the development and optimisation of combustion devices through their coupling to multi-dimensional simulations such as computational fluid dynamics (CFD). Low-dimensional kinetic models which retain good fidelity to the reality are needed, the production of which requires considerable human-time cost and expert knowledge. Here, we present a novel automated compute intensification methodology to produce overly-reduced and optimised (compact) chemical kinetic models. This algorithm, termed Machine Learned Optimisation of Chemical Kinetics (MLOCK), systematically perturbs each of the four sub-models of a chemical kinetic model to discover what combinations of terms results in a good model. A virtual reaction network comprised of n species is first obtained using conventional mechanism reduction. To counteract the imposed decrease in model performance, the weights (virtual reaction rate constants) of important connections (virtual reactions) between each node (species) of the virtual reaction network are numerically optimised to replicate selected calculations across four sequential phases. The first version of MLOCK, (MLOCK1.0) simultaneously perturbs all three virtual Arrhenius reaction rate constant parameters for important connections and assesses the suitability of the new parameters through objective error functions, which quantify the error in each compact model candidate's calculation of the optimisation targets, which may be comprised of detailed model calculations and/or experimental data. MLOCK1.0 is demonstrated by creating compact models for the archetypal case of methane air combustion. It is shown that the NUGMECH1.0 detailed model comprised of 2,789 species is reliably compacted to 15 species (nodes), whilst retaining an overall fidelity of ~87% to the detailed model calculations, outperforming the prior state-of-art.
    Distributionally Robust Weighted $k$-Nearest Neighbors. (arXiv:2006.04004v5 [stat.ML] UPDATED)
    Learning a robust classifier from a few samples remains a key challenge in machine learning. A major thrust of research has been focused on developing $k$-nearest neighbor ($k$-NN) based algorithms combined with metric learning that captures similarities between samples. When the samples are limited, robustness is especially crucial to ensure the generalization capability of the classifier. In this paper, we study a minimax distributionally robust formulation of weighted $k$-nearest neighbors, which aims to find the optimal weighted $k$-NN classifiers that hedge against feature uncertainties. We develop an algorithm, \texttt{Dr.k-NN}, that efficiently solves this functional optimization problem and features in assigning minimax optimal weights to training samples when performing classification. These weights are class-dependent, and are determined by the similarities of sample features under the least favorable scenarios. When the size of the uncertainty set is properly tuned, the robust classifier has a smaller Lipschitz norm than the vanilla $k$-NN, and thus improves the generalization capability. We also couple our framework with neural-network-based feature embedding. We demonstrate the competitive performance of our algorithm compared to the state-of-the-art in the few-training-sample setting with various real-data experiments.
    Robust Nonparametric Distribution Forecast with Backtest-based Bootstrap and Adaptive Residual Selection. (arXiv:2202.07955v1 [stat.ML])
    Distribution forecast can quantify forecast uncertainty and provide various forecast scenarios with their corresponding estimated probabilities. Accurate distribution forecast is crucial for planning - for example when making production capacity or inventory allocation decisions. We propose a practical and robust distribution forecast framework that relies on backtest-based bootstrap and adaptive residual selection. The proposed approach is robust to the choice of the underlying forecasting model, accounts for uncertainty around the input covariates, and relaxes the independence between residuals and covariates assumption. It reduces the Absolute Coverage Error by more than 63% compared to the classic bootstrap approaches and by 2% - 32% compared to a variety of State-of-the-Art deep learning approaches on in-house product sales data and M4-hourly competition data.
    Using the left Gram matrix to cluster high dimensional data. (arXiv:2202.08236v1 [stat.ML])
    For high dimensional data, where P features for N objects (P >> N) are represented in an NxP matrix X, we describe a clustering algorithm based on the normalized left Gram matrix, G = XX'/P. Under certain regularity conditions, the rows in G that correspond to objects in the same cluster converge to the same mean vector. By clustering on the row means, the algorithm does not require preprocessing by dimension reduction or feature selection techniques and does not require specification of tuning or hyperparameter values. Because it is based on the NxN matrix G, it has a lower computational cost than many methods based on clustering the feature matrix X. When compared to 14 other clustering algorithms applied to 32 benchmarked microarray datasets, the proposed algorithm provided the most accurate estimate of the underlying cluster configuration more than twice as often as its closest competitors.
    Geometry of the Minimum Volume Confidence Sets. (arXiv:2202.08180v1 [stat.ML])
    Computation of confidence sets is central to data science and machine learning, serving as the workhorse of A/B testing and underpinning the operation and analysis of reinforcement learning algorithms. This paper studies the geometry of the minimum-volume confidence sets for the multinomial parameter. When used in place of more standard confidence sets and intervals based on bounds and asymptotic approximation, learning algorithms can exhibit improved sample complexity. Prior work showed the minimum-volume confidence sets are the level-sets of a discontinuous function defined by an exact p-value. While the confidence sets are optimal in that they have minimum average volume, computation of membership of a single point in the set is challenging for problems of modest size. Since the confidence sets are level-sets of discontinuous functions, little is apparent about their geometry. This paper studies the geometry of the minimum volume confidence sets by enumerating and covering the continuous regions of the exact p-value function. This addresses a fundamental question in A/B testing: given two multinomial outcomes, how can one determine if their corresponding minimum volume confidence sets are disjoint? We answer this question in a restricted setting.
    Expand Globally, Shrink Locally: Discriminant Multi-label Learning with Missing Labels. (arXiv:2004.03951v2 [cs.LG] UPDATED)
    In multi-label learning, the issue of missing labels brings a major challenge. Many methods attempt to recovery missing labels by exploiting low-rank structure of label matrix. However, these methods just utilize global low-rank label structure, ignore both local low-rank label structures and label discriminant information to some extent, leaving room for further performance improvement. In this paper, we develop a simple yet effective discriminant multi-label learning (DM2L) method for multi-label learning with missing labels. Specifically, we impose the low-rank structures on all the predictions of instances from the same labels (local shrinking of rank), and a maximally separated structure (high-rank structure) on the predictions of instances from different labels (global expanding of rank). In this way, these imposed low-rank structures can help modeling both local and global low-rank label structures, while the imposed high-rank structure can help providing more underlying discriminability. Our subsequent theoretical analysis also supports these intuitions. In addition, we provide a nonlinear extension via using kernel trick to enhance DM2L and establish a concave-convex objective to learn these models. Compared to the other methods, our method involves the fewest assumptions and only one hyper-parameter. Even so, extensive experiments show that our method still outperforms the state-of-the-art methods.
    Bayesian Learning with Wasserstein Barycenters. (arXiv:1805.10833v4 [stat.ML] UPDATED)
    Based on recent developments in optimal transport theory, we propose a novel model-selection strategy for Bayesian learning. More precisely, the goal of this paper is to introduce the Wasserstein barycenter of the posterior law on models, as a Bayesian predictive posterior, alternative to classical choices such as the maximum a posteriori and the model average Bayesian estimators. After formulating the general problem of Bayesian model selection in a common, parameter-free framework, we exhibit conditions granting the existence and statistical consistency of this estimator, discuss some of its general and specific properties, and provide insight into its theoretical advantages. Furthermore, we illustrate how it can be computed using the theoretical stochastic gradient descent (SGD) algorithm in Wasserstein space introduced in a companion paper arXiv:2201.04232v2 [math.OC] , and provide a numerical example for experimental validation of the proposed method.
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v1 [cs.LG])
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class -- in our case, Rademacher complexity -- how much can we (a-priori) constrain this class while maintaining an empirical error comparable to the unconstrained setting. To assess excess capacity in modern architectures, we first extend an existing generalization bound to accommodate function composition and addition, as well as the specific structure of convolutions. This then facilitates studying residual networks through the lens of the accompanying capacity measure. The key quantities driving this measure are the Lipschitz constants of the layers and the (2,1) group norm distance to the initializations of the convolution weights. We show that these quantities (1) can be kept surprisingly small and, (2) since excess capacity unexpectedly increases with task difficulty, this points towards an unnecessarily large capacity of unconstrained models.
    Bayesian subset selection and variable importance for interpretable prediction and classification. (arXiv:2104.10150v2 [stat.ML] UPDATED)
    Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model $\mathcal{M}$, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single "best" subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via $\mathcal{M}$. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.  ( 2 min )
    Enhancing Causal Estimation through Unlabeled Offline Data. (arXiv:2202.07895v1 [stat.ML])
    Consider a situation where a new patient arrives in the Intensive Care Unit (ICU) and is monitored by multiple sensors. We wish to assess relevant unmeasured physiological variables (e.g., cardiac contractility and output and vascular resistance) that have a strong effect on the patients diagnosis and treatment. We do not have any information about this specific patient, but, extensive offline information is available about previous patients, that may only be partially related to the present patient (a case of dataset shift). This information constitutes our prior knowledge, and is both partial and approximate. The basic question is how to best use this prior knowledge, combined with online patient data, to assist in diagnosing the current patient most effectively. Our proposed approach consists of three stages: (i) Use the abundant offline data in order to create both a non-causal and a causal estimator for the relevant unmeasured physiological variables. (ii) Based on the non-causal estimator constructed, and a set of measurements from a new group of patients, we construct a causal filter that provides higher accuracy in the prediction of the hidden physiological variables for this new set of patients. (iii) For any new patient arriving in the ICU, we use the constructed filter in order to predict relevant internal variables. Overall, this strategy allows us to make use of the abundantly available offline data in order to enhance causal estimation for newly arriving patients. We demonstrate the effectiveness of this methodology on a (non-medical) real-world task, in situations where the offline data is only partially related to the new observations. We provide a mathematical analysis of the merits of the approach in a linear setting of Kalman filtering and smoothing, demonstrating its utility.
    GAN Estimation of Lipschitz Optimal Transport Maps. (arXiv:2202.07965v1 [stat.ML])
    This paper introduces the first statistically consistent estimator of the optimal transport map between two probability distributions, based on neural networks. Building on theoretical and practical advances in the field of Lipschitz neural networks, we define a Lipschitz-constrained generative adversarial network penalized by the quadratic transportation cost. Then, we demonstrate that, under regularity assumptions, the obtained generator converges uniformly to the optimal transport map as the sample size increases to infinity. Furthermore, we show through a number of numerical experiments that the learnt mapping has promising performances. In contrast to previous work tackling either statistical guarantees or practicality, we provide an expressive and feasible estimator which paves way for optimal transport applications where the asymptotic behaviour must be certified.
    Understanding and Improving Graph Injection Attack by Promoting Unnoticeability. (arXiv:2202.08057v1 [cs.LG])
    Recently Graph Injection Attack (GIA) emerges as a practical attack scenario on Graph Neural Networks (GNNs), where the adversary can merely inject few malicious nodes instead of modifying existing nodes or edges, i.e., Graph Modification Attack (GMA). Although GIA has achieved promising results, little is known about why it is successful and whether there is any pitfall behind the success. To understand the power of GIA, we compare it with GMA and find that GIA can be provably more harmful than GMA due to its relatively high flexibility. However, the high flexibility will also lead to great damage to the homophily distribution of the original graph, i.e., similarity among neighbors. Consequently, the threats of GIA can be easily alleviated or even prevented by homophily-based defenses designed to recover the original homophily. To mitigate the issue, we introduce a novel constraint -- homophily unnoticeability that enforces GIA to preserve the homophily, and propose Harmonious Adversarial Objective (HAO) to instantiate it. Extensive experiments verify that GIA with HAO can break homophily-based defenses and outperform previous GIA attacks by a significant margin. We believe our methods can serve for a more reliable evaluation of the robustness of GNNs.
    Group Fairness in Bandit Arm Selection. (arXiv:1912.03802v3 [cs.LG] UPDATED)
    We propose a novel formulation of group fairness with biased feedback in the contextual multi-armed bandit (CMAB) setting. In the CMAB setting, a sequential decision maker must, at each time step, choose an arm to pull from a finite set of arms after observing some context for each of the potential arm pulls. In our model, arms are partitioned into two or more sensitive groups based on some protected feature(s) (e.g., age, race, or socio-economic status). Initial rewards received from pulling an arm may be distorted due to some unknown societal or measurement bias. We assume that in reality these groups are equal despite the biased feedback received by the agent. To alleviate this, we learn a societal bias term which can be used to both find the source of bias and to potentially fix the problem outside of the algorithm. We provide a novel algorithm that can accommodate this notion of fairness for an arbitrary number of groups, and provide a theoretical bound on the regret for our algorithm. We validate our algorithm using synthetic data and two real-world datasets for intervention settings wherein we want to allocate resources fairly across groups.
    Settling the Sharp Reconstruction Thresholds of Random Graph Matching. (arXiv:2102.00082v3 [math.ST] UPDATED)
    This paper studies the problem of recovering the hidden vertex correspondence between two edge-correlated random graphs. We focus on the Gaussian model where the two graphs are complete graphs with correlated Gaussian weights and the Erd\H{o}s-R\'enyi model where the two graphs are subsampled from a common parent Erd\H{o}s-R\'enyi graph $\mathcal{G}(n,p)$. For dense graphs with $p=n^{-o(1)}$, we prove that there exists a sharp threshold, above which one can correctly match all but a vanishing fraction of vertices and below which correctly matching any positive fraction is impossible, a phenomenon known as the "all-or-nothing" phase transition. Even more strikingly, in the Gaussian setting, above the threshold all vertices can be exactly matched with high probability. In contrast, for sparse Erd\H{o}s-R\'enyi graphs with $p=n^{-\Theta(1)}$, we show that the all-or-nothing phenomenon no longer holds and we determine the thresholds up to a constant factor. Along the way, we also derive the sharp threshold for exact recovery, sharpening the existing results in Erd\H{o}s-R\'enyi graphs. The proof of the negative results builds upon a tight characterization of the mutual information based on the truncated second-moment computation and an "area theorem" that relates the mutual information to the integral of the reconstruction error. The positive results follows from a tight analysis of the maximum likelihood estimator that takes into account the cycle structure of the induced permutation on the edges.
    Correlation detection in trees for planted graph alignment. (arXiv:2107.07623v3 [cs.DS] UPDATED)
    Motivated by alignment of correlated sparse random graphs, we introduce a hypothesis testing problem of deciding whether or not two random trees are correlated. We obtain sufficient conditions under which this testing is impossible or feasible. We propose MPAlign, a message-passing algorithm for graph alignment inspired by the tree correlation detection problem. We prove MPAlign to succeed in polynomial time at partial alignment whenever tree detection is feasible. As a result our analysis of tree detection reveals new ranges of parameters for which partial alignment of sparse random graphs is feasible in polynomial time. We then conjecture that graph alignment is not feasible in polynomial time when the associated tree detection problem is impossible. If true, this conjecture together with our sufficient conditions on tree detection impossibility would imply the existence of a hard phase for graph alignment, i.e. a parameter range where alignment cannot be done in polynomial time even though it is known to be feasible in non-polynomial time.
    On the Role of Channel Capacity in Learning Gaussian Mixture Models. (arXiv:2202.07707v1 [cs.IT])
    This paper studies the sample complexity of learning the $k$ unknown centers of a balanced Gaussian mixture model (GMM) in $\mathbb{R}^d$ with spherical covariance matrix $\sigma^2\mathbf{I}$. In particular, we are interested in the following question: what is the maximal noise level $\sigma^2$, for which the sample complexity is essentially the same as when estimating the centers from labeled measurements? To that end, we restrict attention to a Bayesian formulation of the problem, where the centers are uniformly distributed on the sphere $\sqrt{d}\mathcal{S}^{d-1}$. Our main results characterize the exact noise threshold $\sigma^2$ below which the GMM learning problem, in the large system limit $d,k\to\infty$, is as easy as learning from labeled observations, and above which it is substantially harder. The threshold occurs at $\frac{\log k}{d} = \frac12\log\left( 1+\frac{1}{\sigma^2} \right)$, which is the capacity of the additive white Gaussian noise (AWGN) channel. Thinking of the set of $k$ centers as a code, this noise threshold can be interpreted as the largest noise level for which the error probability of the code over the AWGN channel is small. Previous works on the GMM learning problem have identified the minimum distance between the centers as a key parameter in determining the statistical difficulty of learning the corresponding GMM. While our results are only proved for GMMs whose centers are uniformly distributed over the sphere, they hint that perhaps it is the decoding error probability associated with the center constellation as a channel code that determines the statistical difficulty of learning the corresponding GMM, rather than just the minimum distance.
    Unsupervised Ground Metric Learning using Wasserstein Singular Vectors. (arXiv:2102.06278v2 [stat.ML] UPDATED)
    Defining meaningful distances between samples, which are columns in a data matrix, is a fundamental problem in machine learning. Optimal Transport (OT) defines geometrically meaningful "Wasserstein" distances between probability distributions. However, a key bottleneck is the design of a "ground" cost which should be adapted to the task under study. OT is parametrized by a distance between the features (the rows of the data matrix): the "ground cost". However, there is usually no straightforward choice of distance on the features, and supervised metric learning is not possible either, leaving only ad-hoc approaches. Unsupervised metric learning is thus a fundamental problem to enable data-driven applications of OT. In this paper, we propose for the first time a canonical answer by simultaneously computing an OT distance between the rows and between the columns of a data matrix. These distance matrices emerge naturally as positive singular vectors of the function mapping ground costs to pairwise OT distances. We provide criteria to ensure the existence and uniqueness of these singular vectors. We then introduce scalable computational methods to approximate them in high-dimensional settings, using entropic regularization and stochastic approximation. First, we extend the definition using entropic regularization, and show that in the large regularization limit it operates a principal component analysis dimensionality reduction. Next, we propose a stochastic approximation scheme and study its convergence. Finally, we showcase Wasserstein Singular Vectors in the context of computational biology on a high-dimensional single-cell RNA-sequencing dataset.
    Taking a Step Back with KCal: Multi-Class Kernel-Based Calibration for Deep Neural Networks. (arXiv:2202.07679v1 [stat.ML])
    Deep neural network (DNN) classifiers are often overconfident, producing miscalibrated class probabilities. Most existing calibration methods either lack theoretical guarantees for producing calibrated outputs or reduce the classification accuracy in the process. This paper proposes a new Kernel-based calibration method called KCal. Unlike other calibration procedures, KCal does not operate directly on the logits or softmax outputs of the DNN. Instead, it uses the penultimate-layer latent embedding to train a metric space in a supervised manner. In effect, KCal amounts to a supervised dimensionality reduction of the neural network embedding, and generates a prediction using kernel density estimation on a holdout calibration set. We first analyze KCal theoretically, showing that it enjoys a provable asymptotic calibration guarantee. Then, through extensive experiments, we confirm that KCal consistently outperforms existing calibration methods in terms of both the classification accuracy and the (confidence and class-wise) calibration error.
    Doubly robust Thompson sampling for linear payoffs. (arXiv:2102.01229v2 [stat.ML] UPDATED)
    A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm and the rewards of other arms remain missing. The dependence of the arm choice on the past context and reward pairs compounds the complexity of regret analysis. We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling employing the doubly-robust estimator used in missing data literature to Thompson Sampling with contexts (\texttt{LinTS}). Different from previous works relying on missing data techniques (\citet{dimakopoulou2019balanced}, \citet{kim2019doubly}), the proposed algorithm is designed to allow a novel additive regret decomposition leading to an improved regret bound with the order of $\tilde{O}(\phi^{-2}\sqrt{T})$, where $\phi^2$ is the minimum eigenvalue of the covariance matrix of contexts. This is the first regret bound of \texttt{LinTS} using $\phi^2$ without the dimension of the context, $d$. Applying the relationship between $\phi^2$ and $d$, the regret bound of the proposed algorithm is $\tilde{O}(d\sqrt{T})$ in many practical scenarios, improving the bound of \texttt{LinTS} by a factor of $\sqrt{d}$. A benefit of the proposed method is that it utilizes all the context data, chosen or not chosen, thus allowing to circumvent the technical definition of unsaturated arms used in theoretical analysis of \texttt{LinTS}. Empirical studies show the advantage of the proposed algorithm over \texttt{LinTS}.
    Wasserstein Graph Neural Networks for Graphs with Missing Attributes. (arXiv:2102.03450v2 [cs.LG] UPDATED)
    Missing node attributes is a common problem in real-world graphs. Graph neural networks have been demonstrated power in graph representation learning while their performance is affected by the completeness of graph information. Most of them are not specified for missing-attribute graphs and fail to leverage incomplete attribute information effectively. In this paper, we propose an innovative node representation learning framework, Wasserstein Graph Neural Network (WGNN), to mitigate the problem. To make the most of limited observed attribute information and capture the uncertainty caused by missing values, we express nodes as low-dimensional distributions derived from the decomposition of the attribute matrix. Furthermore, we strengthen the expressiveness of representations by developing a novel message passing schema that aggregates distributional information from neighbors in the Wasserstein space. We test WGNN in node classification tasks under two missing-attribute cases on both synthetic and real-world datasets. In addition, we find WGNN suitable to recover missing values and adapt them to tackle matrix completion problems with graphs of users and items. Experimental results on both tasks demonstrate the superiority of our method.
    Data-driven modelling of nonlinear dynamics by barycentric coordinates and memory. (arXiv:2112.06742v2 [math.DS] UPDATED)
    We present a numerical method to model dynamical systems from data. We use the recently introduced method Scalable Probabilistic Approximation (SPA) to project points from a Euclidean space to convex polytopes and represent these projected states of a system in new, lower-dimensional coordinates denoting their position in the polytope. We then introduce a specific nonlinear transformation to construct a model of the dynamics in the polytope and to transform back into the original state space. To overcome the potential loss of information from the projection to a lower-dimensional polytope, we use memory in the sense of the delay-embedding theorem of Takens. By construction, our method produces stable models. We illustrate the capacity of the method to reproduce even chaotic dynamics and attractors with multiple connected components on various examples.
    The efficacy and generalizability of conditional GANs for posterior inference in physics-based inverse problems. (arXiv:2202.07773v1 [stat.ML])
    In this work, we train conditional Wasserstein generative adversarial networks to effectively sample from the posterior of physics-based Bayesian inference problems. The generator is constructed using a U-Net architecture, with the latent information injected using conditional instance normalization. The former facilitates a multiscale inverse map, while the latter enables the decoupling of the latent space dimension from the dimension of the measurement, and introduces stochasticity at all scales of the U-Net. We solve PDE-based inverse problems to demonstrate the performance of our approach in quantifying the uncertainty in the inferred field. Further, we show the generator can learn inverse maps which are local in nature, which in turn promotes generalizability when testing with out-of-distribution samples.
    Regularised Least-Squares Regression with Infinite-Dimensional Output Space. (arXiv:2010.10973v7 [stat.ML] UPDATED)
    This short technical report presents some learning theory results on vector-valued reproducing kernel Hilbert space (RKHS) regression, where the input space is allowed to be non-compact and the output space is a (possibly infinite-dimensional) Hilbert space. Our approach is based on the integral operator technique using spectral theory for non-compact operators. We place a particular emphasis on obtaining results with as few assumptions as possible; as such we only use Chebyshev's inequality, and no effort is made to obtain the best rates or constants.
    Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment. (arXiv:2109.11679v3 [stat.ML] UPDATED)
    Algorithmic recommendations and decisions have become ubiquitous in today's society. Many of these and other data-driven policies, especially in the realm of public policy, are based on known, deterministic rules to ensure their transparency and interpretability. For example, algorithmic pre-trial risk assessments, which serve as our motivating application, provide relatively simple, deterministic classification scores and recommendations to help judges make release decisions. How can we use the data based on existing deterministic policies to learn new and better policies? Unfortunately, prior methods for policy learning are not applicable because they require existing policies to be stochastic rather than deterministic. We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy by minimizing the worst-case regret. The resulting policy is conservative but has a statistical safety guarantee, allowing the policy-maker to limit the probability of producing a worse outcome than the existing policy. We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations. Lastly, we apply the proposed methodology to a unique field experiment on pre-trial risk assessment instruments. We derive new classification and recommendation rules that retain the transparency and interpretability of the existing instrument while potentially leading to better overall outcomes at a lower cost.
    An Adversarial Approach to Structural Estimation. (arXiv:2007.06169v2 [econ.EM] UPDATED)
    We propose a new simulation-based estimation method, adversarial estimation, for structural models. The estimator is formulated as the solution to a minimax problem between a generator (which generates synthetic observations using the structural model) and a discriminator (which classifies if an observation is synthetic). The discriminator maximizes the accuracy of its classification while the generator minimizes it. We show that, with a sufficiently rich discriminator, the adversarial estimator attains parametric efficiency under correct specification and the parametric rate under misspecification. We advocate the use of a neural network as a discriminator that can exploit adaptivity properties and attain fast rates of convergence. We apply our method to the elderly's saving decision model and show that our estimator uncovers the bequest motive as an important source of saving across the wealth distribution, not only for the rich.  ( 2 min )
    Distribution-free binary classification: prediction sets, confidence intervals and calibration. (arXiv:2006.10564v4 [stat.ML] UPDATED)
    We study three notions of uncertainty quantification -- calibration, confidence intervals and prediction sets -- for binary classification in the distribution-free setting, that is without making any distributional assumptions on the data. With a focus towards calibration, we establish a 'tripod' of theorems that connect these three notions for score-based classifiers. A direct implication is that distribution-free calibration is only possible, even asymptotically, using a scoring function whose level sets partition the feature space into at most countably many sets. Parametric calibration schemes such as variants of Platt scaling do not satisfy this requirement, while nonparametric schemes based on binning do. To close the loop, we derive distribution-free confidence intervals for binned probabilities for both fixed-width and uniform-mass binning. As a consequence of our 'tripod' theorems, these confidence intervals for binned probabilities lead to distribution-free calibration. We also derive extensions to settings with streaming data and covariate shift.
    Online Control of Unknown Time-Varying Dynamical Systems. (arXiv:2202.07890v1 [cs.LG])
    We study online control of time-varying linear systems with unknown dynamics in the nonstochastic control model. At a high level, we demonstrate that this setting is \emph{qualitatively harder} than that of either unknown time-invariant or known time-varying dynamics, and complement our negative results with algorithmic upper bounds in regimes where sublinear regret is possible. More specifically, we study regret bounds with respect to common classes of policies: Disturbance Action (SLS), Disturbance Response (Youla), and linear feedback policies. While these three classes are essentially equivalent for LTI systems, we demonstrate that these equivalences break down for time-varying systems. We prove a lower bound that no algorithm can obtain sublinear regret with respect to the first two classes unless a certain measure of system variability also scales sublinearly in the horizon. Furthermore, we show that offline planning over the state linear feedback policies is NP-hard, suggesting hardness of the online learning problem. On the positive side, we give an efficient algorithm that attains a sublinear regret bound against the class of Disturbance Response policies up to the aforementioned system variability term. In fact, our algorithm enjoys sublinear \emph{adaptive} regret bounds, which is a strictly stronger metric than standard regret and is more appropriate for time-varying systems. We sketch extensions to Disturbance Action policies and partial observation, and propose an inefficient algorithm for regret against linear state feedback policies.
    Graph-Augmented Normalizing Flows for Anomaly Detection of Multiple Time Series. (arXiv:2202.07857v1 [cs.LG])
    Anomaly detection is a widely studied task for a broad variety of data types; among them, multiple time series appear frequently in applications, including for example, power grids and traffic networks. Detecting anomalies for multiple time series, however, is a challenging subject, owing to the intricate interdependencies among the constituent series. We hypothesize that anomalies occur in low density regions of a distribution and explore the use of normalizing flows for unsupervised anomaly detection, because of their superior quality in density estimation. Moreover, we propose a novel flow model by imposing a Bayesian network among constituent series. A Bayesian network is a directed acyclic graph (DAG) that models causal relationships; it factorizes the joint probability of the series into the product of easy-to-evaluate conditional probabilities. We call such a graph-augmented normalizing flow approach GANF and propose joint estimation of the DAG with flow parameters. We conduct extensive experiments on real-world datasets and demonstrate the effectiveness of GANF for density estimation, anomaly detection, and identification of time series distribution drift.
    Learning Across Bandits in High Dimension via Robust Statistics. (arXiv:2112.14233v2 [stat.ML] UPDATED)
    Decision-makers often face the "many bandits" problem, where one must simultaneously learn across related but heterogeneous contextual bandit instances. For instance, a large retailer may wish to dynamically learn product demand across many stores to solve pricing or inventory problems, making it desirable to learn jointly for stores serving similar customers; alternatively, a hospital network may wish to dynamically learn patient risk across many providers to allocate personalized interventions, making it desirable to learn jointly for hospitals serving similar patient populations. We study the setting where the unknown parameter in each bandit instance can be decomposed into a global parameter plus a sparse instance-specific term. Then, we propose a novel two-stage estimator that exploits this structure in a sample-efficient way by using a combination of robust statistics (to learn across similar instances) and LASSO regression (to debias the results). We embed this estimator within a bandit algorithm, and prove that it improves asymptotic regret bounds in the context dimension $d$; this improvement is exponential for data-poor instances. We further demonstrate how our results depend on the underlying network structure of bandit instances. Finally, we illustrate the value of our approach on synthetic and real datasets.
    Tracking Most Significant Arm Switches in Bandits. (arXiv:2112.13838v4 [cs.LG] UPDATED)
    In bandit with distribution shifts, one aims to automatically adapt to unknown changes in reward distribution, and restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also involving large aggregate differences in reward overtime. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences overtime). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.
    A Prospective Approach for Human-to-Human Interaction Recognition from Wi-Fi Channel Data using Attention Bidirectional Gated Recurrent Neural Network with GUI Application Implementation. (arXiv:2202.08146v1 [cs.LG])
    With the recent advances in multi-disciplinary human activity recognition techniques, it has become inevitable to find an efficient, economical & privacy-friendly approach for human-to-human mutual interaction recognition in order to breakthrough the modern artificial intelligence centric indoor monitoring & surveillance system. This study initially attempted to set its sights on the already proposed human activity recognition mechanisms and found a void in mutual interaction recognition from Wi-Fi channel information which is convenient & affordable to be utilized. Then it elucidated on the corresponding components of wireless local area network gadgets along with the channel properties, and notable underlying causes of signal & channel perturbation. Thenceforth the study conducted three experiments on human-to-human mutual interaction recognition using the proposed Self-Attention furnished Bidirectional Gated Recurrent Neural Network deep learning model which is perceived to become emergent nowadays for time-series data classification through automated temporal feature extraction. Single pair mutual interaction recognition experiment achieved a maximum of 94% test benchmark while the experiment involving ten subject-pairs secured 88% benchmark with improved classification around interaction-transition region. Demonstration of a graphical user interface executable software designed using PyQt5 python module subsequently portrayed the overall mutual human-interaction recognition procedure, and finally the study concluded with a brief discourse regarding the possible solutions to the handicaps that resulted in curtailments observed in the case of cross-test experiment.
    Learning a Single Neuron for Non-monotonic Activation Functions. (arXiv:2202.08064v1 [stat.ML])
    We study the problem of learning a single neuron $\mathbf{x}\mapsto \sigma(\mathbf{w}^T\mathbf{x})$ with gradient descent (GD). All the existing positive results are limited to the case where $\sigma$ is monotonic. However, it is recently observed that non-monotonic activation functions outperform the traditional monotonic ones in many applications. To fill this gap, we establish learnability without assuming monotonicity. Specifically, when the input distribution is the standard Gaussian, we show that mild conditions on $\sigma$ (e.g., $\sigma$ has a dominating linear part) are sufficient to guarantee the learnability in polynomial time and polynomial samples. Moreover, with a stronger assumption on the activation function, the condition of input distribution can be relaxed to a non-degeneracy of the marginal distribution. We remark that our conditions on $\sigma$ are satisfied by practical non-monotonic activation functions, such as SiLU/Swish and GELU. We also discuss how our positive results are related to existing negative results on training two-layer neural networks.
    E(n) Equivariant Graph Neural Networks. (arXiv:2102.09844v3 [cs.LG] UPDATED)
    This paper introduces a new model to learn graph neural networks equivariant to rotations, translations, reflections and permutations called E(n)-Equivariant Graph Neural Networks (EGNNs). In contrast with existing methods, our work does not require computationally expensive higher-order representations in intermediate layers while it still achieves competitive or better performance. In addition, whereas existing methods are limited to equivariance on 3 dimensional spaces, our model is easily scaled to higher-dimensional spaces. We demonstrate the effectiveness of our method on dynamical systems modelling, representation learning in graph autoencoders and predicting molecular properties.
    The Fine-Grained Hardness of Sparse Linear Regression. (arXiv:2106.03131v2 [cs.LG] UPDATED)
    Sparse linear regression is the well-studied inference problem where one is given a design matrix $\mathbf{A} \in \mathbb{R}^{M\times N}$ and a response vector $\mathbf{b} \in \mathbb{R}^M$, and the goal is to find a solution $\mathbf{x} \in \mathbb{R}^{N}$ which is $k$-sparse (that is, it has at most $k$ non-zero coordinates) and minimizes the prediction error $\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2$. On the one hand, the problem is known to be $\mathcal{NP}$-hard which tells us that no polynomial-time algorithm exists unless $\mathcal{P} = \mathcal{NP}$. On the other hand, the best known algorithms for the problem do a brute-force search among $N^k$ possibilities. In this work, we show that there are no better-than-brute-force algorithms, assuming any one of a variety of popular conjectures including the weighted $k$-clique conjecture from the area of fine-grained complexity, or the hardness of the closest vector problem from the geometry of numbers. We also show the impossibility of better-than-brute-force algorithms when the prediction error is measured in other $\ell_p$ norms, assuming the strong exponential-time hypothesis.

  • Open

    [D] Uncertainty about a future in ML
    I recently moved to Montreal from the UK to study a PhD in Computational Neuroscience, and I'm starting to have some sort of quater-life crisis about what it can do for my future. I've wanted to go into a career in data science / ML / Deep learning for a while now (specifically in industry), but I don't have troves of experience (a Bachelors in Genetics and an MSc in Bioinformatics). My PhD offers the chance to work on projects / learn in detail stats, linear algebra, ML applications and engineering neural networks. But I'm lost about whether or not this will be enough to get a decent job in these industries., or whether the fact that I don't have a bachelors / masters in compsci is going to nullify me as a candidate for a lot of (industry) positions. I guess I want to know, could this be worth it? or am I about to spend the next 4 years of my life wasting my time on something which won't help me get where I want to go. I'd love some advice on how to deal with this uncertainty, or how I can utilise this time to put myself in a position where I can break into a decent paying industry job. submitted by /u/Pbro98 [link] [comments]  ( 2 min )
    [D] Paper Explained - CM3: A Causal Masked Multimodal Model of the Internet (Video Analysis and Author Interview)
    https://youtu.be/qNfCVGbvnJc This video contains a paper explanation and an incredibly informative interview with first author Armen Aghajanyan. Autoregressive Transformers have come to dominate many fields in Machine Learning, from text generation to image creation and many more. However, there are two problems. First, the collected data is usually scraped from the web and uni- or bi-modal and throws away a lot of structure of the original websites, and second, language modelling losses are uni-directional. CM3 addresses both problems: It directly operates on HTML and includes text, hyperlinks, and even images (via VQGAN tokenization) and can therefore be used in plenty of ways: Text generation, captioning, image creation, entity linking, and much more. It also introduces a new training strategy called Causally Masked Language Modelling, which brings a level of bi-directionality into autoregressive language modelling. In the interview after the paper explanation, Armen and I go deep into the how and why of these giant models, we go over the stunning results and we make sense of what they mean for the future of universal models. ​ OUTLINE: 0:00 - Intro & Overview 6:30 - Directly learning the structure of HTML 12:30 - Causally Masked Language Modelling 18:50 - A short look at how to use this model 23:20 - Start of interview 25:30 - Feeding language models with HTML 29:45 - How to get bi-directionality into decoder-only Transformers? 37:00 - Images are just tokens 41:15 - How does one train such giant models? 45:40 - CM3 results are amazing 58:20 - Large-scale dataset collection and content filtering 1:04:40 - More experimental results 1:12:15 - Why don't we use raw HTML? 1:18:20 - Does this paper contain too many things? ​ Paper: https://arxiv.org/abs/2201.07520 submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Is the competition/cooperation between symbolic AI and statistical AI (ML) about historical approach to research / engineering, or is it more fundamentally about what intelligent agents "are"?
    I have found that comprehensive overviews of artificial intelligence (Wikipedia, SEP article, Norvig and Russel's AI: A Modern Approach) make reference to symbolic AI and statistical AI in their historical context of the former preceding the latter, their corresponding limitations etc. But I have found it really difficult to dissect this from the question of whether the divide / cooperation between these paradigms are about the implementation of engineering of intelligent agents, or if they are getting at something more fundamental about the space of possible minds (I use this term to be as broad as possible considering anything we would label as a mind, regardless of ontogeny, architecture, physical components etc)? I have given a list of questions below, but some of them are mutually ex…  ( 3 min )
    [D] Alternative approaches to ordinal features?
    The traditional approaches to ordinal variables are to keep them as continuous or one hot encode them. I have a particular feature that's somewhere in the middle. It's not exactly ordinal but it does have an ordinal structure. I'm wondering if there is some appropriate third way that I am unaware of. submitted by /u/akirp001 [link] [comments]  ( 1 min )
    [D] What would you like to know in causal learning and what excites you?
    We are writing up a review paper on causal structure learning (causal discovery). Our aim is to capture a wider audience and bring them to light different aspects of causal learning like observational vs interventional data, structural causal models (causal graphs) and of course giving a thorough review of existing approaches (score based, conditional independence testing based, Bayesian methods and latest trends with smooth optimisation) to tackle the problem of causal discovery. Our aim is to present it in a way which would help researchers (possibly new to this field) in both causal inference and machine learning to benefit from it. Therefore, I wanted to reach out to understand what would you be interested to know in causal structure learning (causal discovery), any aspects which you find a bit confusing/unclear in this line of research or something which excites you about this problem which you would like to have a detailed account on. submitted by /u/Evening_Cranberry383 [link] [comments]  ( 2 min )
    [D] Coupling classification and regression
    I'm trying to build a deep learning architecture that can do classification and regression of precipitation events. To be precise I want to classify precipitation into one of None, Rain or Snow, and simultaneously estimate its intensity i.e. the precipitation rate, using regression. Obviously I can simply build a head that has outputs of class scores for the classification and an ouput for the regression value. However, the labels are coupled in the sense that the class None implies an intensity of zero in the regression output and vice versa. I think modelling this coupling may improve the accuracy and prevent contradicting outputs. What kind of coupling could be useful for this problem? Are there any similar ideas in the literature? I tried to search for them but I don't know whether there is a relevant keyword that would make it easier. Any pointers are appreciated. submitted by /u/tensolution [link] [comments]  ( 3 min )
    [R] Data Twinning
    We recently developed a fast algorithm to partition datasets into statistically similar twin sets. The algorithm can be used to generate optimal training-testing splits, k-fold cross validation sets, for data compression, e.t.c. Twinning will reduce the uncertainty that comes with random splits, without introducing bias. Further details on the algorithm and its applications are provided in the article: Data Twinning The R package for twinning can be installed from CRAN, and the Python module from GitHub. Hope it turns out useful to you! submitted by /u/cppoverc [link] [comments]  ( 3 min )
    [D] DeepSpeed vs PyTorch native API
    DeepSpeed provides data, model and pipeline parallelism which makes it possible to train really large models. They also provide other related resources like ZeRO optimizer. Most of these are now available on PyTorch as well. Are there valid reasons to use DeepSpeed still? submitted by /u/mlvpj [link] [comments]  ( 1 min )
    [D] Deploying GPT-NeoX 20B: lessons learned and a focus on Deepspeed
    Hello all, Deploying and using GPT-NeoX 20B reliably in production has been quite a challenge. You basically have 2 choices: have it run on a single huge GPU, or on multiple smaller GPUs. Here are a couple of lessons I learned during this interesting journey: https://nlpcloud.io/deploying-gpt-neox-20-production-focus-deepspeed.html If some of you used a different strategy, I'd love to hear about it. Also, if you have an idea about how to perform batch inference on GPT-NeoX 20B, I'm interested in hearing your thoughts. Thanks to EleutherAI for their amazing work. Can't wait to see what's next! submitted by /u/juliensalinas [link] [comments]  ( 1 min )
    [D] What are the standard ways to visualise, clean and manage the large image datasets?
    Hi everyone, I'm working on a computer vision problem that need large dataset, I scrapped multiple images of different categories and it's attributes/metadata like image name, url etc. from different websites. While scrapping the data, I wrote the scrapping scripts for different websites and saved the data from individual websites with the textual attributes/metadata of the images saved in the CSV file. Now, since I completed the scrapping from different websites, I want to merge all the downloaded images and its metadata in a central location for further preprocessing and cleaning. I'm wondering what are the de facto standard of managing large image dataset (>100K images) such that I can clean the data manually or programmatically while also keeping track of the previous states? When I say clean the data, this means dividing the images in different categories, removing the invalid images, performing transformations on the images etc. while doing all this, we also have to maintain the attributes/metadata of the images such that we can keep track of all this and at the same time the data is presentable and viewable through web or something to technical and non-technical stack holders so that they can keep a note of what's happening with the data. One thing I can think of is creating relational database, then ingest all the data from the CSV to that database, while maintaining some kind of flags to keep track of the valid or invalid images and its categories based on the data preprocessing steps. But I've no clue about how to make the data accessible/viewable to non-technical stackholders! I think there must be some better approach for this case! I'd really appreciate if you could share your experiences and expertise in dealing with such situations. Thank you so much. submitted by /u/goal_it [link] [comments]  ( 2 min )
    [R] DiffusionNet: Geometric Deep Learning
    submitted by /u/aidev2040 [link] [comments]
    [R] Multi-Stage Progressive Image Restoration
    submitted by /u/aidev2040 [link] [comments]
    [R] Transformer Memory as a Differentiable Search Index
    submitted by /u/hardmaru [link] [comments]
  • Open

    Accelerating fusion science through learned plasma control
    I wanted to share this blogpost on a recent paper of mine. It talks a bit through getting deep reinforcement learning to learn on a piece of hardware. If you have any questions left on the practicalities of reinforcement learning, feel free to AMA! https://www.deepmind.com/blog/article/Accelerating-fusion-science-through-learned-plasma-control submitted by /u/PeedLearning [link] [comments]  ( 1 min )
    For continuous output spaces w/ PPO, why is the std dev sometimes a function of the input and sometimes not?
    Hey, Is there any intuition around when we would want to learn std dev as a layer connected to our learned latent space vs as a separate Parameter space? In some PPO implementations where the policy outputs a tensor of normal distributions (e.g. in continuous output spaces), sometimes the std dev is a learned parameter but it is not a function of the input, e.g. https://github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/ppo/core.py#L85 In other cases, the core network will output both the mean + std. Thanks! submitted by /u/Deathcalibur [link] [comments]  ( 1 min )
    Resources for Convergence Results?
    Hi, I am wondering if anyone has some good resources that contain the fundamental convergence proofs for learning with either linear or non-linear function approximation? Thanks! submitted by /u/BrianBadondeBrrrr [link] [comments]
    Conditional Actions in RL
    Hellooo! I am working with an RL algorithm (actor-critic) with two actions as outputs [w1, w2] each takes values between -1 and 1. These two actions will be used in our target function as weights : TA = w1*P1 + w2*P2 then the reward will be computed based on TA. The question is how can I make a condition on values of w1 and w2 so that TA always takes positive values? submitted by /u/GuavaAgreeable208 [link] [comments]  ( 1 min )
    Different results for neural network training with every repetition
    Hi everyone, ​ I'm training a neural network on an image set to calculate Q values. I'm not doing this alternately with the evaluation, but with a set of saved states (about 6800 training examples and 680 test examples) from an already conducted reinforcement learning. This is to test how good a neural network will eventually become in this specific case. ​ A problem is that the results differ very strongly, when repeating the same process with ADAM. This naturally comes from the stochastic process, but the main problem is, that the training sometimes gets stuck at different points. I will show examples of 5 training processes in the following pictures. The legend for the pictures is: ​ black lines - training data blue lines - validation data purple lines - test data solid lines -…  ( 2 min )
    Do environments like OpenAI Gym Cartpole , Pendulum , Mountain have discrete or continous state-action space ? Can some one expplain.
    submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
    How does the actor(policy) differ in Actor-critic methods and DDPG algorithm. What is the output of the policy in both ?
    DDPG works in continous action space . What exactly does the policy gives in both of them (i.e is action continous, discrete,etc.) ? submitted by /u/aabra__ka__daabra [link] [comments]  ( 1 min )
    MIT Researchers Propose a New Deep Reinforcement Learning Algorithm Trained to Optimize Doses of Propofol to Maintain Unconsciousness During General Anesthesia
    A team of neuroscientists, engineers, and physicians showed a machine learning system for constantly automating propofol administration in a special issue of Artificial Intelligence in Medicine. The algorithm outperformed more traditional software in sophisticated, physiology-based simulations of patients using an application of deep reinforcement learning. The software’s neural networks simultaneously learned how to maintain unconsciousness and critique the efficacy of their own actions. It also nearly matched genuine anesthesiologists’ performance when demonstrating what it would take to maintain unconsciousness given data from nine actual procedures. The algorithm’s advances increase the feasibility for computers to maintain patient unconsciousness with no more drug than is needed. Hence, freeing up anesthesiologists for all of the other responsibilities in the operating room, such as ensuring patients remain immobile, experience no pain, remain stable, and receive adequate oxygen. Continue Reading Paper: https://www.sciencedirect.com/science/article/pii/S0933365721002207?via%3Dihub submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Do next level photoshop using this AI model! Remove objects in any video! 😍😍😮
    submitted by /u/arxiv_code_test_b [link] [comments]  ( 1 min )
    Projects based on Reinforcement learning papers.
    We are asked to design a project based on some reinforcement learning papers published in the last 5 ‏‏‎ years. Any interesting paper recommendation on the same would be much appreciated TIA submitted by /u/AyoWhoGotTheChez [link] [comments]  ( 1 min )
  • Open

    Is the competition/cooperation between symbolic AI and statistical AI (ML) about historical approach to research / engineering, or is it more fundamentally about what intelligent agents "are"?
    I have found that comprehensive overviews of artificial intelligence (Wikipedia, SEP article, Norvig and Russel's AI: A Modern Approach) make reference to symbolic AI and statistical AI in their historical context of the former preceding the latter, their corresponding limitations etc. But I have found it really difficult to dissect this from the question of whether the divide / cooperation between these paradigms are about the implementation of engineering of intelligent agents, or if they are getting at something more fundamental about the space of possible minds (I use this term to be as broad as possible considering anything we would label as a mind, regardless of ontogeny, architecture, physical components etc)? I have given a list of questions below, but some of them are mutually e…  ( 3 min )
    UAV Mapping System for Agricultural Field Surveying
    I am currently reading a paper on UAV Mapping System for Agricultural Field Surveying and was curious on what AI technologies does it use (e.g., model-based diagnosis, belief networks,semantic networks, heuristic search, constraint satisfaction search, regression) or something else?? And also why is it intelligent or what aspect makes it intelligent?. Link to paper: https://www.mdpi.com/1424-8220/17/12/2703/htm submitted by /u/Ok_Acanthisitta5478 [link] [comments]
    Does "Person of Interest" represent a good example of Artfificial General Intelligence?
    I won't comment other than say I certainly hope so. submitted by /u/luigirovatti1 [link] [comments]
    Size of a DNN modeling the universal XOR
    Suppose I wish to model the universal XOR function with a deep neural network. I have come accross an explanation saying that the number of neurons needed is 3(N-1) which can be arranged in 2log_2(n) layers. Now how can I see that going below this minimum number of layer can result in an exponential size of the network? submitted by /u/laika00 [link] [comments]  ( 1 min )
    Recognizing Chemical Formulas from Research Papers Using a Transformer-Based Artificial Neural Network
    In the last few years, deep learning has been playing an integral role in various scientific and technology areas. This development promotes AI-based tools that can help us with information retrieval. Modern deep learning technologies, which typically demand enormous volumes of qualitative data for neural network training, will also transform chemistry. The good news is that chemical data holds up well over time. Even though a molecule was first synthesized more than a century ago, information regarding its structure, characteristics, and synthesis methods is still helpful today. Even in this age of universal digitalization, an organic chemist may turn to a thesis from a library collection for information about a poorly studied molecule published as far back as the early twentieth century. Continue Reading Paper: https://chemistry-europe.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/cmtd.202100069 https://preview.redd.it/wds1kdd3ofi81.png?width=1322&format=png&auto=webp&s=f34608cea7eb7efe1699a43187f92d919997a9cd submitted by /u/ai-lover [link] [comments]  ( 1 min )
    New type of AI will save from stroke and protect against Parkinson’s disease
    submitted by /u/bendee983 [link] [comments]
    Day 3 of posting AI generated motivational/inspirational quotes
    submitted by /u/ScottBakerJr [link] [comments]
    6 Best Artificial Intelligence courses for Healthcare You should know 2022
    submitted by /u/sivasiriyapureddy [link] [comments]
    Looking for AI blog / content writing tools?
    Hey there, I am working on a book regarding self development / personal skills etc. I was wondering what resources might be available and the process, if I write out a book outline, are there AI tools that can put together a first draft / help generate similar topics? Are there topology searches (eg similar content) / Latent Semantic Indexing resources (the keywords that most frequently occur around the current list of words). Also, I use Grammarly to correct / proofread - are there better AI tools for that, and/or ways to better articulate the same concept? POORLY WORDED We all need to ensure that moving forward, that all parties were properly briefed and aware of the requirements as specified by the govt departments. AN IMPROVEMENT The Govt departments will need to properly brief everyone, and make sure they understand. submitted by /u/jaybestnz [link] [comments]  ( 1 min )
    When in doubt, be forgettable [inspirobot]
    submitted by /u/Bozzooo [link] [comments]
    Leveraging data science and AI to promote social justice, sustainability and equity
    "Technology itself, and the data we collect on a daily basis, is not inherently ‘good’ or ‘bad’. However, as humans, we can program our biases into the technology we create. Existing racial, social, and gender biases can be programmed into the algorithms we develop, often unconsciously, or the data that are available to train algorithms, biased in nature, inevitably affects the outputs...The concept of equitable AI reflects the notion that AI systems should be designed in a way that ensures human or social biases do not translate into algorithms. This ethical approach to designing and implementing AI systems is vital due to the significant risks associated with leaving these biased AI algorithms unchecked. The AI market is projected to grow significantly over the coming years." Read the full article here: https://www.responsible-investor.com/articles/leveraging-data-science-and-ai-to-promote-social-justice-sustainability-and-equity submitted by /u/ImpactMapper [link] [comments]  ( 1 min )
    Facial recognition firm Clearview AI tells investors it’s seeking massive expansion beyond law enforcement
    submitted by /u/Sorin61 [link] [comments]  ( 1 min )
    MIT Researchers Propose a New Deep Reinforcement Learning Algorithm Trained to Optimize Doses of Propofol to Maintain Unconsciousness During General Anesthesia
    A team of neuroscientists, engineers, and physicians showed a machine learning system for constantly automating propofol administration in a special issue of Artificial Intelligence in Medicine. The algorithm outperformed more traditional software in sophisticated, physiology-based simulations of patients using an application of deep reinforcement learning. The software’s neural networks simultaneously learned how to maintain unconsciousness and critique the efficacy of their own actions. It also nearly matched genuine anesthesiologists’ performance when demonstrating what it would take to maintain unconsciousness given data from nine actual procedures. The algorithm’s advances increase the feasibility for computers to maintain patient unconsciousness with no more drug than is needed. Hence, freeing up anesthesiologists for all of the other responsibilities in the operating room, such as ensuring patients remain immobile, experience no pain, remain stable, and receive adequate oxygen. Continue Reading Paper: https://www.sciencedirect.com/science/article/pii/S0933365721002207?via%3Dihub submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Artificial Nightmares : Skin-walkers || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    Machine Learning for Mechanical Ventilation Control
    Posted by Daniel Suo, Software Engineer and Elad Hazan, Research Scientist, Google Research, on behalf of the Google AI Princeton Team Mechanical ventilators provide critical support for patients who have difficulty breathing or are unable to breathe on their own. They see frequent use in scenarios ranging from routine anesthesia, to neonatal intensive care and life support during the COVID-19 pandemic. A typical ventilator consists of a compressed air source, valves to control the flow of air into and out of the lungs, and a "respiratory circuit" that connects the ventilator to the patient. In some cases, a sedated patient may be connected to the ventilator via a tube inserted through the trachea to their lungs, a process called invasive ventilation. A mechanical ventilator takes br…  ( 9 min )
    The Balloon Learning Environment
    Posted by Joshua Greaves, Software Engineer and Pablo Samuel Castro, Staff Software Engineer, Google Research, Brain Team Benchmark challenges have been a driving force in the advancement of machine learning (ML). In particular, difficult benchmark environments for reinforcement learning (RL) have been crucial for the rapid progress of the field by challenging researchers to overcome increasingly difficult tasks. The Arcade Learning Environment, Mujoco, and others have been used to push the envelope in RL algorithms, representation learning, exploration, and more. In “Autonomous Navigation of Stratospheric Balloons Using Reinforcement Learning”, published in Nature two years ago, we demonstrated how deep RL can be used to create a high-performing flight agent that can control stratosph…  ( 6 min )
  • Open

    Recognizing Chemical Formulas from Research Papers Using a Transformer-Based Artificial Neural Network
    In the last few years, deep learning has been playing an integral role in various scientific and technology areas. This development promotes AI-based tools that can help us with information retrieval. Modern deep learning technologies, which typically demand enormous volumes of qualitative data for neural network training, will also transform chemistry. The good news is that chemical data holds up well over time. Even though a molecule was first synthesized more than a century ago, information regarding its structure, characteristics, and synthesis methods is still helpful today. Even in this age of universal digitalization, an organic chemist may turn to a thesis from a library collection for information about a poorly studied molecule published as far back as the early twentieth century. Continue Reading Paper: https://chemistry-europe.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/cmtd.202100069 https://preview.redd.it/1scxntw2ofi81.png?width=1322&format=png&auto=webp&s=6bf1d9e1fd2bee051c07726ee89bee4dbbbbb675 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Equal Width vs Equal Depth/Frequency
    Hi! Im building my first CNN and I was wondering which binning method is preferred for continuous data like age. What binning or discretization method is better? I see a lot of definitions but no reason to use one over the other Thanks! submitted by /u/Skinnybisquit [link] [comments]  ( 1 min )
  • Open

    What use is mental math in 2022?
    Now that most people are carrying around a powerful computer in their pocket, what use is it to be able to do math in your head? Here’s something I’ve noticed lately: being able to do quick approximations in mid-conversation is a superpower. When I’m on Zoom with a client, I can’t say “Excuse me a […] What use is mental math in 2022? first appeared on John D. Cook.  ( 2 min )
    Missing Morse codes
    Morse codes for Latin letters are sequences of between one and four symbols, where each symbol is a dot or a dash. There are 2 possible sequences with one symbol, 4 with two symbols, 8 with three symbols, and 16 with four symbols. This makes a total of 30 sequences with up to four symbols. […] Missing Morse codes first appeared on John D. Cook.  ( 2 min )
  • Open

    5 Easy Steps to Choosing a Great Data Visualization Platform for Your Business
    In the last two decades, the web and digital technologies have become inseparable parts of business operations globally. It is not only just using the internet and applications for enhancing workflow; businesses are relying on advanced and customised software tailor-made for their needs. They are also using diverse types of web-based services to reach out… Read More »5 Easy Steps to Choosing a Great Data Visualization Platform for Your Business The post 5 Easy Steps to Choosing a Great Data Visualization Platform for Your Business appeared first on Data Science Central.  ( 5 min )
  • Open

    Bringing Novel Idea to Life, NVIDIA Artists Create Retro Writer’s Room in Omniverse With ‘The Storyteller’
    Real-time rendering and photorealistic graphics used to be tall tales, but NVIDIA Omniverse has made them fact from fiction. NVIDIA’s own artists are writing new chapters in Omniverse, an accelerated 3D design platform that connects and enhances 3D apps and creative workflows, to showcase these stories. Combined with the NVIDIA Studio platform, Omniverse and Studio-validated Read article > The post Bringing Novel Idea to Life, NVIDIA Artists Create Retro Writer’s Room in Omniverse With ‘The Storyteller’ appeared first on The Official NVIDIA Blog.  ( 3 min )
    What Is Edge AI and How Does It Work?
    Recent strides in the efficacy of AI, the adoption of IoT devices and the power of edge computing have come together to unlock the power of edge AI. This has opened new opportunities for edge AI that were previously unimaginable — from helping radiologists identify pathologies in the hospital, to driving cars down the freeway, Read article > The post What Is Edge AI and How Does It Work? appeared first on The Official NVIDIA Blog.  ( 6 min )
    Performance You Can Feel: Putting GeForce NOW RTX 3080 Membership’s Ultra-Low Latency to the Test This GFN Thursday
    GeForce NOW’s RTX 3080 membership is the next generation of cloud gaming. This GFN Thursday looks at one of the tier’s major benefits: ultra-low-latency streaming from the cloud. This week also brings a new app update that lets members log in via Discord, a members-only World of Warships reward and eight titles joining the GeForce Read article > The post Performance You Can Feel: Putting GeForce NOW RTX 3080 Membership’s Ultra-Low Latency to the Test This GFN Thursday appeared first on The Official NVIDIA Blog.  ( 4 min )
  • Open

    What are Artificial Neural Network (ANN)?
    An introduction to Artificial neural networks  ( 4 min )
    What are Convolutional Neural Network (CNN)?
    In the previous article I have talked a lot about deep learning, neural networks, and types of them but today in this article we will learn…  ( 3 min )
  • Open

    User-Oriented Robust Reinforcement Learning. (arXiv:2202.07301v1 [cs.LG])
    Recently, improving the robustness of policies across different environments attracts increasing attention in the reinforcement learning (RL) community. Existing robust RL methods mostly aim to achieve the max-min robustness by optimizing the policy's performance in the worst-case environment. However, in practice, a user that uses an RL policy may have different preferences over its performance across environments. Clearly, the aforementioned max-min robustness is oftentimes too conservative to satisfy user preference. Therefore, in this paper, we integrate user preference into policy learning in robust RL, and propose a novel User-Oriented Robust RL (UOR-RL) framework. Specifically, we define a new User-Oriented Robustness (UOR) metric for RL, which allocates different weights to the environments according to user preference and generalizes the max-min robustness metric. To optimize the UOR metric, we develop two different UOR-RL training algorithms for the scenarios with or without a priori known environment distribution, respectively. Theoretically, we prove that our UOR-RL training algorithms converge to near-optimal policies even with inaccurate or completely no knowledge about the environment distribution. Furthermore, we carry out extensive experimental evaluations in 4 MuJoCo tasks. The experimental results demonstrate that UOR-RL is comparable to the state-of-the-art baselines under the average and worst-case performance metrics, and more importantly establishes new state-of-the-art performance under the UOR metric.  ( 2 min )
    Exploiting Data Sparsity in Secure Cross-Platform Social Recommendation. (arXiv:2202.07253v1 [cs.LG])
    Social recommendation has shown promising improvements over traditional systems since it leverages social correlation data as an additional input. Most existing work assumes that all data are available to the recommendation platform. However, in practice, user-item interaction data (e.g.,rating) and user-user social data are usually generated by different platforms, and both of which contain sensitive information. Therefore, "How to perform secure and efficient social recommendation across different platforms, where the data are highly-sparse in nature" remains an important challenge. In this work, we bring secure computation techniques into social recommendation, and propose S3Rec, a sparsity-aware secure cross-platform social recommendation framework. As a result, our model can not only improve the recommendation performance of the rating platform by incorporating the sparse social data on the social platform, but also protect data privacy of both platforms. Moreover, to further improve model training efficiency, we propose two secure sparse matrix multiplication protocols based on homomorphic encryption and private information retrieval. Our experiments on two benchmark datasets demonstrate the effectiveness of S3Rec.  ( 2 min )
    Navigating Local Minima in Quantized Spiking Neural Networks. (arXiv:2202.07221v1 [cs.LG])
    Spiking and Quantized Neural Networks (NNs) are becoming exceedingly important for hyper-efficient implementations of Deep Learning (DL) algorithms. However, these networks face challenges when trained using error backpropagation, due to the absence of gradient signals when applying hard thresholds. The broadly accepted trick to overcoming this is through the use of biased gradient estimators: surrogate gradients which approximate thresholding in Spiking Neural Networks (SNNs), and Straight-Through Estimators (STEs), which completely bypass thresholding in Quantized Neural Networks (QNNs). While noisy gradient feedback has enabled reasonable performance on simple supervised learning tasks, it is thought that such noise increases the difficulty of finding optima in loss landscapes, especially during the later stages of optimization. By periodically boosting the Learning Rate (LR) during training, we expect the network can navigate unexplored solution spaces that would otherwise be difficult to reach due to local minima, barriers, or flat surfaces. This paper presents a systematic evaluation of a cosine-annealed LR schedule coupled with weight-independent adaptive moment estimation as applied to Quantized SNNs (QSNNs). We provide a rigorous empirical evaluation of this technique on high precision and 4-bit quantized SNNs across three datasets, demonstrating (close to) state-of-the-art performance on the more complex datasets. Our source code is available at this link: https://github.com/jeshraghian/QSNNs.  ( 2 min )
    REPID: Regional Effect Plots with implicit Interaction Detection. (arXiv:2202.07254v1 [stat.ML])
    Machine learning models can automatically learn complex relationships, such as non-linear and interaction effects. Interpretable machine learning methods such as partial dependence plots visualize marginal feature effects but may lead to misleading interpretations when feature interactions are present. Hence, employing additional methods that can detect and measure the strength of interactions is paramount to better understand the inner workings of machine learning models. We demonstrate several drawbacks of existing global interaction detection approaches, characterize them theoretically, and evaluate them empirically. Furthermore, we introduce regional effect plots with implicit interaction detection, a novel framework to detect interactions between a feature of interest and other features. The framework also quantifies the strength of interactions and provides interpretable and distinct regions in which feature effects can be interpreted more reliably, as they are less confounded by interactions. We prove the theoretical eligibility of our method and show its applicability on various simulation and real-world examples.  ( 2 min )
    Explaining Reject Options of Learning Vector Quantization Classifiers. (arXiv:2202.07244v1 [cs.LG])
    While machine learning models are usually assumed to always output a prediction, there also exist extensions in the form of reject options which allow the model to reject inputs where only a prediction with an unacceptably low certainty would be possible. With the ongoing rise of eXplainable AI, a lot of methods for explaining model predictions have been developed. However, understanding why a given input was rejected, instead of being classified by the model, is also of interest. Surprisingly, explanations of rejects have not been considered so far. We propose to use counterfactual explanations for explaining rejects and investigate how to efficiently compute counterfactual explanations of different reject options for an important class of models, namely prototype-based classifiers such as learning vector quantization models.  ( 2 min )
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v1 [cs.LG])
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.  ( 2 min )
    Scaling up Ranking under Constraints for Live Recommendations by Replacing Optimization with Prediction. (arXiv:2202.07088v1 [cs.IR])
    Many important multiple-objective decision problems can be cast within the framework of ranking under constraints and solved via a weighted bipartite matching linear program. Some of these optimization problems, such as personalized content recommendations, may need to be solved in real time and thus must comply with strict time requirements to prevent the perception of latency by consumers. Classical linear programming is too computationally inefficient for such settings. We propose a novel approach to scale up ranking under constraints by replacing the weighted bipartite matching optimization with a prediction problem in the algorithm deployment stage. We show empirically that the proposed approximate solution to the ranking problem leads to a major reduction in required computing resources without much sacrifice in constraint compliance and achieved utility, allowing us to solve larger constrained ranking problems real-time, within the required 50 milliseconds, than previously reported.  ( 2 min )
    Asymptotically Unbiased Estimation for Delayed Feedback Modeling via Label Correction. (arXiv:2202.06472v2 [cs.LG] UPDATED)
    Alleviating the delayed feedback problem is of crucial importance for the conversion rate(CVR) prediction in online advertising. Previous delayed feedback modeling methods using an observation window to balance the trade-off between waiting for accurate labels and consuming fresh feedback. Moreover, to estimate CVR upon the freshly observed but biased distribution with fake negatives, the importance sampling is widely used to reduce the distribution bias. While effective, we argue that previous approaches falsely treat fake negative samples as real negative during the importance weighting and have not fully utilized the observed positive samples, leading to suboptimal performance. In this work, we propose a new method, DElayed Feedback modeling with UnbiaSed Estimation, (DEFUSE), which aim to respectively correct the importance weights of the immediate positive, the fake negative, the real negative, and the delay positive samples at finer granularity. Specifically, we propose a two-step optimization approach that first infers the probability of fake negatives among observed negatives before applying importance sampling. To fully exploit the ground-truth immediate positives from the observed distribution, we further develop a bi-distribution modeling framework to jointly model the unbiased immediate positives and the biased delay conversions. Experimental results on both public and our industrial datasets validate the superiority of DEFUSE. Codes are available at https://github.com/ychen216/DEFUSE.git.  ( 2 min )
    Eliciting Best Practices for Collaboration with Computational Notebooks. (arXiv:2202.07233v1 [cs.HC])
    Despite the widespread adoption of computational notebooks, little is known about best practices for their usage in collaborative contexts. In this paper, we fill this gap by eliciting a catalog of best practices for collaborative data science with computational notebooks. With this aim, we first look for best practices through a multivocal literature review. Then, we conduct interviews with professional data scientists to assess their awareness of these best practices. Finally, we assess the adoption of best practices through the analysis of 1,380 Jupyter notebooks retrieved from the Kaggle platform. Findings reveal that experts are mostly aware of the best practices and tend to adopt them in their daily work. Nonetheless, they do not consistently follow all the recommendations as, depending on specific contexts, some are deemed unfeasible or counterproductive due to the lack of proper tool support. As such, we envision the design of notebook solutions that allow data scientists not to have to prioritize exploration and rapid prototyping over writing code of quality.  ( 2 min )
    Look back, look around: a systematic analysis of effective predictors for new outlinks in focused Web crawling. (arXiv:2111.05062v3 [cs.LG] UPDATED)
    Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly, dynamic network) features, we identify best predictors for new outlinks. Our main conclusion is that most informative features are the recent history of new outlinks on a page itself, and on its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model, that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to pages with most new outlinks, and compare their performance. Interestingly, the LBLA approach proved extremely effective, outperforming even the models that use a most complete set of features. One of the learners we use, is the recent NGBoost method that assumes a Poisson distribution for the number of new outlinks on a page, and learns its parameters. This connects the two so far unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modeling. All experiments were carried out on an original dataset, made available by a commercial focused crawler.  ( 3 min )
    Exploring convolutional neural networks with transfer learning for diagnosing Lyme disease from skin lesion images. (arXiv:2106.14465v2 [eess.IV] UPDATED)
    Lyme disease which is one of the most common infectious vector-borne diseases manifests itself in most cases with erythema migrans (EM) skin lesions. Recent studies show that convolutional neural networks (CNNs) perform well to identify skin lesions from images. Lightweight CNN based pre-scanner applications for resource-constrained mobile devices can help users with early diagnosis of Lyme disease and prevent the transition to a severe late form thanks to appropriate antibiotic therapy. Also, resource-intensive CNN based robust computer applications can assist non-expert practitioners with an accurate diagnosis. The main objective of this study is to extensively analyze the effectiveness of CNNs for diagnosing Lyme disease from images and to find out the best CNN architectures considering resource constraints. First, we created an EM dataset with the help of expert dermatologists from Clermont-Ferrand University Hospital Center of France. Second, we benchmarked this dataset for twenty-three CNN architectures customized from VGG, ResNet, DenseNet, MobileNet, Xception, NASNet, and EfficientNet architectures in terms of predictive performance, computational complexity, and statistical significance. Third, to improve the performance of the CNNs, we used custom transfer learning from ImageNet pre-trained models as well as pre-trained the CNNs with the skin lesion dataset HAM10000. Fourth, for model explainability, we utilized Gradient-weighted Class Activation Mapping to visualize the regions of input that are significant to the CNNs for making predictions. Fifth, we provided guidelines for model selection based on predictive performance and computational complexity.  ( 3 min )
    A Unified Perspective on Value Backup and Exploration in Monte-Carlo Tree Search. (arXiv:2202.07071v1 [cs.AI])
    Monte-Carlo Tree Search (MCTS) is a class of methods for solving complex decision-making problems through the synergy of Monte-Carlo planning and Reinforcement Learning (RL). The highly combinatorial nature of the problems commonly addressed by MCTS requires the use of efficient exploration strategies for navigating the planning tree and quickly convergent value backup methods. These crucial problems are particularly evident in recent advances that combine MCTS with deep neural networks for function approximation. In this work, we propose two methods for improving the convergence rate and exploration based on a newly introduced backup operator and entropy regularization. We provide strong theoretical guarantees to bound convergence rate, approximation error, and regret of our methods. Moreover, we introduce a mathematical framework based on the use of the $\alpha$-divergence for backup and exploration in MCTS. We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework, allowing to obtain different methods by simply changing the value of $\alpha$. In practice, our unified perspective offers a flexible way to balance between exploration and exploitation by tuning the single $\alpha$ parameter according to the problem at hand. We validate our methods through a rigorous empirical study from basic toy problems to the complex Atari games, and including both MDP and POMDP problems.  ( 2 min )
    A Survey on Dynamic Neural Networks for Natural Language Processing. (arXiv:2202.07101v1 [cs.CL])
    Effectively scaling large Transformer models is a main driver of recent advances in natural language processing. Dynamic neural networks, as an emerging research direction, are capable of scaling up neural networks with sub-linear increases in computation and time by dynamically adjusting their computational path based on the input. Dynamic neural networks could be a promising solution to the growing parameter numbers of pretrained language models, allowing both model pretraining with trillions of parameters and faster inference on mobile devices. In this survey, we summarize progress of three types of dynamic neural networks in NLP: skimming, mixture of experts, and early exit. We also highlight current challenges in dynamic neural networks and directions for future research.  ( 2 min )
    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression. (arXiv:2109.00046v2 [stat.ML] UPDATED)
    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to the high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.  ( 2 min )
    Holistic Adversarial Robustness of Deep Learning Models. (arXiv:2202.07201v1 [cs.LG])
    Adversarial robustness studies the worst-case performance of a machine learning model to ensure safety and reliability. With the proliferation of deep-learning based technology, the potential risks associated with model development and deployment can be amplified and become dreadful vulnerabilities. This paper provides a comprehensive overview of research topics and foundational principles of research methods for adversarial robustness of deep learning models, including attacks, defenses, verification, and novel applications.  ( 2 min )
    G-Mixup: Graph Data Augmentation for Graph Classification. (arXiv:2202.07179v1 [cs.LG])
    This work develops \emph{mixup for graph data}. Mixup has shown superiority in improving the generalization and robustness of neural networks by interpolating features and labels between two random samples. Traditionally, Mixup can work on regular, grid-like, and Euclidean data such as image or tabular data. However, it is challenging to directly adopt Mixup to augment graph data because different graphs typically: 1) have different numbers of nodes; 2) are not readily aligned; and 3) have unique typologies in non-Euclidean space. To this end, we propose $\mathcal{G}$-Mixup to augment graphs for graph classification by interpolating the generator (i.e., graphon) of different classes of graphs. Specifically, we first use graphs within the same class to estimate a graphon. Then, instead of directly manipulating graphs, we interpolate graphons of different classes in the Euclidean space to get mixed graphons, where the synthetic graphs are generated through sampling based on the mixed graphons. Extensive experiments show that $\mathcal{G}$-Mixup substantially improves the generalization and robustness of GNNs.  ( 2 min )
    iRNN: Integer-only Recurrent Neural Network. (arXiv:2109.09828v2 [cs.LG] UPDATED)
    Recurrent neural networks (RNN) are used in many real-world text and speech applications. They include complex modules such as recurrence, exponential-based activation, gate interaction, unfoldable normalization, bi-directional dependence, and attention. The interaction between these elements prevents running them on integer-only operations without a significant performance drop. Deploying RNNs that include layer normalization and attention on integer-only arithmetic is still an open problem. We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN). Our approach supports layer normalization, attention, and an adaptive piecewise linear approximation of activations (PWL), to serve a wide range of RNNs on various applications. The proposed method is proven to work on RNN-based language models and challenging automatic speech recognition, enabling AI applications on the edge. Our iRNN maintains similar performance as its full-precision counterpart, their deployment on smartphones improves the runtime performance by $2\times$, and reduces the model size by $4\times$.  ( 2 min )
    Conformal Prediction Sets with Limited False Positives. (arXiv:2202.07650v1 [cs.LG])
    We develop a new approach to multi-label conformal prediction in which we aim to output a precise set of promising prediction candidates with a bounded number of incorrect answers. Standard conformal prediction provides the ability to adapt to model uncertainty by constructing a calibrated candidate set in place of a single prediction, with guarantees that the set contains the correct answer with high probability. In order to obey this coverage property, however, conformal sets can become inundated with noisy candidates -- which can render them unhelpful in practice. This is particularly relevant to practical applications where there is a limited budget, and the cost (monetary or otherwise) associated with false positives is non-negligible. We propose to trade coverage for a notion of precision by enforcing that the presence of incorrect candidates in the predicted conformal sets (i.e., the total number of false positives) is bounded according to a user-specified tolerance. Subject to this constraint, our algorithm then optimizes for a generalized notion of set coverage (i.e., the true positive rate) that allows for any number of true answers for a given query (including zero). We demonstrate the effectiveness of this approach across a number of classification tasks in natural language processing, computer vision, and computational chemistry.  ( 2 min )
    Debiased Pseudo Labeling in Self-Training. (arXiv:2202.07136v1 [cs.LG])
    Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets. However, large-scale annotations are time-consuming and labor-exhaustive to obtain on realistic tasks. To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data. Despite its popularity, pseudo labeling is well-believed to be unreliable and often leads to training instability. Our experimental studies further reveal that the performance of self-training is biased due to data sampling, pre-trained models, and training strategies, especially the inappropriate utilization of pseudo labels. To this end, we propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads. To further improve the quality of pseudo labels, we introduce a worst-case estimation of pseudo labeling and seamlessly optimize the representations to avoid the worst-case. Extensive experiments justify that the proposed Debiased not only yields an average improvement of $14.4$\% against state-of-the-art algorithms on $11$ tasks (covering generic object recognition, fine-grained object recognition, texture classification, and scene classification) but also helps stabilize training and balance performance across classes.  ( 2 min )
    Approximate Midpoint Policy Iteration for Linear Quadratic Control. (arXiv:2011.14212v3 [math.OC] UPDATED)
    We present a midpoint policy iteration algorithm to solve linear quadratic optimal control problems in both model-based and model-free settings. The algorithm is a variation of Newton's method, and we show that in the model-based setting it achieves cubic convergence, which is superior to standard policy iteration and policy gradient algorithms that achieve quadratic and linear convergence, respectively. We also demonstrate that the algorithm can be approximately implemented without knowledge of the dynamics model by using least-squares estimates of the state-action value function from trajectory data, from which policy improvements can be obtained. With sufficient trajectory data, the policy iterates converge cubically to approximately optimal policies, and this occurs with the same available sample budget as the approximate standard policy iteration. Numerical experiments demonstrate effectiveness of the proposed algorithms.  ( 2 min )
    A Guide to Computational Reproducibility in Signal Processing and Machine Learning. (arXiv:2108.12383v2 [eess.SP] UPDATED)
    Computational reproducibility is a growing problem that has been extensively studied among computational researchers and within the signal processing and machine learning research community. However, with the changing landscape of signal processing and machine learning research come new obstacles and unseen challenges in creating reproducible experiments. Due to these new challenges most computational experiments have become difficult, if not impossible, to be reproduced by an independent researcher. In 2016 a survey conducted by the journal Nature found that 50% of researchers were unable to reproduce their own experiments. While the issue of computational reproducibility has been discussed in the literature and specifically within the signal processing community, it is still unclear to most researchers what are the best practices to ensure reproducibility without impinging on their primary responsibility of conducting research. We feel that although researchers understand the importance of making experiments reproducible, the lack of a clear set of standards and tools makes it difficult to incorporate good reproducibility practices in most labs. It is in this regard that we aim to present signal processing researchers with a set of practical tools and strategies that can help mitigate many of the obstacles to producing reproducible computational experiments.  ( 2 min )
    Constrained Optimization Involving Nonconvex $\ell_p$ Norms: Optimality Conditions, Algorithm and Convergence. (arXiv:2110.14127v2 [math.OC] UPDATED)
    This paper investigates the optimality conditions for characterizing the local minimizers of the constrained optimization problems involving an $\ell_p$ norm ($0<p<1$) of the variables, which may appear in either the objective or the constraint. This kind of problems have strong applicability to a wide range of areas since usually the $\ell_p$ norm can promote sparse solutions. However, the nonsmooth and non-Lipschtiz nature of the $\ell_p$ norm often cause these problems difficult to analyze and solve. We provide the calculation of the subgradients of the $\ell_p$ norm and the normal cones of the $\ell_p$ ball. For both problems, we derive the first-order necessary conditions under various constraint qualifications. We also derive the sequential optimality conditions for both problems and study the conditions under which these conditions imply the first-order necessary conditions. We point out that the sequential optimality conditions can be easily satisfied for iteratively reweighted algorithms and show that the global convergence can be easily derived using sequential optimality conditions.  ( 2 min )
    Generating Disentangled Arguments with Prompts: A Simple Event Extraction Framework that Works. (arXiv:2110.04525v2 [cs.CL] UPDATED)
    Event Extraction bridges the gap between text and event signals. Based on the assumption of trigger-argument dependency, existing approaches have achieved state-of-the-art performance with expert-designed templates or complicated decoding constraints. In this paper, for the first time we introduce the prompt-based learning strategy to the domain of Event Extraction, which empowers the automatic exploitation of label semantics on both input and output sides. To validate the effectiveness of the proposed generative method, we conduct extensive experiments with 11 diverse baselines. Empirical results show that, in terms of F1 score on Argument Extraction, our simple architecture is stronger than any other generative counterpart and even competitive with algorithms that require template engineering. Regarding the measure of recall, it sets new overall records for both Argument and Trigger Extractions. We hereby recommend this framework to the community, with the code publicly available at https://git.io/GDAP.  ( 2 min )
    Lie Point Symmetry Data Augmentation for Neural PDE Solvers. (arXiv:2202.07643v1 [cs.LG])
    Neural networks are increasingly being used to solve partial differential equations (PDEs), replacing slower numerical solvers. However, a critical issue is that neural PDE solvers require high-quality ground truth data, which usually must come from the very solvers they are designed to replace. Thus, we are presented with a proverbial chicken-and-egg problem. In this paper, we present a method, which can partially alleviate this problem, by improving neural PDE solver sample complexity -- Lie point symmetry data augmentation (LPSDA). In the context of PDEs, it turns out that we are able to quantitatively derive an exhaustive list of data transformations, based on the Lie point symmetry group of the PDEs in question, something not possible in other application areas. We present this framework and demonstrate how it can easily be deployed to improve neural PDE solver sample complexity by an order of magnitude.  ( 2 min )
    Unreasonable Effectiveness of Last Hidden Layer Activations. (arXiv:2202.07342v1 [cs.LG])
    In standard Deep Neural Network (DNN) based classifiers, the general convention is to omit the activation function in the last (output) layer and directly apply the softmax function on the logits to get the probability scores of each class. In this type of architectures, the loss value of the classifier against any output class is directly proportional to the difference between the final probability score and the label value of the associated class. Standard White-box adversarial evasion attacks, whether targeted or untargeted, mainly try to exploit the gradient of the model loss function to craft adversarial samples and fool the model. In this study, we show both mathematically and experimentally that using some widely known activation functions in the output layer of the model with high temperature values has the effect of zeroing out the gradients for both targeted and untargeted attack cases, preventing attackers from exploiting the model's loss function to craft adversarial samples. We've experimentally verified the efficacy of our approach on MNIST (Digit), CIFAR10 datasets. Detailed experiments confirmed that our approach substantially improves robustness against gradient-based targeted and untargeted attack threats. And, we showed that the increased non-linearity at the output layer has some additional benefits against some other attack methods like Deepfool attack.  ( 2 min )
    DeepPAMM: Deep Piecewise Exponential Additive Mixed Models for Complex Hazard Structures in Survival Analysis. (arXiv:2202.07423v1 [stat.ML])
    Survival analysis (SA) is an active field of research that is concerned with time-to-event outcomes and is prevalent in many domains, particularly biomedical applications. Despite its importance, SA remains challenging due to small-scale data sets and complex outcome distributions, concealed by truncation and censoring processes. The piecewise exponential additive mixed model (PAMM) is a model class addressing many of these challenges, yet PAMMs are not applicable in high-dimensional feature settings or in the case of unstructured or multimodal data. We unify existing approaches by proposing DeepPAMM, a versatile deep learning framework that is well-founded from a statistical point of view, yet with enough flexibility for modeling complex hazard structures. We illustrate that DeepPAMM is competitive with other machine learning approaches with respect to predictive performance while maintaining interpretability through benchmark experiments and an extended case study.
    DeepSensor: Deep Learning Testing Framework Based on Neuron Sensitivity. (arXiv:2202.07464v1 [cs.LG])
    Despite impressive capabilities and outstanding performance, deep neural network(DNN) has captured increasing public concern for its security problem, due to frequent occurrence of erroneous behaviors. Therefore, it is necessary to conduct systematically testing before its deployment to real-world applications. Existing testing methods have provided fine-grained criteria based on neuron coverage and reached high exploratory degree of testing. But there is still a gap between the neuron coverage and model's robustness evaluation. To bridge the gap, we observed that neurons which change the activation value dramatically due to minor perturbation are prone to trigger incorrect corner cases. Motivated by it, we propose neuron sensitivity and develop a novel white-box testing framework for DNN, donated as DeepSensor. The number of sensitive neurons is maximized by particle swarm optimization, thus diverse corner cases could be triggered and neuron coverage be further improved when compared with baselines. Besides, considerable robustness enhancement can be reached when adopting testing examples based on neuron sensitivity for retraining. Extensive experiments implemented on scalable datasets and models can well demonstrate the testing effectiveness and robustness improvement of DeepSensor.
    Textless Speech Emotion Conversion using Discrete and Decomposed Representations. (arXiv:2111.07402v2 [cs.CL] UPDATED)
    Speech emotion conversion is the task of modifying the perceived emotion of a speech utterance while preserving the lexical content and speaker identity. In this study, we cast the problem of emotion conversion as a spoken language translation task. We use a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion. First, we modify the speech content by translating the phonetic-content units to a target emotion, and then predict the prosodic features based on these units. Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder. Such a paradigm allows us to go beyond spectral and parametric changes of the signal, and model non-verbal vocalizations, such as laughter insertion, yawning removal, etc. We demonstrate objectively and subjectively that the proposed method is vastly superior to current approaches and even beats text-based systems in terms of perceived emotion and audio quality. We rigorously evaluate all components of such a complex system and conclude with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method. Samples are available under the following link: https://speechbot.github.io/emotion.
    Adaptive Conformal Predictions for Time Series. (arXiv:2202.07282v1 [stat.ML])
    Uncertainty quantification of predictive models is crucial in decision-making problems. Conformal prediction is a general and theoretically sound answer. However, it requires exchangeable data, excluding time series. While recent works tackled this issue, we argue that Adaptive Conformal Inference (ACI, Gibbs and Cand{\`e}s, 2021), developed for distribution-shift time series, is a good procedure for time series with general dependency. We theoretically analyse the impact of the learning rate on its efficiency in the exchangeable and auto-regressive case. We propose a parameter-free method, AgACI, that adaptively builds upon ACI based on online expert aggregation. We lead extensive fair simulations against competing methods that advocate for ACI's use in time series. We conduct a real case study: electricity price forecasting. The proposed aggregation algorithm provides efficient prediction intervals for day-ahead forecasting. All the code and data to reproduce the experiments is made available.
    Confidence Threshold Neural Diving. (arXiv:2202.07506v1 [math.OC])
    Finding a better feasible solution in a shorter time is an integral part of solving Mixed Integer Programs. We present a post-hoc method based on Neural Diving to build heuristics more flexibly. We hypothesize that variables with higher confidence scores are more definite to be included in the optimal solution. For our hypothesis, we provide empirical evidence that confidence threshold technique produces partial solutions leading to final solutions with better primal objective values. Our method won 2nd place in the primal task on the NeurIPS 2021 ML4CO competition. Also, our method shows the best score among other learning-based methods in the competition.
    A Robust Cybersecurity Topic Classification Tool. (arXiv:2109.02473v2 [cs.IR] UPDATED)
    In this research, we use user defined labels from three internet text sources (Reddit, Stackexchange, Arxiv) to train 21 different machine learning models for the topic classification task of detecting cybersecurity discussions in natural text. We analyze the false positive and false negative rates of each of the 21 model's in a cross validation experiment. Then we present a Cybersecurity Topic Classification (CTC) tool, which takes the majority vote of the 21 trained machine learning models as the decision mechanism for detecting cybersecurity related text. We also show that the majority vote mechanism of the CTC tool provides lower false negative and false positive rates on average than any of the 21 individual models. We show that the CTC tool is scalable to the hundreds of thousands of documents with a wall clock time on the order of hours.
    Toward Degradation-Robust Voice Conversion. (arXiv:2110.07537v2 [eess.AS] UPDATED)
    Any-to-any voice conversion technologies convert the vocal timbre of an utterance to any speaker even unseen during training. Although there have been several state-of-the-art any-to-any voice conversion models, they were all based on clean utterances to convert successfully. However, in real-world scenarios, it is difficult to collect clean utterances of a speaker, and they are usually degraded by noises or reverberations. It thus becomes highly desired to understand how these degradations affect voice conversion and build a degradation-robust model. We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion. We show that the performance of state-of-the-art models nowadays was severely hampered given degraded utterances. To this end, we then propose speech enhancement concatenation and denoising training to improve the robustness. In addition to common degradations, we also consider adversarial noises, which alter the model output significantly yet are human-imperceptible. It was shown that both concatenations with off-the-shelf speech enhancement models and denoising training on voice conversion models could improve the robustness, while each of them had pros and cons.
    Statistical Inference After Adaptive Sampling in Non-Markovian Environments. (arXiv:2202.07098v1 [cs.LG])
    There is a great desire to use adaptive sampling methods, such as reinforcement learning (RL) and bandit algorithms, for the real-time personalization of interventions in digital applications like mobile health and education. A major obstacle preventing more widespread use of such algorithms in practice is the lack of assurance that the resulting adaptively collected data can be used to reliably answer inferential questions, including questions about time-varying causal effects. Current methods for statistical inference on such data are insufficient because they (a) make strong assumptions regarding the environment dynamics, e.g., assume a contextual bandit or Markovian environment, or (b) require data to be collected with one adaptive sampling algorithm per user, which excludes data collected by algorithms that learn to select actions by pooling the data of multiple users. In this work, we make initial progress by introducing the adaptive sandwich estimator to quantify uncertainty; this estimator (a) is valid even when user rewards and contexts are non-stationary and highly dependent over time, and (b) accommodates settings in which an online adaptive sampling algorithm learns using the data of all users. Furthermore, our inference method is robust to misspecification of the reward models used by the adaptive sampling algorithm. This work is motivated by our work designing experiments in which RL algorithms are used to select actions, yet reliable statistical inference is essential for conducting primary analyses after the trial is over.
    Bayesian Optimisation for Active Monitoring of Air Pollution. (arXiv:2202.07595v1 [cs.LG])
    Air pollution is one of the leading causes of mortality globally, resulting in millions of deaths each year. Efficient monitoring is important to measure exposure and enforce legal limits. New low-cost sensors can be deployed in greater numbers and in more varied locations, motivating the problem of efficient automated placement. Previous work suggests Bayesian optimisation is an appropriate method, but only considered a satellite data set, with data aggregated over all altitudes. It is ground-level pollution, that humans breathe, which matters most. We improve on those results using hierarchical models and evaluate our models on urban pollution data in London to show that Bayesian optimisation can be successfully applied to the problem.
    FAST-PCA: A Fast and Exact Algorithm for Distributed Principal Component Analysis. (arXiv:2108.12373v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a fundamental data preprocessing tool in the world of machine learning. While PCA is often thought of as a dimensionality reduction method, the purpose of PCA is actually two-fold: dimension reduction and uncorrelated feature learning. Furthermore, the enormity of the dimensions and sample size in the modern day datasets have rendered the centralized PCA solutions unusable. In that vein, this paper reconsiders the problem of PCA when data samples are distributed across nodes in an arbitrarily connected network. While a few solutions for distributed PCA exist, those either overlook the uncorrelated feature learning aspect of the PCA, tend to have high communication overhead that makes them inefficient and/or lack `exact' or `global' convergence guarantees. To overcome these aforementioned issues, this paper proposes a distributed PCA algorithm termed FAST-PCA (Fast and exAct diSTributed PCA). The proposed algorithm is efficient in terms of communication and is proven to converge linearly and exactly to the principal components, leading to dimension reduction as well as uncorrelated features. The claims are further supported by experimental results.
    UserBERT: Modeling Long- and Short-Term User Preferences via Self-Supervision. (arXiv:2202.07605v1 [cs.LG])
    E-commerce platforms generate vast amounts of customer behavior data, such as clicks and purchases, from millions of unique users every day. However, effectively using this data for behavior understanding tasks is challenging because there are usually not enough labels to learn from all users in a supervised manner. This paper extends the BERT model to e-commerce user data for pre-training representations in a self-supervised manner. By viewing user actions in sequences as analogous to words in sentences, we extend the existing BERT model to user behavior data. Further, our model adopts a unified structure to simultaneously learn from long-term and short-term user behavior, as well as user attributes. We propose methods for the tokenization of different types of user behavior sequences, the generation of input representation vectors, and a novel pretext task to enable the pre-trained model to learn from its own input, eliminating the need for labeled training data. Extensive experiments demonstrate that the learned representations result in significant improvements when transferred to three different real-world tasks, particularly compared to task-specific modeling and multi-task representation learning
    Deep Convolutional Autoencoder for Assessment of Anomalies in Multi-stream Sensor Data. (arXiv:2202.07592v1 [cs.LG])
    A fully convolutional autoencoder is developed for the detection of anomalies in multi-sensor vehicle drive-cycle data from the powertrain domain. Preliminary results collected on real-world powertrain data show that the reconstruction error of faulty drive cycles deviates significantly relative to the reconstruction of healthy drive cycles using the trained autoencoder. The results demonstrate applicability for identifying faulty drive-cycles, and for improving the accuracy of system prognosis and predictive maintenance in connected vehicles.
    CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning. (arXiv:2202.07565v1 [cs.LG])
    Safe reinforcement learning (RL) is still very challenging since it requires the agent to consider both return maximization and safe exploration. In this paper, we propose CUP, a Conservative Update Policy algorithm with a theoretical safety guarantee. We derive the CUP based on the new proposed performance bounds and surrogate functions. Although using bounds as surrogate functions to design safe RL algorithms have appeared in some existing works, we develop them at least three aspects: (i) We provide a rigorous theoretical analysis to extend the surrogate functions to generalized advantage estimator (GAE). GAE significantly reduces variance empirically while maintaining a tolerable level of bias, which is an efficient step for us to design CUP; (ii) The proposed bounds are tighter than existing works, i.e., using the proposed bounds as surrogate functions are better local approximations to the objective and safety constraints. (iii) The proposed CUP provides a non-convex implementation via first-order optimizers, which does not depend on any convex approximation. Finally, extensive experiments show the effectiveness of CUP where the agent satisfies safe constraints. We have opened the source code of CUP at https://github.com/RL-boxes/Safe-RL.
    A Survey on Model Compression for Natural Language Processing. (arXiv:2202.07105v1 [cs.CL])
    With recent developments in new architectures like Transformer and pretraining techniques, significant progress has been made in applications of natural language processing (NLP). However, the high energy cost and long inference delay of Transformer is preventing NLP from entering broader scenarios including edge and mobile computing. Efficient NLP research aims to comprehensively consider computation, time and carbon emission for the entire life-cycle of NLP, including data preparation, model training and inference. In this survey, we focus on the inference stage and review the current state of model compression for NLP, including the benchmarks, metrics and methodology. We outline the current obstacles and future research directions.
    Discriminability-enforcing loss to improve representation learning. (arXiv:2202.07073v1 [cs.CV])
    During the training process, deep neural networks implicitly learn to represent the input data samples through a hierarchy of features, where the size of the hierarchy is determined by the number of layers. In this paper, we focus on enforcing the discriminative power of the high-level representations, that are typically learned by the deeper layers (closer to the output). To this end, we introduce a new loss term inspired by the Gini impurity, which is aimed at minimizing the entropy (increasing the discriminative power) of individual high-level features with respect to the class labels. Although our Gini loss induces highly-discriminative features, it does not ensure that the distribution of the high-level features matches the distribution of the classes. As such, we introduce another loss term to minimize the Kullback-Leibler divergence between the two distributions. We conduct experiments on two image classification data sets (CIFAR-100 and Caltech 101), considering multiple neural architectures ranging from convolutional networks (ResNet-17, ResNet-18, ResNet-50) to transformers (CvT). Our empirical results show that integrating our novel loss terms into the training objective consistently outperforms the models trained with cross-entropy alone.
    Compositional Scene Representation Learning via Reconstruction: A Survey. (arXiv:2202.07135v1 [cs.LG])
    Visual scene representation learning is an important research problem in the field of computer vision. The performance on vision tasks could be improved if more suitable representations are learned for visual scenes. Complex visual scenes are the composition of relatively simple visual concepts, and have the property of combinatorial explosion. Compared with directly representing the entire visual scene, extracting compositional scene representations can better cope with the diverse combination of background and objects. Because compositional scene representations abstract the concept of objects, performing visual scene analysis and understanding based on these representations could be easier and more interpretable. Moreover, learning compositional scene representations via reconstruction can greatly reduce the need for training data annotations. Therefore, compositional scene representation learning via reconstruction has important research significance. In this survey, we first discuss representative methods that either learn from a single viewpoint or multiple viewpoints without object-level supervision, then the applications of compositional scene representations, and finally the future directions on this topic.
    On Incorporating Inductive Biases into VAEs. (arXiv:2106.13746v2 [stat.ML] UPDATED)
    We explain why directly changing the prior can be a surprisingly ineffective mechanism for incorporating inductive biases into VAEs, and introduce a simple and effective alternative approach: Intermediary Latent Space VAEs(InteL-VAEs). InteL-VAEs use an intermediary set of latent variables to control the stochasticity of the encoding process, before mapping these in turn to the latent representation using a parametric function that encapsulates our desired inductive bias(es). This allows us to impose properties like sparsity or clustering on learned representations, and incorporate human knowledge into the generative model. Whereas changing the prior only indirectly encourages behavior through regularizing the encoder, InteL-VAEs are able to directly enforce desired characteristics. Moreover, they bypass the computation and encoder design issues caused by non-Gaussian priors, while allowing for additional flexibility through training of the parametric mapping function. We show that these advantages, in turn, lead to both better generative models and better representations being learned.
    Domain Invariant Representation Learning with Domain Density Transformations. (arXiv:2102.05082v3 [cs.LG] UPDATED)
    Domain generalization refers to the problem where we aim to train a model on data from a set of source domains so that the model can generalize to unseen target domains. Naively training a model on the aggregate set of data (pooled from all source domains) has been shown to perform suboptimally, since the information learned by that model might be domain-specific and generalize imperfectly to target domains. To tackle this problem, a predominant approach is to find and learn some domain-invariant information in order to use it for the prediction task. In this paper, we propose a theoretically grounded method to learn a domain-invariant representation by enforcing the representation network to be invariant under all transformation functions among domains. We also show how to use generative adversarial networks to learn such domain transformations to implement our method in practice. We demonstrate the effectiveness of our method on several widely used datasets for the domain generalization problem, on all of which we achieve competitive results with state-of-the-art models.
    An algorithmic solution to the Blotto game using multi-marginal couplings. (arXiv:2202.07318v1 [cs.GT])
    We describe an efficient algorithm to compute solutions for the general two-player Blotto game on n battlefields with heterogeneous values. While explicit constructions for such solutions have been limited to specific, largely symmetric or homogeneous, setups, this algorithmic resolution covers the most general situation to date: value-asymmetric game with asymmetric budget. The proposed algorithm rests on recent theoretical advances regarding Sinkhorn iterations for matrix and tensor scaling. An important case which had been out of reach of previous attempts is that of heterogeneous but symmetric battlefield values with asymmetric budget. In this case, the Blotto game is constant-sum so optimal solutions exist, and our algorithm samples from an \eps-optimal solution in time O(n^2 + \eps^{-4}), independently of budgets and battlefield values. In the case of asymmetric values where optimal solutions need not exist but Nash equilibria do, our algorithm samples from an \eps-Nash equilibrium with similar complexity but where implicit constants depend on various parameters of the game such as battlefield values.
    L2C2: Locally Lipschitz Continuous Constraint towards Stable and Smooth Reinforcement Learning. (arXiv:2202.07152v1 [cs.RO])
    This paper proposes a new regularization technique for reinforcement learning (RL) towards making policy and value functions smooth and stable. RL is known for the instability of the learning process and the sensitivity of the acquired policy to noise. Several methods have been proposed to resolve these problems, and in summary, the smoothness of policy and value functions learned mainly in RL contributes to these problems. However, if these functions are extremely smooth, their expressiveness would be lost, resulting in not obtaining the global optimal solution. This paper therefore considers RL under local Lipschitz continuity constraint, so-called L2C2. By designing the spatio-temporal locally compact space for L2C2 from the state transition at each time step, the moderate smoothness can be achieved without loss of expressiveness. Numerical noisy simulations verified that the proposed L2C2 outperforms the task performance while smoothing out the robot action generated from the learned policy.
    RL4RS: A Real-World Benchmark for Reinforcement Learning based Recommender System. (arXiv:2110.11073v2 [cs.IR] UPDATED)
    Reinforcement learning based recommender systems (RL-based RS) aims at learning a good policy from a batch of collected data, by casting sequential recommendation to multi-step decision-making tasks. However, current RL-based RS benchmarks commonly have a large reality gap, because they involve artificial RL datasets or semi-simulated RS datasets, and the trained policy is directly evaluated in the simulation environment. In real-world situations, not all recommendation problems are suitable to be transformed into reinforcement learning problems. Unlike previous academic RL research, RL-based RS suffers from extrapolation error and the difficulties of being well-validated before deployment. In this paper, we introduce the RL4RS (Reinforcement Learning for Recommender Systems) benchmark - a new resource fully collected from industrial applications to train and evaluate RL algorithms with special concerns on the above issues. It contains two datasets, tuned simulation environments, related advanced RL baselines, data understanding tools, and counterfactual policy evaluation algorithms. The RL4RS suit can be found at https://github.com/fuxiAIlab/RL4RS. In addition to the RL-based recommender systems, we expect the resource to contribute to research in reinforcement learning and neural combinatorial optimization.
    Learning distinct features helps, provably. (arXiv:2106.06012v2 [cs.LG] UPDATED)
    We study the diversity of the features learned by a two-layer neural network trained with the least squares loss. We measure the diversity by the average $L_2$-distance between the hidden-layer features and theoretically investigate how learning non-redundant distinct features affects the performance of the network. To do so, we derive novel generalization bounds depending on feature diversity based on Rademacher complexity for such networks. Our analysis proves that more distinct features at the network's hidden units within the intermediate layer lead to better generalization. We also show how to extend our results to deeper networks and different losses.
    XAI for Transformers: Better Explanations through Conservative Propagation. (arXiv:2202.07304v1 [cs.LG])
    Transformers have become an important workhorse of machine learning, with numerous applications. This necessitates the development of reliable methods for increasing their transparency. Multiple interpretability methods, often based on gradient information, have been proposed. We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction. We identify Attention Heads and LayerNorm as main reasons for such unreliable explanations and propose a more stable way for propagation through these layers. Our proposal, which can be seen as a proper extension of the well-established LRP method to Transformers, is shown both theoretically and empirically to overcome the deficiency of a simple gradient-based approach, and achieves state-of-the-art explanation performance on a broad range of Transformer models and datasets.
    Wav2CLIP: Learning Robust Audio Representations From CLIP. (arXiv:2110.11499v2 [cs.SD] UPDATED)
    We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification, and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.
    MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures. (arXiv:2006.07540v3 [cs.LG] UPDATED)
    Regularization and transfer learning are two popular techniques to enhance generalization on unseen data, which is a fundamental problem of machine learning. Regularization techniques are versatile, as they are task- and architecture-agnostic, but they do not exploit a large amount of data available. Transfer learning methods learn to transfer knowledge from one domain to another, but may not generalize across tasks and architectures, and may introduce new training cost for adapting to the target task. To bridge the gap between the two, we propose a transferable perturbation, MetaPerturb, which is meta-learned to improve generalization performance on unseen data. MetaPerturb is implemented as a set-based lightweight network that is agnostic to the size and the order of the input, which is shared across the layers. Then, we propose a meta-learning framework, to jointly train the perturbation function over heterogeneous tasks in parallel. As MetaPerturb is a set-function trained over diverse distributions across layers and tasks, it can generalize to heterogeneous tasks and architectures. We validate the efficacy and generality of MetaPerturb trained on a specific source domain and architecture, by applying it to the training of diverse neural architectures on heterogeneous target datasets against various regularizers and fine-tuning. The results show that the networks trained with MetaPerturb significantly outperform the baselines on most of the tasks and architectures, with a negligible increase in the parameter size and no hyperparameters to tune.
    Recent Advances in Reliable Deep Graph Learning: Adversarial Attack, Inherent Noise, and Distribution Shift. (arXiv:2202.07114v1 [cs.LG])
    Deep graph learning (DGL) has achieved remarkable progress in both business and scientific areas ranging from finance and e-commerce to drug and advanced material discovery. Despite the progress, applying DGL to real-world applications faces a series of reliability threats including adversarial attacks, inherent noise, and distribution shift. This survey aims to provide a comprehensive review of recent advances for improving the reliability of DGL algorithms against the above threats. In contrast to prior related surveys which mainly focus on adversarial attacks and defense, our survey covers more reliability-related aspects of DGL, i.e., inherent noise and distribution shift. Additionally, we discuss the relationships among above aspects and highlight some important issues to be explored in future research.
    Multi-class granular approximation by means of disjoint and adjacent fuzzy granules. (arXiv:2202.07584v1 [cs.AI])
    In granular computing, fuzzy sets can be approximated by granularly representable sets that are as close as possible to the original fuzzy set w.r.t. a given closeness measure. Such sets are called granular approximations. In this article, we introduce the concepts of disjoint and adjacent granules and we examine how the new definitions affect the granular approximations. First, we show that the new concepts are important for binary classification problems since they help to keep decision regions separated (disjoint granules) and at the same time to cover as much as possible of the attribute space (adjacent granules). Later, we consider granular approximations for multi-class classification problems leading to the definition of a multi-class granular approximation. Finally, we show how to efficiently calculate multi-class granular approximations for {\L}ukasiewicz fuzzy connectives. We also provide graphical illustrations for a better understanding of the introduced concepts.
    Orthogonalising gradients to speed up neural network optimisation. (arXiv:2202.07052v1 [cs.LG])
    The optimisation of neural networks can be sped up by orthogonalising the gradients before the optimisation step, ensuring the diversification of the learned representations. We orthogonalise the gradients of the layer's components/filters with respect to each other to separate out the intermediate representations. Our method of orthogonalisation allows the weights to be used more flexibly, in contrast to restricting the weights to an orthogonalised sub-space. We tested this method on ImageNet and CIFAR-10 resulting in a large decrease in learning time, and also obtain a speed-up on the semi-supervised learning BarlowTwins. We obtain similar accuracy to SGD without fine-tuning and better accuracy for na\"ively chosen hyper-parameters.
    Federated Learning with Sparsified Model Perturbation: Improving Accuracy under Client-Level Differential Privacy. (arXiv:2202.07178v1 [cs.LG])
    Federated learning (FL) that enables distributed clients to collaboratively learn a shared statistical model while keeping their training data locally has received great attention recently and can improve privacy and communication efficiency in comparison with traditional centralized machine learning paradigm. However, sensitive information about the training data can still be inferred from model updates shared in FL. Differential privacy (DP) is the state-of-the-art technique to defend against those attacks. The key challenge to achieve DP in FL lies in the adverse impact of DP noise on model accuracy, particularly for deep learning models with large numbers of model parameters. This paper develops a novel differentially-private FL scheme named Fed-SMP that provides client-level DP guarantee while maintaining high model accuracy. To mitigate the impact of privacy protection on model accuracy, Fed-SMP leverages a new technique called Sparsified Model Perturbation (SMP), where local models are sparsified first before being perturbed with additive Gaussian noise. Two sparsification strategies are considered in Fed-SMP: random sparsification and top-$k$ sparsification. We also apply R{\'e}nyi differential privacy to providing a tight analysis for the end-to-end DP guarantee of Fed-SMP and prove the convergence of Fed-SMP with general loss functions. Extensive experiments on real-world datasets are conducted to demonstrate the effectiveness of Fed-SMP in largely improving model accuracy with the same level of DP guarantee and saving communication cost simultaneously.
    ADeADA: Adaptive Density-aware Active Domain Adaptation for Semantic Segmentation. (arXiv:2202.06484v2 [cs.CV] UPDATED)
    In the field of domain adaptation, a trade-off exists between the model performance and the number of target domain annotations. Active learning, maximizing model performance with few informative labeled data, comes in handy for such a scenario. In this work, we present ADeADA, a general active domain adaptation framework for semantic segmentation. To adapt the model to the target domain with minimum queried labels, we propose acquiring labels of the samples with high probability density in the target domain yet with low probability density in the source domain, complementary to the existing source domain labeled data. To further facilitate the label efficiency, we design an adaptive budget allocation policy, which dynamically balances the labeling budgets among different categories as well as between density-aware and uncertainty-based methods. Extensive experiments show that our method outperforms existing active learning and domain adaptation baselines on two benchmarks, GTA5 -> Cityscapes and SYNTHIA -> Cityscapes. With less than 5% target domain annotations, our method reaches comparable results with that of full supervision.
    One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones. (arXiv:2202.07028v1 [cs.AI])
    We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons. However, when it comes to long-horizon tasks with extended sequences of actions, an agent can easily ignore some instructions or get stuck in the middle of the long instructions and eventually fail the task. To address this challenge, we propose a model-agnostic milestone-based task tracker (M-TRACK) to guide the agent and monitor its progress. Specifically, we propose a milestone builder that tags the instructions with navigation and interaction milestones which the agent needs to complete step by step, and a milestone checker that systemically checks the agent's progress in its current milestone and determines when to proceed to the next. On the challenging ALFRED dataset, our M-TRACK leads to a notable 45% and 70% relative improvement in unseen success rate over two competitive base models.
    Towards Best Practice of Interpreting Deep Learning Models for EEG-based Brain Computer Interfaces. (arXiv:2202.06948v1 [cs.NE])
    Understanding deep learning models is important for EEG-based brain-computer interface (BCI), since it not only can boost trust of end users but also potentially shed light on reasons that cause a model to fail. However, deep learning interpretability has not yet raised wide attention in this field. It remains unknown how reliably existing interpretation techniques can be used and to which extent they can reflect the model decisions. In order to fill this research gap, we conduct the first quantitative evaluation and explore the best practice of interpreting deep learning models designed for EEG-based BCI. We design metrics and test seven well-known interpretation techniques on benchmark deep learning models. Results show that methods of GradientInput, DeepLIFT, integrated gradient, and layer-wise relevance propagation (LRP) have similar and better performance than saliency map, deconvolution and guided backpropagation methods for interpreting the model decisions. In addition, we propose a set of processing steps that allow the interpretation results to be visualized in an understandable and trusted way. Finally, we illustrate with samples on how deep learning interpretability can benefit the domain of EEG-based BCI. Our work presents a promising direction of introducing deep learning interpretability to EEG-based BCI.
    Partially Fake Audio Detection by Self-attention-based Fake Span Discovery. (arXiv:2202.06684v2 [eess.AS] UPDATED)
    The past few years have witnessed the significant advances of speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of broadly implemented biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on synthesized audios by advanced speech synthesis and voice conversion models, and replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extends the attack scenarios into more aspects. Also ADD 2022 is the first challenge to propose the partially fake audio detection task. Such brand new attacks are dangerous and how to tackle such attacks remains an open question. Thus, we propose a novel framework by introducing the question-answering (fake span discovery) strategy with the self-attention mechanism to detect partially fake audios. The proposed fake span detection module tasks the anti-spoofing model to predict the start and end positions of the fake clip within the partially fake audio, address the model's attention into discovering the fake spans rather than other shortcuts with less generalization, and finally equips the model with the discrimination capacity between real and partially fake audios. Our submission ranked second in the partially fake audio detection track of ADD 2022.  ( 2 min )
    Transformer Memory as a Differentiable Search Index. (arXiv:2202.06991v1 [cs.CL])
    In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.
    Principal Manifold Flows. (arXiv:2202.07037v1 [stat.ML])
    Normalizing flows map an independent set of latent variables to their samples using a bijective transformation. Despite the exact correspondence between samples and latent variables, their high level relationship is not well understood. In this paper we characterize the geometric structure of flows using principal manifolds and understand the relationship between latent variables and samples using contours. We introduce a novel class of normalizing flows, called principal manifold flows (PF), whose contours are its principal manifolds, and a variant for injective flows (iPF) that is more efficient to train than regular injective flows. PFs can be constructed using any flow architecture, are trained with a regularized maximum likelihood objective and can perform density estimation on all of their principal manifolds. In our experiments we show that PFs and iPFs are able to learn the principal manifolds over a variety of datasets. Additionally, we show that PFs can perform density estimation on data that lie on a manifold with variable dimensionality, which is not possible with existing normalizing flows.
    BED: A Real-Time Object Detection System for Edge Devices. (arXiv:2202.07503v1 [cs.CV])
    Deploying machine learning models to edge devices has many real-world applications, especially for the scenarios that demand low latency, low power, or data privacy. However, it requires substantial research and engineering efforts due to the limited computational resources and memory of edge devices. In this demo, we present BED, an object detection system for edge devices practiced on the MAX78000 DNN accelerator. BED integrates on-device DNN inference with a camera and a screen for image acquisition and output exhibition, respectively. Experiment results indicate BED can provide accurate detection with an only 300KB tiny DNN model.  ( 2 min )
    Zero-Shot Assistance in Novel Decision Problems. (arXiv:2202.07364v1 [cs.LG])
    We consider the problem of creating assistants that can help agents - often humans - solve novel sequential decision problems, assuming the agent is not able to specify the reward function explicitly to the assistant. Instead of aiming to automate, and act in place of the agent as in current approaches, we give the assistant an advisory role and keep the agent in the loop as the main decision maker. The difficulty is that we must account for potential biases induced by limitations or constraints of the agent which may cause it to seemingly irrationally reject advice. To do this we introduce a novel formalization of assistance that models these biases, allowing the assistant to infer and adapt to them. We then introduce a new method for planning the assistant's advice which can scale to large decision making problems. Finally, we show experimentally that our approach adapts to these agent biases, and results in higher cumulative reward for the agent than automation-based alternatives.  ( 2 min )
    SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation. (arXiv:2202.07471v1 [cs.LG])
    Quantization of deep neural networks (DNN) has been proven effective for compressing and accelerating DNN models. Data-free quantization (DFQ) is a promising approach without the original datasets under privacy-sensitive and confidential scenarios. However, current DFQ solutions degrade accuracy, need synthetic data to calibrate networks, and are time-consuming and costly. This paper proposes an on-the-fly DFQ framework with sub-second quantization time, called SQuant, which can quantize networks on inference-only devices with low computation and memory requirements. With the theoretical analysis of the second-order information of DNN task loss, we decompose and approximate the Hessian-based optimization objective into three diagonal sub-items, which have different areas corresponding to three dimensions of weight tensor: element-wise, kernel-wise, and output channel-wise. Then, we progressively compose sub-items and propose a novel data-free optimization objective in the discrete domain, minimizing Constrained Absolute Sum of Error (or CASE in short), which surprisingly does not need any dataset and is even not aware of network architecture. We also design an efficient algorithm without back-propagation to further reduce the computation complexity of the objective solver. Finally, without fine-tuning and synthetic datasets, SQuant accelerates the data-free quantization process to a sub-second level with >30% accuracy improvement over the existing data-free post-training quantization works, with the evaluated models under 4-bit quantization. We have open-sourced the SQuant framework at https://github.com/clevercool/SQuant.  ( 2 min )
    Deep Ensembles Work, But Are They Necessary?. (arXiv:2202.06985v1 [cs.LG])
    Ensembling neural networks is an effective way to increase accuracy, and can often match the performance of larger models. This observation poses a natural question: given the choice between a deep ensemble and a single neural network with similar accuracy, is one preferable over the other? Recent work suggests that deep ensembles may offer benefits beyond predictive power: namely, uncertainty quantification and robustness to dataset shift. In this work, we demonstrate limitations to these purported benefits, and show that a single (but larger) neural network can replicate these qualities. First, we show that ensemble diversity, by any metric, does not meaningfully contribute to an ensemble's ability to detect out-of-distribution (OOD) data, and that one can estimate ensemble diversity by measuring the relative improvement of a single larger model. Second, we show that the OOD performance afforded by ensembles is strongly determined by their in-distribution (InD) performance, and -- in this sense -- is not indicative of any "effective robustness". While deep ensembles are a practical way to achieve performance improvement (in agreement with prior work), our results show that they may be a tool of convenience rather than a fundamentally better model class.  ( 2 min )
    TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm. (arXiv:2202.07172v1 [stat.ML])
    Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it can be shown that $c_{t,d}\ge 2$ for all other $t$ and $d$. Yet current computationally efficient algorithms show only $c_{t,1}\le 2.25$ and the bound rises quickly to $c_{t,d}\le 3$ for $d\ge 9$. We derive a near-linear-time and essentially sample-optimal estimator that establishes $c_{t,d}=2$ for all $(t,d)\ne(1,0)$. Additionally, for many practical distributions, the lowest approximation distance is achieved by polynomials with vastly varying number of pieces. We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation. Experiments combining the two techniques confirm improved performance over existing methodologies.  ( 2 min )
    Network Comparison with Interpretable Contrastive Network Representation Learning. (arXiv:2005.12419v2 [cs.LG] UPDATED)
    Identifying unique characteristics in a network through comparison with another network is an essential network analysis task. For example, with networks of protein interactions obtained from normal and cancer tissues, we can discover unique types of interactions in cancer tissues. This analysis task could be greatly assisted by contrastive learning, which is an emerging analysis approach to discover salient patterns in one dataset relative to another. However, existing contrastive learning methods cannot be directly applied to networks as they are designed only for high-dimensional data analysis. To address this problem, we introduce a new analysis approach called contrastive network representation learning (cNRL). By integrating two machine learning schemes, network representation learning and contrastive learning, cNRL enables embedding of network nodes into a low-dimensional representation that reveals the uniqueness of one network compared to another. Within this approach, we also design a method, named i-cNRL, which offers interpretability in the learned results, allowing for understanding which specific patterns are only found in one network. We demonstrate the effectiveness of i-cNRL for network comparison with multiple network models and real-world datasets. Furthermore, we compare i-cNRL and other potential cNRL algorithm designs through quantitative and qualitative evaluations.  ( 2 min )
    Predicting on the Edge: Identifying Where a Larger Model Does Better. (arXiv:2202.07652v1 [cs.LG])
    Much effort has been devoted to making large and more accurate models, but relatively little has been put into understanding which examples are benefiting from the added complexity. In this paper, we demonstrate and analyze the surprisingly tight link between a model's predictive uncertainty on individual examples and the likelihood that larger models will improve prediction on them. Through extensive numerical studies on the T5 encoder-decoder architecture, we show that large models have the largest improvement on examples where the small model is most uncertain. On more certain examples, even those where the small model is not particularly accurate, large models are often unable to improve at all, and can even perform worse than the smaller model. Based on these findings, we show that a switcher model which defers examples to a larger model when a small model is uncertain can achieve striking improvements in performance and resource usage. We also explore committee-based uncertainty metrics that can be more effective but less practical.  ( 2 min )
    XEM: An Explainable-by-Design Ensemble Method for Multivariate Time Series Classification. (arXiv:2005.03645v5 [cs.LG] UPDATED)
    We present XEM, an eXplainable-by-design Ensemble method for Multivariate time series classification. XEM relies on a new hybrid ensemble method that combines an explicit boosting-bagging approach to handle the bias-variance trade-off faced by machine learning models and an implicit divide-and-conquer approach to individualize classifier errors on different parts of the training data. Our evaluation shows that XEM outperforms the state-of-the-art MTS classifiers on the public UEA datasets. Furthermore, XEM provides faithful explainability-by-design and manifests robust performance when faced with challenges arising from continuous data collection (different MTS length, missing data and noise).
    Recurrent Neural Networks for Dynamical Systems: Applications to Ordinary Differential Equations, Collective Motion, and Hydrological Modeling. (arXiv:2202.07022v1 [math.DS])
    Classical methods of solving spatiotemporal dynamical systems include statistical approaches such as autoregressive integrated moving average, which assume linear and stationary relationships between systems' previous outputs. Development and implementation of linear methods are relatively simple, but they often do not capture non-linear relationships in the data. Thus, artificial neural networks (ANNs) are receiving attention from researchers in analyzing and forecasting dynamical systems. Recurrent neural networks (RNN), derived from feed-forward ANNs, use internal memory to process variable-length sequences of inputs. This allows RNNs to applicable for finding solutions for a vast variety of problems in spatiotemporal dynamical systems. Thus, in this paper, we utilize RNNs to treat some specific issues associated with dynamical systems. Specifically, we analyze the performance of RNNs applied to three tasks: reconstruction of correct Lorenz solutions for a system with a formulation error, reconstruction of corrupted collective motion trajectories, and forecasting of streamflow time series possessing spikes, representing three fields, namely, ordinary differential equations, collective motion, and hydrological modeling, respectively. We train and test RNNs uniquely in each task to demonstrate the broad applicability of RNNs in reconstruction and forecasting the dynamics of dynamical systems.
    Synthetically Controlled Bandits. (arXiv:2202.07079v1 [stat.ML])
    This paper presents a new dynamic approach to experiment design in settings where, due to interference or other concerns, experimental units are coarse. `Region-split' experiments on online platforms are one example of such a setting. The cost, or regret, of experimentation is a natural concern here. Our new design, dubbed Synthetically Controlled Thompson Sampling (SCTS), minimizes the regret associated with experimentation at no practically meaningful loss to inferential ability. We provide theoretical guarantees characterizing the near-optimal regret of our approach, and the error rates achieved by the corresponding treatment effect estimator. Experiments on synthetic and real world data highlight the merits of our approach relative to both fixed and `switchback' designs common to such experimental settings.  ( 2 min )
    Bayesian Imitation Learning for End-to-End Mobile Manipulation. (arXiv:2202.07600v1 [cs.RO])
    In this work we investigate and demonstrate benefits of a Bayesian approach to imitation learning from multiple sensor inputs, as applied to the task of opening office doors with a mobile manipulator. Augmenting policies with additional sensor inputs, such as RGB + depth cameras, is a straightforward approach to improving robot perception capabilities, especially for tasks that may favor different sensors in different situations. As we scale multi-sensor robotic learning to unstructured real-world settings (e.g. offices, homes) and more complex robot behaviors, we also increase reliance on simulators for cost, efficiency, and safety. Consequently, the sim-to-real gap across multiple sensor modalities also increases, making simulated validation more difficult. We show that using the Variational Information Bottleneck (Alemi et al., 2016) to regularize convolutional neural networks improves generalization to held-out domains and reduces the sim-to-real gap in a sensor-agnostic manner. As a side effect, the learned embeddings also provide useful estimates of model uncertainty for each sensor. We demonstrate that our method is able to help close the sim-to-real gap and successfully fuse RGB and depth modalities based on understanding of the situational uncertainty of each sensor. In a real-world office environment, we achieve 96% task success, improving upon the baseline by +16%.
    Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection. (arXiv:2202.06934v2 [cs.CV] UPDATED)
    Detection of small objects and objects far away in the scene is a major challenge in surveillance applications. Such objects are represented by small number of pixels in the image and lack sufficient details, making them difficult to detect using conventional detectors. In this work, an open-source framework called Slicing Aided Hyper Inference (SAHI) is proposed that provides a generic slicing aided inference and fine-tuning pipeline for small object detection. The proposed technique is generic in the sense that it can be applied on top of any available object detector without any fine-tuning. Experimental evaluations, using object detection baselines on the Visdrone and xView aerial object detection datasets show that the proposed inference method can increase object detection AP by 6.8%, 5.1% and 5.3% for FCOS, VFNet and TOOD detectors, respectively. Moreover, the detection accuracy can be further increased with a slicing aided fine-tuning, resulting in a cumulative increase of 12.7%, 13.4% and 14.5% AP in the same order. Proposed technique has been integrated with Detectron2, MMDetection and YOLOv5 models and it is publicly available at https://github.com/obss/sahi.git .  ( 2 min )
    Machine Learning in Aerodynamic Shape Optimization. (arXiv:2202.07141v1 [cs.LG])
    Large volumes of experimental and simulation aerodynamic data have been rapidly advancing aerodynamic shape optimization (ASO) via machine learning (ML), whose effectiveness has been growing thanks to continued developments in deep learning. In this review, we first introduce the state of the art and the unsolved challenges in ASO. Next, we present a description of ML fundamentals and detail the ML algorithms that have succeeded in ASO. Then we review ML applications contributing to ASO from three fundamental perspectives: compact geometric design space, fast aerodynamic analysis, and efficient optimization architecture. In addition to providing a comprehensive summary of the research, we comment on the practicality and effectiveness of the developed methods. We show how cutting-edge ML approaches can benefit ASO and address challenging demands like interactive design optimization. However, practical large-scale design optimizations remain a challenge due to the costly ML training expense. A deep coupling of ML model construction with ASO prior experience and knowledge, such as taking physics into account, is recommended to train ML models effectively.  ( 2 min )
    Generalisation and the Risk--Entropy Curve. (arXiv:2202.07350v1 [cs.LG])
    In this paper we show that the expected generalisation performance of a learning machine is determined by the distribution of risks or equivalently its logarithm -- a quantity we term the risk entropy -- and the fluctuations in a quantity we call the training ratio. We show that the risk entropy can be empirically inferred for deep neural network models using Markov Chain Monte Carlo techniques. Results are presented for different deep neural networks on a variety of problems. The asymptotic behaviour of the risk entropy acts in an analogous way to the capacity of the learning machine, but the generalisation performance experienced in practical situations is determined by the behaviour of the risk entropy before the asymptotic regime is reached. This performance is strongly dependent on the distribution of the data (features and targets) and not just on the capacity of the learning machine.  ( 2 min )
    Sequential Bayesian experimental designs via reinforcement learning. (arXiv:2202.07472v1 [cs.LG])
    Bayesian experimental design (BED) has been used as a method for conducting efficient experiments based on Bayesian inference. The existing methods, however, mostly focus on maximizing the expected information gain (EIG); the cost of experiments and sample efficiency are often not taken into account. In order to address this issue and enhance practical applicability of BED, we provide a new approach Sequential Experimental Design via Reinforcement Learning to construct BED in a sequential manner by applying reinforcement learning in this paper. Here, reinforcement learning is a branch of machine learning in which an agent learns a policy to maximize its reward by interacting with the environment. The characteristics of interacting with the environment are similar to the sequential experiment, and reinforcement learning is indeed a method that excels at sequential decision making. By proposing a new real-world-oriented experimental environment, our approach aims to maximize the EIG while keeping the cost of experiments and sample efficiency in mind simultaneously. We conduct numerical experiments for three different examples. It is confirmed that our method outperforms the existing methods in various indices such as the EIG and sampling efficiency, indicating that our proposed method and experimental environment can make a significant contribution to application of BED to the real world.  ( 2 min )
    Fairness Amidst Non-IID Graph Data: A Literature Review. (arXiv:2202.07170v1 [cs.LG])
    Fairness in machine learning (ML), the process to understand and correct algorithmic bias, has gained increasing attention with numerous literature being carried out, commonly assume the underlying data is independent and identically distributed (IID). On the other hand, graphs are a ubiquitous data structure to capture connections among individual units and is non-IID by nature. It is therefore of great importance to bridge the traditional fairness literature designed on IID data and ubiquitous non-IID graph representations to tackle bias in ML systems. In this survey, we review such recent advance in fairness amidst non-IID graph data and identify datasets and evaluation metrics available for future research. We also point out the limitations of existing work as well as promising future directions.  ( 2 min )
    Analytic Mutual Information in Bayesian Neural Networks. (arXiv:2201.09815v2 [cs.IT] UPDATED)
    Bayesian neural networks have successfully designed and optimized a robust neural network model in many application problems, including uncertainty quantification. However, with its recent success, information-theoretic understanding about the Bayesian neural network is still at an early stage. Mutual information is an example of an uncertainty measure in a Bayesian neural network to quantify epistemic uncertainty. Still, no analytic formula is known to describe it, one of the fundamental information measures to understand the Bayesian deep learning framework. In this paper, we derive the analytical formula of the mutual information between model parameters and the predictive output by leveraging the notion of the point process entropy. Then, as an application, we discuss the parameter estimation of the Dirichlet distribution and show its practical application in the active learning uncertainty measures by demonstrating that our analytical formula can improve the performance of active learning further in practice.  ( 2 min )
    Convolutional Network Fabric Pruning With Label Noise. (arXiv:2202.07268v1 [cs.LG])
    This paper presents an iterative pruning strategy for Convolutional Network Fabrics (CNF) in presence of noisy training and testing data. With the continuous increase in size of neural network models, various authors have developed pruning approaches to build more compact network structures requiring less resources, while preserving performance. As we show in this paper, because of their intrinsic structure and function, Convolutional Network Fabrics are ideal candidates for pruning. We present a series of pruning strategies that can significantly reduce both the final network size and required training time by pruning either entire convolutional filters or individual weights, so that the grid remains visually understandable but that overall execution quality stays within controllable boundaries. Our approach can be iteratively applied during training so that the network complexity decreases rapidly, saving computational time. The paper addresses both data-dependent and dataindependent strategies, and also experimentally establishes the most efficient approaches when training or testing data contain annotation errors.  ( 2 min )
    Continual Learning from Demonstration of Robotic Skills. (arXiv:2202.06843v2 [cs.RO] UPDATED)
    Methods for teaching motion skills to robots focus on training for a single skill at a time. Robots capable of learning from demonstration can considerably benefit from the added ability to learn new movements without forgetting past knowledge. To this end, we propose an approach for continual learning from demonstration using hypernetworks and neural ordinary differential equation solvers. We empirically demonstrate the effectiveness of our approach in remembering long sequences of trajectory learning tasks without the need to store any data from past demonstrations. Our results show that hypernetworks outperform other state-of-the-art regularization-based continual learning approaches for learning from demonstration. In our experiments, we use the popular LASA trajectory benchmark, and a new dataset of kinesthetic demonstrations that we introduce in this paper called the HelloWorld dataset. We evaluate our approach using both trajectory error metrics and continual learning metrics, and we propose two new continual learning metrics. Our code, along with the newly collected dataset, is available at https://github.com/sayantanauddy/clfd.  ( 2 min )
    Algebraic function based Banach space valued ordinary and fractional neural network approximations. (arXiv:2202.07425v1 [stat.ML])
    Here we research the univariate quantitative approximation, ordinary and fractional, of Banach space valued continuous functions on a compact interval or all the real line by quasi-interpolation Banach space valued neural network operators. These approximations are derived by establishing Jackson type inequalities involving the modulus of continuity of the engaged function or its Banach space valued high order derivative of fractional derivatives. Our operators are defined by using a density function generated by an algebraic sigmoid function. The approximations are pointwise and of the uniform norm. The related Banach space valued feed-forward neural networks are with one hidden layer.  ( 2 min )
    A Generalized Hierarchical Nonnegative Tensor Decomposition. (arXiv:2109.14820v2 [cs.LG] UPDATED)
    Nonnegative matrix factorization (NMF) has found many applications including topic modeling and document analysis. Hierarchical NMF (HNMF) variants are able to learn topics at various levels of granularity and illustrate their hierarchical relationship. Recently, nonnegative tensor factorization (NTF) methods have been applied in a similar fashion in order to handle data sets with complex, multi-modal structure. Hierarchical NTF (HNTF) methods have been proposed, however these methods do not naturally generalize their matrix-based counterparts. Here, we propose a new HNTF model which directly generalizes a HNMF model special case, and provide a supervised extension. We also provide a multiplicative updates training method for this model. Our experimental results show that this model more naturally illuminates the topic hierarchy than previous HNMF and HNTF methods.
    A Statistical Learning View of Simple Kriging. (arXiv:2202.07365v1 [stat.ML])
    In the Big Data era, with the ubiquity of geolocation sensors in particular, massive datasets exhibiting a possibly complex spatial dependence structure are becoming increasingly available. In this context, the standard probabilistic theory of statistical learning does not apply directly and guarantees of the generalization capacity of predictive rules learned from such data are left to establish. We analyze here the simple Kriging task, the flagship problem in Geostatistics: the values of a square integrable random field $X=\{X_s\}_{s\in S}$, $S\subset \mathbb{R}^2$, with unknown covariance structure are to be predicted with minimum quadratic risk, based upon observing a single realization of the spatial process at a finite number of locations $s_1,\; \ldots,\; s_n$ in $S$. Despite the connection of this minimization problem with kernel ridge regression, establishing the generalization capacity of empirical risk minimizers is far from straightforward, due to the non i.i.d. nature of the spatial data $X_{s_1},\; \ldots,\; X_{s_n}$ involved. In this article, nonasymptotic bounds of order $O_{\mathbb{P}}(1/n)$ are proved for the excess risk of a plug-in predictive rule mimicking the true minimizer in the case of isotropic stationary Gaussian processes observed at locations forming a regular grid. These theoretical results, as well as the role played by the technical conditions required to establish them, are illustrated by various numerical experiments and hopefully pave the way for further developments in statistical learning based on spatial data.  ( 2 min )
    Predicting cognitive scores with graph neural networks through sample selection learning. (arXiv:2106.09408v3 [cs.LG] UPDATED)
    Analyzing the relation between intelligence and neural activity is of the utmost importance in understanding the working principles of the human brain in health and disease. In existing literature, functional brain connectomes have been used successfully to predict cognitive measures such as intelligence quotient (IQ) scores in both healthy and disordered cohorts using machine learning models. However, existing methods resort to flattening the brain connectome (i.e., graph) through vectorization which overlooks its topological properties. To address this limitation and inspired from the emerging graph neural networks (GNNs), we design a novel regression GNN model (namely RegGNN) for predicting IQ scores from brain connectivity. On top of that, we introduce a novel, fully modular sample selection method to select the best samples to learn from for our target prediction task. However, since such deep learning architectures are computationally expensive to train, we further propose a \emph{learning-based sample selection} method that learns how to choose the training samples with the highest expected predictive power on unseen samples. For this, we capitalize on the fact that connectomes (i.e., their adjacency matrices) lie in the symmetric positive definite (SPD) matrix cone. Our results on full-scale and verbal IQ prediction outperforms comparison methods in autism spectrum disorder cohorts and achieves a competitive performance for neurotypical subjects using 3-fold cross-validation. Furthermore, we show that our sample selection approach generalizes to other learning-based methods, which shows its usefulness beyond our GNN architecture.  ( 2 min )
    Resilience from Diversity: Population-based approach to harden models against adversarial attacks. (arXiv:2111.10272v2 [cs.LG] UPDATED)
    Traditional deep learning networks (DNN) exhibit intriguing vulnerabilities that allow an attacker to force them to fail at their task. Notorious attacks such as the Fast Gradient Sign Method (FGSM) and the more powerful Projected Gradient Descent (PGD) generate adversarial samples by adding a magnitude of perturbation $\epsilon$ to the input's computed gradient, resulting in a deterioration of the effectiveness of the model's classification. This work introduces a model that is resilient to adversarial attacks. Our model leverages an established mechanism of defense which utilizes randomness and a population of DNNs. More precisely, our model consists of a population of $n$ diverse submodels, each one of them trained to individually obtain a high accuracy for the task at hand, while forced to maintain meaningful differences in their weights. Each time our model receives a classification query, it selects a submodel from its population at random to answer the query. To counter the attack transferability, diversity is introduced and maintained in the population of submodels. Thus introducing the concept of counter linking weights. A Counter-Linked Model (CLM) consists of a population of DNNs of the same architecture where a periodic random similarity examination is conducted during the simultaneous training to guarantee diversity while maintaining accuracy. Though the randomization technique proved to be resilient against adversarial attacks, we show that by retraining the DNNs ensemble or training them from the start with counter linking would enhance the robustness by around 20\% when tested on the MNIST dataset and at least 15\% when tested on the CIFAR-10 dataset. When CLM is coupled with adversarial training, this defense mechanism achieves state-of-the-art robustness.
    HiMA: A Fast and Scalable History-based Memory Access Engine for Differentiable Neural Computer. (arXiv:2202.07275v1 [cs.AR])
    Memory-augmented neural networks (MANNs) provide better inference performance in many tasks with the help of an external memory. The recently developed differentiable neural computer (DNC) is a MANN that has been shown to outperform in representing complicated data structures and learning long-term dependencies. DNC's higher performance is derived from new history-based attention mechanisms in addition to the previously used content-based attention mechanisms. History-based mechanisms require a variety of new compute primitives and state memories, which are not supported by existing neural network (NN) or MANN accelerators. We present HiMA, a tiled, history-based memory access engine with distributed memories in tiles. HiMA incorporates a multi-mode network-on-chip (NoC) to reduce the communication latency and improve scalability. An optimal submatrix-wise memory partition strategy is applied to reduce the amount of NoC traffic; and a two-stage usage sort method leverages distributed tiles to improve computation speed. To make HiMA fundamentally scalable, we create a distributed version of DNC called DNC-D to allow almost all memory operations to be applied to local memories with trainable weighted summation to produce the global memory output. Two approximation techniques, usage skimming and softmax approximation, are proposed to further enhance hardware efficiency. HiMA prototypes are created in RTL and synthesized in a 40nm technology. By simulations, HiMA running DNC and DNC-D demonstrates 6.47x and 39.1x higher speed, 22.8x and 164.3x better area efficiency, and 6.1x and 61.2x better energy efficiency over the state-of-the-art MANN accelerator. Compared to an Nvidia 3080Ti GPU, HiMA demonstrates speedup by up to 437x and 2,646x when running DNC and DNC-D, respectively.
    Score-Based Generative Models for Robust Channel Estimation. (arXiv:2111.08177v2 [eess.SP] UPDATED)
    Channel estimation is a critical task in digital communications that greatly impacts end-to-end system performance. In this work, we introduce a novel approach for multiple-input multiple-output (MIMO) channel estimation using score-based generative models. Our method uses a deep neural network that is trained to estimate the gradient of the log-prior of wireless channels at any point in high-dimensional space, and leverages this model to solve channel estimation via posterior sampling. We train a score-based model on channel realizations from the CDL-D model for two antenna spacings and show that the approach leads to competitive in- and out-of-distribution performance when compared to generative adversarial network (GAN) and compressed sensing (CS) methods. When tested on CDL-D channels, the approach leads to a gain of at least $5$ dB in channel estimation error compared to GAN methods in-distribution at $\lambda/2$ antenna spacing. When tested on CDL-C channels which are never seen during training or fine-tuned on, the approach leads to end-to-end coded performance gains of up to $3$ dB compared to CS methods and losses of only $0.5$ dB compared to ideal channel knowledge.  ( 2 min )
    Don't stop the training: continuously-updating self-supervised algorithms best account for auditory responses in the cortex. (arXiv:2202.07290v1 [q-bio.NC])
    Over the last decade, numerous studies have shown that deep neural networks exhibit sensory representations similar to those of the mammalian brain, in that their activations linearly map onto cortical responses to the same sensory inputs. However, it remains unknown whether these artificial networks also learn like the brain. To address this issue, we analyze the brain responses of two ferret auditory cortices recorded with functional UltraSound imaging (fUS), while the animals were presented with 320 10\,s sounds. We compare these brain responses to the activations of Wav2vec 2.0, a self-supervised neural network pretrained with 960\,h of speech, and input with the same 320 sounds. Critically, we evaluate Wav2vec 2.0 under two distinct modes: (i) "Pretrained", where the same model is used for all sounds, and (ii) "Continuous Update", where the weights of the pretrained model are modified with back-propagation after every sound, presented in the same order as the ferrets. Our results show that the Continuous-Update mode leads Wav2Vec 2.0 to generate activations that are more similar to the brain than a Pretrained Wav2Vec 2.0 or than other control models using different training modes. These results suggest that the trial-by-trial modifications of self-supervised algorithms induced by back-propagation aligns with the corresponding fluctuations of cortical responses to sounds. Our finding thus provides empirical evidence of a common learning mechanism between self-supervised models and the mammalian cortex during sound processing.  ( 2 min )
    Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. (arXiv:2202.07654v1 [cs.CL])
    The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and is typically addressed by extending over exact match (EM) with predefined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures. To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and collect over 26K human judgements for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as false impression of graduality, missing dependence on question, and more. Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching BEM measure to approximate this task. Being a simpler task than QA, we find BEM to provide significantly better AE approximations than F1, and more accurately reflect the performance of systems. Finally, we also demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to 2.6 times.
    MGCVAE: Multi-objective Inverse Design via Molecular Graph Conditional Variational Autoencoder. (arXiv:2202.07476v1 [cs.LG])
    The ultimate goal of various fields is to directly generate molecules with desired properties, such as finding water-soluble molecules in drug development and finding molecules suitable for organic light-emitting diode (OLED) or photosensitizers in the field of development of new organic materials. In this respect, this study proposes a molecular graph generative model based on the autoencoder for de novo design. The performance of molecular graph conditional variational autoencoder (MGCVAE) for generating molecules having specific desired properties is investigated by comparing it to molecular graph variational autoencoder (MGVAE). Furthermore, multi-objective optimization for MGCVAE was applied to satisfy two selected properties simultaneously. In this study, two physical properties -- logP and molar refractivity -- were used as optimization targets for the purpose of designing de novo molecules, especially in drug discovery. As a result, it was confirmed that among generated molecules, 25.89% optimized molecules were generated in MGCVAE compared to 0.66% in MGVAE. Hence, it demonstrates that MGCVAE effectively produced drug-like molecules with two target properties. The results of this study suggest that these graph-based data-driven models are one of the effective methods of designing new molecules that fulfill various physical properties, such as drug discovery.  ( 2 min )
    Geometrically Equivariant Graph Neural Networks: A Survey. (arXiv:2202.07230v1 [cs.LG])
    Many scientific problems require to process data in the form of geometric graphs. Unlike generic graph data, geometric graphs exhibit symmetries of translations, rotations, and/or reflections. Researchers have leveraged such inductive bias and developed geometrically equivariant Graph Neural Networks (GNNs) to better characterize the geometry and topology of geometric graphs. Despite fruitful achievements, it still lacks a survey to depict how equivariant GNNs are progressed, which in turn hinders the further development of equivariant GNNs. To this end, based on the necessary but concise mathematical preliminaries, we analyze and classify existing methods into three groups regarding how the message passing and aggregation in GNNs are represented. We also summarize the benchmarks as well as the related datasets to facilitate later researches for methodology development and experimental evaluation. The prospect for future potential directions is also provided.
    Defending against Reconstruction Attacks with R\'enyi Differential Privacy. (arXiv:2202.07623v1 [cs.LG])
    Reconstruction attacks allow an adversary to regenerate data samples of the training set using access to only a trained model. It has been recently shown that simple heuristics can reconstruct data samples from language models, making this threat scenario an important aspect of model release. Differential privacy is a known solution to such attacks, but is often used with a relatively large privacy budget (epsilon > 8) which does not translate to meaningful guarantees. In this paper we show that, for a same mechanism, we can derive privacy guarantees for reconstruction attacks that are better than the traditional ones from the literature. In particular, we show that larger privacy budgets do not protect against membership inference, but can still protect extraction of rare secrets. We show experimentally that our guarantees hold against various language models, including GPT-2 finetuned on Wikitext-103.
    Learning to Solve Routing Problems via Distributionally Robust Optimization. (arXiv:2202.07241v1 [cs.LG])
    Recent deep models for solving routing problems always assume a single distribution of nodes for training, which severely impairs their cross-distribution generalization ability. In this paper, we exploit group distributionally robust optimization (group DRO) to tackle this issue, where we jointly optimize the weights for different groups of distributions and the parameters for the deep model in an interleaved manner during training. We also design a module based on convolutional neural network, which allows the deep model to learn more informative latent pattern among the nodes. We evaluate the proposed approach on two types of well-known deep models including GCN and POMO. The experimental results on the randomly synthesized instances and the ones from two benchmark dataset (i.e., TSPLib and CVRPLib) demonstrate that our approach could significantly improve the cross-distribution generalization performance over the original models.
    QuadSim: A Quadcopter Rotational Dynamics Simulation Framework For Reinforcement Learning Algorithms. (arXiv:2202.07021v1 [cs.LG])
    This study focuses on designing and developing a mathematically based quadcopter rotational dynamics simulation framework for testing reinforcement learning (RL) algorithms in many flexible configurations. The design of the simulation framework aims to simulate both linear and nonlinear representations of a quadcopter by solving initial value problems for ordinary differential equation (ODE) systems. In addition, the simulation environment is capable of making the simulation deterministic/stochastic by adding random Gaussian noise in the forms of process and measurement noises. In order to ensure that the scope of this simulation environment is not limited only with our own RL algorithms, the simulation environment has been expanded to be compatible with the OpenAI Gym toolkit. The framework also supports multiprocessing capabilities to run simulation environments simultaneously in parallel. To test these capabilities, many state-of-the-art deep RL algorithms were trained in this simulation framework and the results were compared in detail.  ( 2 min )
    A Unified Framework for Masked and Mask-Free Face Recognition via Feature Rectification. (arXiv:2202.07358v1 [cs.CV])
    Face recognition under ideal conditions is now considered a well-solved problem with advances in deep learning. Recognizing faces under occlusion, however, still remains a challenge. Existing techniques often fail to recognize faces with both the mouth and nose covered by a mask, which is now very common under the COVID-19 pandemic. Common approaches to tackle this problem include 1) discarding information from the masked regions during recognition and 2) restoring the masked regions before recognition. Very few works considered the consistency between features extracted from masked faces and from their mask-free counterparts. This resulted in models trained for recognizing masked faces often showing degraded performance on mask-free faces. In this paper, we propose a unified framework, named Face Feature Rectification Network (FFR-Net), for recognizing both masked and mask-free faces alike. We introduce rectification blocks to rectify features extracted by a state-of-the-art recognition model, in both spatial and channel dimensions, to minimize the distance between a masked face and its mask-free counterpart in the rectified feature space. Experiments show that our unified framework can learn a rectified feature space for recognizing both masked and mask-free faces effectively, achieving state-of-the-art results. Project code: https://github.com/haoosz/FFR-Net
    SpeechPainter: Text-conditioned Speech Inpainting. (arXiv:2202.07273v1 [cs.SD])
    We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions, and generalizing to unseen speakers. Our approach significantly outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.
    Information-Theoretic Analysis of Minimax Excess Risk. (arXiv:2202.07537v1 [cs.LG])
    Two main concepts studied in machine learning theory are generalization gap (difference between train and test error) and excess risk (difference between test error and the minimum possible error). While information-theoretic tools have been used extensively to study the generalization gap of learning algorithms, the information-theoretic nature of excess risk has not yet been fully investigated. In this paper, some steps are taken toward this goal. We consider the frequentist problem of minimax excess risk as a zero-sum game between algorithm designer and the world. Then, we argue that it is desirable to modify this game in a way that the order of play can be swapped. We prove that, under some regularity conditions, if the world and designer can play randomly the duality gap is zero and the order of play can be changed. In this case, a Bayesian problem surfaces in the dual representation. This makes it possible to utilize recent information-theoretic results on minimum excess risk in Bayesian learning to provide bounds on the minimax excess risk. We demonstrate the applicability of the results by providing information theoretic insight on two important classes of problems: classification when the hypothesis space has finite VC-dimension, and regularized least squares.
    Exploratory State Representation Learning. (arXiv:2109.13596v2 [cs.LG] UPDATED)
    Not having access to compact and meaningful representations is known to significantly increase the complexity of reinforcement learning (RL). For this reason, it can be useful to perform state representation learning (SRL) before tackling RL tasks. However, obtaining a good state representation can only be done if a large diversity of transitions is observed, which can require a difficult exploration, especially if the environment is initially reward-free. To solve the problems of exploration and SRL in parallel, we propose a new approach called XSRL (eXploratory State Representation Learning). On one hand, it jointly learns compact state representations and a state transition estimator which is used to remove unexploitable information from the representations. On the other hand, it continuously trains an inverse model, and adds to the prediction error of this model a $k$-step learning progress bonus to form the maximization objective of a discovery policy. This results in a policy that seeks complex transitions from which the trained models can effectively learn. Our experimental results show that the approach leads to efficient exploration in challenging environments with image observations, and to state representations that significantly accelerate learning in RL tasks.
    From Distributed Machine Learning to Federated Learning: A Survey. (arXiv:2104.14362v3 [cs.DC] UPDATED)
    In recent years, data and computing resources are typically distributed in the devices of end users, various regions or organizations. Because of laws or regulations, the distributed data and computing resources cannot be directly shared among different regions or organizations for machine learning tasks. Federated learning emerges as an efficient approach to exploit distributed data and computing resources, so as to collaboratively train machine learning models, while obeying the laws and regulations and ensuring data security and data privacy. In this paper, we provide a comprehensive survey of existing works for federated learning. We propose a functional architecture of federated learning systems and a taxonomy of related techniques. Furthermore, we present the distributed training, data communication, and security of FL systems. Finally, we analyze their limitations and propose future research directions.
    Realistic Counterfactual Explanations by Learned Relations. (arXiv:2202.07356v1 [stat.ML])
    Many existing methods of counterfactual explanations ignore the intrinsic relationships between data attributes and thus fail to generate realistic counterfactuals. Moreover, the existing methods that account for relationships between data attributes require domain knowledge, which limits their applicability in complex real-world applications. In this paper, we propose a novel approach to realistic counterfactual explanations that preserve relationships between data attributes. The model directly learns the relationships by a variational auto-encoder without domain knowledge and then learns to disturb the latent space accordingly. We conduct extensive experiments on both synthetic and real-world datasets. The results demonstrate that the proposed method learns relationships from the data and preserves these relationships in generated counterfactuals.
    Learned Turbulence Modelling with Differentiable Fluid Solvers. (arXiv:2202.06988v1 [physics.flu-dyn])
    In this paper, we train turbulence models based on convolutional neural networks. These learned turbulence models improve under-resolved low resolution solutions to the incompressible Navier-Stokes equations at simulation time. Our method involves the development of a differentiable numerical solver that supports the propagation of optimisation gradients through multiple solver steps. We showcase the significance of this property by demonstrating the superior stability and accuracy of those models that featured a higher number of unrolled steps during training. This approach is applied to three two-dimensional turbulence flow scenarios, a homogeneous decaying turbulence case, a temporally evolving mixing layer and a spatially evolving mixing layer. Our method achieves significant improvements of long-term \textit{a-posteriori} statistics when compared to no-model simulations, without requiring these statistics to be directly included in the learning targets. At inference time, our proposed method also gains substantial performance improvements over similarly accurate, purely numerical methods.
    Multi-style Training for South African Call Centre Audio. (arXiv:2202.07219v1 [cs.SD])
    Mismatched data is a challenging problem for automatic speech recognition (ASR) systems. One of the most common techniques used to address mismatched data is multi-style training (MTR), a form of data augmentation that attempts to transform the training data to be more representative of the testing data; and to learn robust representations applicable to different conditions. This task can be very challenging if the test conditions are unknown. We explore the impact of different MTR styles on system performance when testing conditions are different from training conditions in the context of deep neural network hidden Markov model (DNN-HMM) ASR systems. A controlled environment is created using the LibriSpeech corpus, where we isolate the effect of different MTR styles on final system performance. We evaluate our findings on a South African call centre dataset that contains noisy, WAV49-encoded audio.
    Deep Generative model with Hierarchical Latent Factors for Time Series Anomaly Detection. (arXiv:2202.07586v1 [cs.LG])
    Multivariate time series anomaly detection has become an active area of research in recent years, with Deep Learning models outperforming previous approaches on benchmark datasets. Among reconstruction-based models, most previous work has focused on Variational Autoencoders and Generative Adversarial Networks. This work presents DGHL, a new family of generative models for time series anomaly detection, trained by maximizing the observed likelihood by posterior sampling and alternating back-propagation. A top-down Convolution Network maps a novel hierarchical latent space to time series windows, exploiting temporal dynamics to encode information efficiently. Despite relying on posterior sampling, it is computationally more efficient than current approaches, with up to 10x shorter training times than RNN based models. Our method outperformed current state-of-the-art models on four popular benchmark datasets. Finally, DGHL is robust to variable features between entities and accurate even with large proportions of missing values, settings with increasing relevance with the advent of IoT. We demonstrate the superior robustness of DGHL with novel occlusion experiments in this literature. Our code is available at https://github.com/cchallu/dghl.
    Flexible learning of quantum states with generative query neural networks. (arXiv:2202.06804v1 [quant-ph] CROSS LISTED)
    Deep neural networks are a powerful tool for characterizing quantum states. In this task, neural networks are typically trained with measurement data gathered from the quantum state to be characterized. But is it possible to train a neural network in a general-purpose way, which makes it applicable to multiple unknown quantum states? Here we show that learning across multiple quantum states and different measurement settings can be achieved by a generative query neural network, a type of neural network originally used in the classical domain for learning 3D scenes from 2D pictures. Our network can be trained offline with classically simulated data, and later be used to characterize unknown quantum states from real experimental data. With little guidance of quantum physics, the network builds its own data-driven representation of quantum states, and then uses it to predict the outcome probabilities of requested quantum measurements on the states of interest. This approach can be applied to state learning scenarios where quantum measurement settings are not informationally complete and predictions must be given in real time, as experimental data become available, as well as to adversarial scenarios where measurement choices and prediction requests are designed to expose learning inaccuracies. The internal representation produced by the network can be used for other tasks beyond state characterization, including clustering of states and prediction of physical properties. The features of our method are illustrated on many-qubit ground states of Ising model and continuous-variable non-Gaussian states.
    Label-Wise Message Passing Graph Neural Network on Heterophilic Graphs. (arXiv:2110.08128v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved remarkable performance in modeling graphs for various applications. However, most existing GNNs assume the graphs exhibit strong homophily in node labels, i.e., nodes with similar labels are connected in the graphs. They fail to generalize to heterophilic graphs where linked nodes may have dissimilar labels and attributes. Therefore, in this paper, we investigate a novel framework that performs well on graphs with either homophily or heterophily. More specifically, we propose a label-wise message passing mechanism to avoid the negative effects caused by aggregating dissimilar node representations and preserve the heterophilic contexts for representation learning. We further propose a bi-level optimization method to automatically select the model for graphs with homophily/heterophily. Theoretical analysis and extensive experiments demonstrate the effectiveness of our proposed framework for node classification on both homophilic and heterophilic graphs.
    Learning Disentangled Behaviour Patterns for Wearable-based Human Activity Recognition. (arXiv:2202.07260v1 [cs.LG])
    In wearable-based human activity recognition (HAR) research, one of the major challenges is the large intra-class variability problem. The collected activity signal is often, if not always, coupled with noises or bias caused by personal, environmental, or other factors, making it difficult to learn effective features for HAR tasks, especially when with inadequate data. To address this issue, in this work, we proposed a Behaviour Pattern Disentanglement (BPD) framework, which can disentangle the behavior patterns from the irrelevant noises such as personal styles or environmental noises, etc. Based on a disentanglement network, we designed several loss functions and used an adversarial training strategy for optimization, which can disentangle activity signals from the irrelevant noises with the least dependency (between them) in the feature space. Our BPD framework is flexible, and it can be used on top of existing deep learning (DL) approaches for feature refinement. Extensive experiments were conducted on four public HAR datasets, and the promising results of our proposed BPD scheme suggest its flexibility and effectiveness. This is an open-source project, and the code can be found at this http URL
    NeuPL: Neural Population Learning. (arXiv:2202.07415v1 [cs.AI])
    Learning in strategy games (e.g. StarCraft, poker) requires the discovery of diverse policies. This is often achieved by iteratively training new policies against existing ones, growing a policy population that is robust to exploit. This iterative approach suffers from two issues in real-world games: a) under finite budget, approximate best-response operators at each iteration needs truncating, resulting in under-trained good-responses populating the population; b) repeated learning of basic skills at each iteration is wasteful and becomes intractable in the presence of increasingly strong opponents. In this work, we propose Neural Population Learning (NeuPL) as a solution to both issues. NeuPL offers convergence guarantees to a population of best-responses under mild assumptions. By representing a population of policies within a single conditional model, NeuPL enables transfer learning across policies. Empirically, we show the generality, improved performance and efficiency of NeuPL across several test domains. Most interestingly, we show that novel strategies become more accessible, not less, as the neural population expands.
    Robust Multi-Objective Bayesian Optimization Under Input Noise. (arXiv:2202.07549v1 [cs.LG])
    Bayesian optimization (BO) is a sample-efficient approach for tuning design parameters to optimize expensive-to-evaluate, black-box performance metrics. In many manufacturing processes, the design parameters are subject to random input noise, resulting in a product that is often less performant than expected. Although BO methods have been proposed for optimizing a single objective under input noise, no existing method addresses the practical scenario where there are multiple objectives that are sensitive to input perturbations. In this work, we propose the first multi-objective BO method that is robust to input noise. We formalize our goal as optimizing the multivariate value-at-risk (MVaR), a risk measure of the uncertain objectives. Since directly optimizing MVaR is computationally infeasible in many settings, we propose a scalable, theoretically-grounded approach for optimizing MVaR using random scalarizations. Empirically, we find that our approach significantly outperforms alternative methods and efficiently identifies optimal robust designs that will satisfy specifications across multiple metrics with high probability.
    Improving Human Sperm Head Morphology Classification with Unsupervised Anatomical Feature Distillation. (arXiv:2202.07191v1 [cs.CV])
    With rising male infertility, sperm head morphology classification becomes critical for accurate and timely clinical diagnosis. Recent deep learning (DL) morphology analysis methods achieve promising benchmark results, but leave performance and robustness on the table by relying on limited and possibly noisy class labels. To address this, we introduce a new DL training framework that leverages anatomical and image priors from human sperm microscopy crops to extract useful features without additional labeling cost. Our core idea is to distill sperm head information with reliably-generated pseudo-masks and unsupervised spatial prediction tasks. The predicted foreground masks from this distillation step are then leveraged to regularize and reduce image and label noise in the tuning stage. We evaluate our new approach on two public sperm datasets and achieve state-of-the-art performances (e.g. 65.9% SCIAN accuracy and 96.5% HuSHeM accuracy).
    Probabilistic Time Series Forecasts with Autoregressive Transformation Models. (arXiv:2110.08248v2 [cs.LG] UPDATED)
    Probabilistic forecasting of time series is an important matter in many applications and research fields. In order to draw conclusions from a probabilistic forecast, we must ensure that the model class used to approximate the true forecasting distribution is expressive enough. Yet, characteristics of the model itself, such as its uncertainty or its feature-outcome relationship are not of lesser importance. In this paper, we propose Autoregressive Transformation Models (ATMs), a model class inspired from various research directions to unite expressive distributional forecasts using a semi-parametric distribution assumption with an interpretable model specification and allow for uncertainty quantification based on (asymptotic) Maximum Likelihood theory. We demonstrate the properties of ATMs both theoretically and through empirical evaluation on several simulated and real-world forecasting datasets.
    Strategy Discovery and Mixture in Lifelong Learning from Heterogeneous Demonstration. (arXiv:2202.07014v1 [cs.LG])
    Learning from Demonstration (LfD) approaches empower end-users to teach robots novel tasks via demonstrations of the desired behaviors, democratizing access to robotics. A key challenge in LfD research is that users tend to provide heterogeneous demonstrations for the same task due to various strategies and preferences. Therefore, it is essential to develop LfD algorithms that ensure \textit{flexibility} (the robot adapts to personalized strategies), \textit{efficiency} (the robot achieves sample-efficient adaptation), and \textit{scalability} (robot reuses a concise set of strategies to represent a large amount of behaviors). In this paper, we propose a novel algorithm, Dynamic Multi-Strategy Reward Distillation (DMSRD), which distills common knowledge between heterogeneous demonstrations, leverages learned strategies to construct mixture policies, and continues to improve by learning from all available data. Our personalized, federated, and lifelong LfD architecture surpasses benchmarks in two continuous control problems with an average 77\% improvement in policy returns and 42\% improvement in log likelihood, alongside stronger task reward correlation and more precise strategy rewards.
    DualConv: Dual Convolutional Kernels for Lightweight Deep Neural Networks. (arXiv:2202.07481v1 [cs.CV])
    CNN architectures are generally heavy on memory and computational requirements which makes them infeasible for embedded systems with limited hardware resources. We propose dual convolutional kernels (DualConv) for constructing lightweight deep neural networks. DualConv combines 3$\times$3 and 1$\times$1 convolutional kernels to process the same input feature map channels simultaneously and exploits the group convolution technique to efficiently arrange convolutional filters. DualConv can be employed in any CNN model such as VGG-16 and ResNet-50 for image classification, YOLO and R-CNN for object detection, or FCN for semantic segmentation. In this paper, we extensively test DualConv for classification since these network architectures form the backbones for many other tasks. We also test DualConv for image detection on YOLO-V3. Experimental results show that, combined with our structural innovations, DualConv significantly reduces the computational cost and number of parameters of deep neural networks while surprisingly achieving slightly higher accuracy than the original models in some cases. We use DualConv to further reduce the number of parameters of the lightweight MobileNetV2 by 54% with only 0.68% drop in accuracy on CIFAR-100 dataset. When the number of parameters is not an issue, DualConv increases the accuracy of MobileNetV1 by 4.11% on the same dataset. Furthermore, DualConv significantly improves the YOLO-V3 object detection speed and improves its accuracy by 4.4% on PASCAL VOC dataset.
    Self-Training of Halfspaces with Generalization Guarantees under Massart Mislabeling Noise Model. (arXiv:2111.14427v3 [cs.LG] UPDATED)
    We investigate the generalization properties of a self-training algorithm with halfspaces. The approach learns a list of halfspaces iteratively from labeled and unlabeled training data, in which each iteration consists of two steps: exploration and pruning. In the exploration phase, the halfspace is found sequentially by maximizing the unsigned-margin among unlabeled examples and then assigning pseudo-labels to those that have a distance higher than the current threshold. The pseudo-labeled examples are then added to the training set, and a new classifier is learned. This process is repeated until no more unlabeled examples remain for pseudo-labeling. In the pruning phase, pseudo-labeled samples that have a distance to the last halfspace greater than the associated unsigned-margin are then discarded. We prove that the misclassification error of the resulting sequence of classifiers is bounded and show that the resulting semi-supervised approach never degrades performance compared to the classifier learned using only the initial labeled training set. Experiments carried out on a variety of benchmarks demonstrate the efficiency of the proposed approach compared to state-of-the-art methods.
    Minimax in Geodesic Metric Spaces: Sion's Theorem and Algorithms. (arXiv:2202.06950v1 [math.OC])
    Determining whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable. We take a step towards understanding certain nonconvex-nonconcave minimax problems that do remain tractable. Specifically, we study minimax problems cast in geodesic metric spaces, which provide a vast generalization of the usual convex-concave saddle point problems. The first main result of the paper is a geodesic metric space version of Sion's minimax theorem; we believe our proof is novel and transparent, as it relies on Helly's theorem only. In our second main result, we specialize to geodesically complete Riemannian manifolds: we devise and analyze the complexity of first-order methods for smooth minimax problems.
    Benchmarking Online Sequence-to-Sequence and Character-based Handwriting Recognition from IMU-Enhanced Pens. (arXiv:2202.07036v1 [cs.LG])
    Handwriting is one of the most frequently occurring patterns in everyday life and with it come challenging applications such as handwriting recognition (HWR), writer identification, and signature verification. In contrast to offline HWR that only uses spatial information (i.e., images), online HWR (OnHWR) uses richer spatio-temporal information (i.e., trajectory data or inertial data). While there exist many offline HWR datasets, there is only little data available for the development of OnHWR methods as it requires hardware-integrated pens. This paper presents data and benchmark models for real-time sequence-to-sequence (seq2seq) learning and single character-based recognition. Our data is recorded by a sensor-enhanced ballpoint pen, yielding sensor data streams from triaxial accelerometers, a gyroscope, a magnetometer and a force sensor at 100Hz. We propose a variety of datasets including equations and words for both the writer-dependent and writer-independent tasks. We provide an evaluation benchmark for seq2seq and single character-based HWR using recurrent and temporal convolutional networks and Transformers combined with a connectionist temporal classification (CTC) loss and cross entropy losses. Our methods do not resort to language or lexicon models.
    Damped Online Newton Step for Portfolio Selection. (arXiv:2202.07574v1 [cs.LG])
    We revisit the classic online portfolio selection problem, where at each round a learner selects a distribution over a set of portfolios to allocate its wealth. It is known that for this problem a logarithmic regret with respect to Cover's loss is achievable using the Universal Portfolio Selection algorithm, for example. However, all existing algorithms that achieve a logarithmic regret for this problem have per-round time and space complexities that scale polynomially with the total number of rounds, making them impractical. In this paper, we build on the recent work by Haipeng et al. 2018 and present the first practical online portfolio selection algorithm with a logarithmic regret and whose per-round time and space complexities depend only logarithmically on the horizon. Behind our approach are two key technical novelties of independent interest. We first show that the Damped Online Newton steps can approximate mirror descent iterates well, even when dealing with time-varying regularizers. Second, we present a new meta-algorithm that achieves an adaptive logarithmic regret (i.e. a logarithmic regret on any sub-interval) for mixable losses.
    A Comprehensive Survey and Performance Analysis of Activation Functions in Deep Learning. (arXiv:2109.14545v2 [cs.LG] UPDATED)
    Neural networks have shown tremendous growth in recent years to solve numerous problems. Various types of neural networks have been introduced to deal with different types of problems. However, the main goal of any neural network is to transform the non-linearly separable input data into more linearly separable abstract features using a hierarchy of layers. These layers are combinations of linear and nonlinear functions. The most popular and common non-linearity layers are activation functions (AFs), such as Logistic Sigmoid, Tanh, ReLU, ELU, Swish and Mish. In this paper, a comprehensive overview and survey is presented for AFs in neural networks for deep learning. Different classes of AFs such as Logistic Sigmoid and Tanh based, ReLU based, ELU based, and Learning based are covered. Several characteristics of AFs such as output range, monotonicity, and smoothness are also pointed out. A performance comparison is also performed among 18 state-of-the-art AFs with different networks on different types of data. The insights of AFs are presented to benefit the researchers for doing further research and practitioners to select among different choices. The code used for experimental comparison is released at: \url{https://github.com/shivram1987/ActivationFunctions}.  ( 2 min )
    Artificial Intelligence-Based Analytics for Impacts of COVID-19 and Online Learning on College Students' Mental Health. (arXiv:2202.07441v1 [cs.CY])
    COVID-19, the disease caused by the novel coronavirus (SARS-CoV-2), was first found in Wuhan, China late in the December of 2019. Not long after that the virus spread worldwide and was declared a pandemic by the World Health Organization in March 2020. This caused many changes around the world and in the United States. One of these changes was the shift towards online learning. In this paper, we seek to understand how the COVID-19 pandemic and online learning impact college students' emotional wellbeing. To do this we use several machine learning and statistical models to analyze data collected by the Faculty of Public Administration at the University of Ljubljana, Slovenia in conjunction with an international consortium of universities, other higher education institutions and students' associations. Our results indicate that learning modality (face-to-face, online synchronous, online asynchronous, etc.) is the main predictor of students' emotional wellbeing, followed by financial security. Factors such as satisfaction with their university's and government's handling of the pandemic are also important predictors.
    Two Sides of the Same Coin: Heterophily and Oversmoothing in Graph Convolutional Neural Networks. (arXiv:2102.06462v6 [cs.LG] UPDATED)
    In node classification tasks, heterophily and oversmoothing are two problems that can hurt the performance of graph convolutional neural networks (GCNs). The heterophily problem refers to the model's inability to handle heterophilous graphs where neighboring nodes belong to different classes; the oversmoothing problem refers to the model's degenerated performance with increasing number of layers. These two seemingly unrelated problems have been studied mostly independently, but there is recent empirical evidence that solving one problem may benefit the other. In this work, beyond empirical observations, we aim to: (1) analyze the heterophily and oversmoothing problems from a unified theoretical perspective, (2) identify the common causes of the two problems, and (3) propose simple yet effective strategies to address the common causes. In our theoretical analysis, we show that the common causes of the heterophily and oversmoothing problems--namely, the relative degree of a node and its heterophily level--trigger the node representations in consecutive layers to "move" closer to the original decision boundary, which increases the misclassification rate of node labels under certain constraints. We theoretically show that: (1) Nodes with high heterophily have a higher misclassification rate. (2) Even with low heterophily, degree disparity in a node's neighborhood can influence the movements of node representations and result in a "pseudo-heterophily" situation, which helps to explain oversmoothing. (3) Allowing not only positive but also negative messages during message passing can help counteract the common causes of the two problems. Based on our theoretical insights, we propose simple modifications to the GCN architecture (i.e., learned degree corrections and signed messages), and we show that they alleviate the heteorophily and oversmoothing problems with experiments on 9 networks.
    Track Boosting and Synthetic Data Aided Drone Detection. (arXiv:2111.12389v3 [cs.CV] UPDATED)
    This is the paper for the first place winning solution of the Drone vs. Bird Challenge, organized by AVSS 2021. As the usage of drones increases with lowered costs and improved drone technology, drone detection emerges as a vital object detection task. However, detecting distant drones under unfavorable conditions, namely weak contrast, long-range, low visibility, requires effective algorithms. Our method approaches the drone detection problem by fine-tuning a YOLOv5 model with real and synthetically generated data using a Kalman-based object tracker to boost detection confidence. Our results indicate that augmenting the real data with an optimal subset of synthetic data can increase the performance. Moreover, temporal information gathered by object tracking methods can increase performance further.  ( 2 min )
    Exploring Deep Reinforcement Learning-Assisted Federated Learning for Online Resource Allocation in EdgeIoT. (arXiv:2202.07391v1 [cs.LG])
    Federated learning (FL) has been increasingly considered to preserve data training privacy from eavesdropping attacks in mobile edge computing-based Internet of Thing (EdgeIoT). On the one hand, the learning accuracy of FL can be improved by selecting the IoT devices with large datasets for training, which gives rise to a higher energy consumption. On the other hand, the energy consumption can be reduced by selecting the IoT devices with small datasets for FL, resulting in a falling learning accuracy. In this paper, we formulate a new resource allocation problem for EdgeIoT to balance the learning accuracy of FL and the energy consumption of the IoT device. We propose a new federated learning-enabled twin-delayed deep deterministic policy gradient (FLDLT3) framework to achieve the optimal accuracy and energy balance in a continuous domain. Furthermore, long short term memory (LSTM) is leveraged in FL-DLT3 to predict the time-varying network state while FL-DLT3 is trained to select the IoT devices and allocate the transmit power. Numerical results demonstrate that the proposed FL-DLT3 achieves fast convergence (less than 100 iterations) while the FL accuracy-to-energy consumption ratio is improved by 51.8% compared to existing state-of-the-art benchmark.  ( 2 min )
    A Survey on Graph Structure Learning: Progress and Opportunities. (arXiv:2103.03036v2 [cs.LG] UPDATED)
    Graphs are widely used to describe real-world objects and their interactions. Graph Neural Networks (GNNs) as a de facto model for analyzing graphstructured data, are highly sensitive to the quality of the given graph structures. Therefore, noisy or incomplete graphs often lead to unsatisfactory representations and prevent us from fully understanding the mechanism underlying the system. In pursuit of an optimal graph structure for downstream tasks, recent studies have sparked an effort around the central theme of Graph Structure Learning (GSL), which aims to jointly learn an optimized graph structure and corresponding graph representations. In the presented survey, we broadly review recent progress in GSL methods. Specifically, we first formulate a general pipeline of GSL and review state-of-the-art methods classified by the way of modeling graph structures, followed by applications of GSL across domains. Finally, we point out some issues in current studies and discuss future directions.  ( 2 min )
    Coding for Straggler Mitigation in Federated Learning. (arXiv:2109.15226v2 [cs.LG] UPDATED)
    We present a novel coded federated learning (FL) scheme for linear regression that mitigates the effect of straggling devices while retaining the privacy level of conventional FL. The proposed scheme combines one-time padding to preserve privacy and gradient codes to yield resiliency against stragglers and consists of two phases. In the first phase, the devices share a one-time padded version of their local data with a subset of other devices. In the second phase, the devices and the central server collaboratively and iteratively train a global linear model using gradient codes on the one-time padded local data. To apply one-time padding to real data, our scheme exploits a fixed-point arithmetic representation of the data. Unlike the coded FL scheme recently introduced by Prakash \emph{et al.}, the proposed scheme maintains the same level of privacy as conventional FL while achieving a similar training time. Compared to conventional FL, we show that the proposed scheme achieves a training speed-up factor of $6.6$ and $9.2$ on the MNIST and Fashion-MNIST datasets for an accuracy of $95\%$ and $85\%$, respectively.  ( 2 min )
    Explainable COVID-19 Infections Identification and Delineation Using Calibrated Pseudo Labels. (arXiv:2202.07422v1 [eess.IV])
    The upheaval brought by the arrival of the COVID-19 pandemic has continued to bring fresh challenges over the past two years. During this COVID-19 pandemic, there has been a need for rapid identification of infected patients and specific delineation of infection areas in computed tomography (CT) images. Although deep supervised learning methods have been established quickly, the scarcity of both image-level and pixellevel labels as well as the lack of explainable transparency still hinder the applicability of AI. Can we identify infected patients and delineate the infections with extreme minimal supervision? Semi-supervised learning (SSL) has demonstrated promising performance under limited labelled data and sufficient unlabelled data. Inspired by SSL, we propose a model-agnostic calibrated pseudo-labelling strategy and apply it under a consistency regularization framework to generate explainable identification and delineation results. We demonstrate the effectiveness of our model with the combination of limited labelled data and sufficient unlabelled data or weakly-labelled data. Extensive experiments have shown that our model can efficiently utilize limited labelled data and provide explainable classification and segmentation results for decision-making in clinical routine.  ( 2 min )
    Implicit Bias of Linear Equivariant Networks. (arXiv:2110.06084v2 [cs.LG] UPDATED)
    Group equivariant convolutional neural networks (G-CNNs) are generalizations of convolutional neural networks (CNNs) which excel in a wide range of scientific and technical applications by explicitly encoding group symmetries, such as rotations and permutations, in their architectures. Although the success of G-CNNs is driven by the explicit symmetry bias of their convolutional architecture, a recent line of work has proposed that the implicit bias of training algorithms on a particular parameterization (or architecture) is key to understanding generalization for overparameterized neural nets. In this context, we show that $L$-layer full-width linear G-CNNs trained via gradient descent in a binary classification task converge to solutions with low-rank Fourier matrix coefficients, regularized by the $2/L$-Schatten matrix norm. Our work strictly generalizes previous analysis on the implicit bias of linear CNNs to linear G-CNNs over all finite groups, including the challenging setting of non-commutative symmetry groups (such as permutations). We validate our theorems via experiments on a variety of groups and empirically explore more realistic nonlinear networks, which locally capture similar regularization patterns. Finally, we provide intuitive interpretations of our Fourier space implicit regularization results in real space via uncertainty principles.  ( 2 min )
    Dual Training of Energy-Based Models with Overparametrized Shallow Neural Networks. (arXiv:2107.05134v2 [cs.LG] UPDATED)
    Energy-based models (EBMs) are generative models that are usually trained via maximum likelihood estimation. This approach becomes challenging in generic situations where the trained energy is non-convex, due to the need to sample the Gibbs distribution associated with this energy. Using general Fenchel duality results, we derive variational principles dual to maximum likelihood EBMs with shallow overparametrized neural network energies, both in the feature-learning and lazy linearized regimes. In the feature-learning regime, this dual formulation justifies using a two time-scale gradient ascent-descent (GDA) training algorithm in which one updates concurrently the particles in the sample space and the neurons in the parameter space of the energy. We also consider a variant of this algorithm in which the particles are sometimes restarted at random samples drawn from the data set, and show that performing these restarts at every iteration step corresponds to score matching training. These results are illustrated in simple numerical experiments, which indicates that GDA performs best when features and particles are updated using similar time scales.  ( 2 min )
    Development and evaluation of an Explainable Prediction Model for Chronic Kidney Disease Patients based on Ensemble Trees. (arXiv:2105.10368v3 [cs.LG] UPDATED)
    Chronic Kidney Disease (CKD), where delayed recognition implies premature mortality, is currently experiencing a globally increasing incidence and high cost to health systems. Data mining allows discovering subtle patterns in CKD indicators to contribute to an early diagnosis. This work presents the development and evaluation of an explainable prediction model that would support clinicians in the early diagnosis of CKD patients. The model development is based on a data management pipeline that detects the best combination of ensemble trees algorithms and features selected concerning classification performance. Furthermore, the main contribution of the paper involves an explainability-driven approach that allows selecting the best predictive model maintaining a balance between accuracy and explainability. Therefore, the most balanced explainable predictive model implements an extreme gradient boosting classifier over 3 features (packed cell value, specific gravity, and hypertension), achieving an accuracy of 99.2% and 97.5% with cross-validation technique and with new unseen data respectively. In addition, an analysis of the model's explainability shows that the packed cell value is the most relevant feature that influences the prediction results of the model, followed by specific gravity and hypertension. This small number of feature selected results in a reduced cost of the early diagnosis of CKD implying a promising solution for developing countries.  ( 2 min )
    Personalized Prompt Learning for Explainable Recommendation. (arXiv:2202.07371v1 [cs.IR])
    Providing user-understandable explanations to justify recommendations could help users better understand the recommended items, increase the system's ease of use, and gain users' trust. A typical approach to realize it is natural language generation. However, previous works mostly adopt recurrent neural networks to meet the ends, leaving the potentially more effective pre-trained Transformer models under-explored. In fact, user and item IDs, as important identifiers in recommender systems, are inherently in different semantic space as words that pre-trained models were already trained on. Thus, how to effectively fuse IDs into such models becomes a critical issue. Inspired by recent advancement in prompt learning, we come up with two solutions: find alternative words to represent IDs (called discrete prompt learning), and directly input ID vectors to a pre-trained model (termed continuous prompt learning). In the latter case, ID vectors are randomly initialized but the model is trained in advance on large corpora, so they are actually in different learning stages. To bridge the gap, we further propose two training strategies: sequential tuning and recommendation as regularization. Extensive experiments show that our continuous prompt learning approach equipped with the training strategies consistently outperforms strong baselines on three datasets of explainable recommendation.  ( 2 min )
    Closing the Management Gap for Satellite-Integrated Community Networks: A Hierarchical Approach to Self-Maintenance. (arXiv:2202.07532v1 [cs.NI])
    Community networks (CNs) have become an important paradigm for providing essential Internet connectivity in unserved and underserved areas across the world. However, an indispensable part for CNs is network management, where responsive and autonomous maintenance is much needed. With the technological advancement in telecommunications networks, a classical satellite-dependent CN is envisioned to be transformed into a satellite-integrated CN (SICN), which will embrace significant autonomy, intelligence, and scalability in network management. This article discusses the machine-learning (ML) based hierarchical approach to enabling autonomous self-maintenance for SICNs. The approach is split into the anomaly identification and anomaly mitigation phases, where the related ML methods, data collection means, deployment options, and mitigation schemes are presented. With the case study, we discuss a typical scenario using satellite and fixed connections as backhaul options and show the effectiveness \hl{and performance improvements} of the proposed approach \hl{with recurrent neural network and ensemble methods  ( 2 min )
    Understanding DDPM Latent Codes Through Optimal Transport. (arXiv:2202.07477v1 [stat.ML])
    Diffusion models have recently outperformed alternative approaches to model the distribution of natural images, such as GANs. Such diffusion models allow for deterministic sampling via the probability flow ODE, giving rise to a latent space and an encoder map. While having important practical applications, such as estimation of the likelihood, the theoretical properties of this map are not yet fully understood. In the present work, we partially address this question for the popular case of the VP SDE (DDPM) approach. We show that, perhaps surprisingly, the DDPM encoder map coincides with the optimal transport map for common distributions; we support this claim theoretically and by extensive numerical experiments.  ( 2 min )
    Random Feature Amplification: Feature Learning and Generalization in Neural Networks. (arXiv:2202.07626v1 [cs.LG])
    In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate, refuting the conjecture of Malach and Shalev-Shwartz that 'deeper is better only when shallow is good'. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.  ( 2 min )
    Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others. (arXiv:2102.11938v4 [cs.AI] UPDATED)
    To achieve human-like common sense about everyday life, machine learning systems must understand and reason about the goals, preferences, and actions of other agents in the environment. By the end of their first year of life, human infants intuitively achieve such common sense, and these cognitive achievements lay the foundation for humans' rich and complex understanding of the mental states of others. Can machines achieve generalizable, commonsense reasoning about other agents like human infants? The Baby Intuitions Benchmark (BIB) challenges machines to predict the plausibility of an agent's behavior based on the underlying causes of its actions. Because BIB's content and paradigm are adopted from developmental cognitive science, BIB allows for direct comparison between human and machine performance. Nevertheless, recently proposed, deep-learning-based agency reasoning models fail to show infant-like reasoning, leaving BIB an open challenge.  ( 2 min )
    A machine learning analysis of the relationship between some underlying medical conditions and COVID-19 susceptibility. (arXiv:2112.12901v2 [cs.LG] UPDATED)
    For the past couple years, the Coronavirus, commonly known as COVID-19, has significantly affected the daily lives of all citizens residing in the United States by imposing several, fatal health risks that cannot go unnoticed. In response to the growing fear and danger COVID-19 inflicts upon societies in the USA, several vaccines and boosters have been created as a permanent remedy for individuals to take advantage of. In this paper, we investigate the relationship between the COVID-19 vaccines and boosters and the total case count for the Coronavirus across multiple states in the USA. Additionally, this paper discusses the relationship between several, selected underlying health conditions with COVID-19. To discuss these relationships effectively, this paper will utilize statistical tests and machine learning methods for analysis and discussion purposes. Furthermore, this paper reflects upon conclusions made about the relationship between educational attainment, race, and COVID-19 and the possible connections that can be established with underlying health conditions, vaccination rates, and COVID-19 total case and death counts.  ( 2 min )
    EvoKG: Jointly Modeling Event Time and Network Structure for Reasoning over Temporal Knowledge Graphs. (arXiv:2202.07648v1 [cs.LG])
    How can we perform knowledge reasoning over temporal knowledge graphs (TKGs)? TKGs represent facts about entities and their relations, where each fact is associated with a timestamp. Reasoning over TKGs, i.e., inferring new facts from time-evolving KGs, is crucial for many applications to provide intelligent services. However, despite the prevalence of real-world data that can be represented as TKGs, most methods focus on reasoning over static knowledge graphs, or cannot predict future events. In this paper, we present a problem formulation that unifies the two major problems that need to be addressed for an effective reasoning over TKGs, namely, modeling the event time and the evolving network structure. Our proposed method EvoKG jointly models both tasks in an effective framework, which captures the ever-changing structural and temporal dynamics in TKGs via recurrent event modeling, and models the interactions between entities based on the temporal neighborhood aggregation framework. Further, EvoKG achieves an accurate modeling of event time, using flexible and efficient mechanisms based on neural density estimation. Experiments show that EvoKG outperforms existing methods in terms of effectiveness (up to 77% and 116% more accurate time and link prediction) and efficiency.  ( 2 min )
    Quantifying Memorization Across Neural Language Models. (arXiv:2202.07646v1 [cs.LG])
    Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others). We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model. Surprisingly, we find the situation becomes complicated when generalizing these results across model families. On the whole, we find that memorization in LMs is more prevalent than previously believed and will likely get worse as models continues to scale, at least without active mitigations.  ( 2 min )
    Forecasting Global Weather with Graph Neural Networks. (arXiv:2202.07575v1 [physics.ao-ph])
    We present a data-driven approach for forecasting global weather using graph neural networks. The system learns to step forward the current 3D atmospheric state by six hours, and multiple steps are chained together to produce skillful forecasts going out several days into the future. The underlying model is trained on reanalysis data from ERA5 or forecast data from GFS. Test performance on metrics such as Z500 (geopotential height) and T850 (temperature) improves upon previous data-driven approaches and is comparable to operational, full-resolution, physical models from GFS and ECMWF, at least when evaluated on 1-degree scales and when using reanalysis initial conditions. We also show results from connecting this data-driven model to live, operational forecasts from GFS.  ( 2 min )
    Stein Latent Optimization for Generative Adversarial Networks. (arXiv:2106.05319v5 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) with clustered latent spaces can perform conditional generation in a completely unsupervised manner. In the real world, the salient attributes of unlabeled data can be imbalanced. However, most of existing unsupervised conditional GANs cannot cluster attributes of these data in their latent spaces properly because they assume uniform distributions of the attributes. To address this problem, we theoretically derive Stein latent optimization that provides reparameterizable gradient estimations of the latent distribution parameters assuming a Gaussian mixture prior in a continuous latent space. Structurally, we introduce an encoder network and novel unsupervised conditional contrastive loss to ensure that data generated from a single mixture component represent a single attribute. We confirm that the proposed method, named Stein Latent Optimization for GANs (SLOGAN), successfully learns balanced or imbalanced attributes and achieves state-of-the-art unsupervised conditional generation performance even in the absence of attribute information (e.g., the imbalance ratio). Moreover, we demonstrate that the attributes to be learned can be manipulated using a small amount of probe data.  ( 2 min )
    A Non-parametric View of FedAvg and FedProx: Beyond Stationary Points. (arXiv:2106.15216v2 [stat.ML] UPDATED)
    Federated Learning (FL) is a promising decentralized learning framework and has great potentials in privacy preservation and in lowering the computation load at the cloud. Recent work showed that FedAvg and FedProx - the two widely-adopted FL algorithms - fail to reach the stationary points of the global optimization objective even for homogeneous linear regression problems. Further, it is concerned that the common model learned might not generalize well locally at all in the presence of heterogeneity. In this paper, we analyze the convergence and statistical efficiency of FedAvg and FedProx, addressing the above two concerns. Our analysis is based on the standard non-parametric regression in a reproducing kernel Hilbert space (RKHS), and allows for heterogeneous local data distributions and unbalanced local datasets. We prove that the estimation errors, measured in either the empirical norm or the RKHS norm, decay with a rate of 1/t in general and exponentially for finite-rank kernels. In certain heterogeneous settings, these upper bounds also imply that both FedAvg and FedProx achieve the optimal error rate. To further analytically quantify the impact of the heterogeneity at each client, we propose and characterize a novel notion-federation gain, defined as the reduction of the estimation error for a client to join the FL. We discover that when the data heterogeneity is moderate, a client with limited local data can benefit from a common model with a large federation gain. Numerical experiments further corroborate our theoretical findings.  ( 2 min )
    Deep learning and differential equations for modeling changes in individual-level latent dynamics between observation periods. (arXiv:2202.07403v1 [stat.ML])
    When modeling longitudinal biomedical data, often dimensionality reduction as well as dynamic modeling in the resulting latent representation is needed. This can be achieved by artificial neural networks for dimension reduction, and differential equations for dynamic modeling of individual-level trajectories. However, such approaches so far assume that parameters of individual-level dynamics are constant throughout the observation period. Motivated by an application from psychological resilience research, we propose an extension where different sets of differential equation parameters are allowed for observation sub-periods. Still, estimation for intra-individual sub-periods is coupled for being able to fit the model also with a relatively small dataset. We subsequently derive prediction targets from individual dynamic models of resilience in the application. These serve as interpretable resilience-related outcomes, to be predicted from characteristics of individuals, measured at baseline and a follow-up time point, and selecting a small set of important predictors. Our approach is seen to successfully identify individual-level parameters of dynamic models that allows us to stably select predictors, i.e., resilience factors. Furthermore, we can identify those characteristics of individuals that are the most promising for updates at follow-up, which might inform future study design. This underlines the usefulness of our proposed deep dynamic modeling approach with changes in parameters between observation sub-periods.  ( 2 min )
    Unsupervised Learning of Group Invariant and Equivariant Representations. (arXiv:2202.07559v1 [cs.LG])
    Equivariant neural networks, whose hidden features transform according to representations of a group G acting on the data, exhibit training efficiency and an improved generalisation performance. In this work, we extend group invariant and equivariant representation learning to the field of unsupervised deep learning. We propose a general learning strategy based on an encoder-decoder framework in which the latent representation is disentangled in an invariant term and an equivariant group action component. The key idea is that the network learns the group action on the data space and thus is able to solve the reconstruction task from an invariant data representation, hence avoiding the necessity of ad-hoc group-specific implementations. We derive the necessary conditions on the equivariant encoder, and we present a construction valid for any G, both discrete and continuous. We describe explicitly our construction for rotations, translations and permutations. We test the validity and the robustness of our approach in a variety of experiments with diverse data types employing different network architectures.  ( 2 min )
    Can Online Customer Reviews Help Design More Sustainable Products? A Preliminary Study on Amazon Climate Pledge Friendly Products. (arXiv:2202.07463v1 [cs.CY])
    Online product reviews are a valuable resource for product developers to improve the design of their products. Yet, the potential value of customer feedback to improve the sustainability performance of products is still to be exploited. The present paper investigates and analyzes Amazon product reviews to bring new light on the following question: ``What sustainable design insights can be identified or interpreted from online product reviews?''. To do so, the top 100 reviews, evenly distributed by star ratings, for three product categories (laptop, printer, cable) are collected, manually annotated, analyzed and interpreted. For each product category, the reviews of two similar products (one with environmental certification and one standard version) are compared and combined to come up with sustainable design solutions. In all, for the six products considered, between 12% and 20% of the reviews mentioned directly or indirectly aspects or attributes that could be exploited to improve the design of these products from a sustainability perspective. Concrete examples of sustainable design leads that could be elicited from product reviews are given and discussed. As such, this contribution provides a baseline for future work willing to automate this process to gain further insights from online product reviews. Notably, the deployment of machine learning tools and the use of natural language processing techniques to do so are discussed as promising lines for future research.  ( 2 min )
    One-bit Submission for Locally Private Quasi-MLE: Its Asymptotic Normality and Limitation. (arXiv:2202.07194v1 [stat.ML])
    Local differential privacy~(LDP) is an information-theoretic privacy definition suitable for statistical surveys that involve an untrusted data curator. An LDP version of quasi-maximum likelihood estimator~(QMLE) has been developed, but the existing method to build LDP QMLE is difficult to implement for a large-scale survey system in the real world due to long waiting time, expensive communication cost, and the boundedness assumption of derivative of a log-likelihood function. We provided an alternative LDP protocol without those issues, which is potentially much easily deployable to a large-scale survey. We also provided sufficient conditions for the consistency and asymptotic normality and limitations of our protocol. Our protocol is less burdensome for the users, and the theoretical guarantees cover more realistic cases than those for the existing method.  ( 2 min )
    Contextual Importance and Utility: aTheoretical Foundation. (arXiv:2202.07292v1 [cs.AI])
    This paper provides new theory to support to the eXplainable AI (XAI) method Contextual Importance and Utility (CIU). CIU arithmetic is based on the concepts of Multi-Attribute Utility Theory, which gives CIU a solid theoretical foundation. The novel concept of contextual influence is also defined, which makes it possible to compare CIU directly with so-called additive feature attribution (AFA) methods for model-agnostic outcome explanation. One key takeaway is that the "influence" concept used by AFA methods is inadequate for outcome explanation purposes even for simple models to explain. Experiments with simple models show that explanations using contextual importance (CI) and contextual utility (CU) produce explanations where influence-based methods fail. It is also shown that CI and CU guarantees explanation faithfulness towards the explained model.  ( 2 min )
    OLIVE: Oblivious and Differentially Private Federated Learning on Trusted Execution Environment. (arXiv:2202.07165v1 [cs.LG])
    By combining Federated Learning with Differential Privacy, it has become possible to train deep models while taking privacy into account. Using Local Differential Privacy (LDP) does not require trust in the server, but its utility is limited due to strong gradient perturbations. On the other hand, client-level Central Differential Privacy (CDP) provides a good balance between the privacy and utility of the trained model, but requires trust in the central server since they have to share raw gradients. We propose OLIVE, a system that can benefit from CDP while eliminating the need for trust in the server as LDP achieves, by using Trusted Execution Environment (TEE), which has attracted much attention in recent years. In particular, OLIVE provides an efficient data oblivious algorithm to minimize the privacy risk that can occur during aggregation in a TEE even on a privileged untrusted server. In this work, firstly, we design an inference attack to leak training data privacy from index information of gradients which can be obtained by side channels in a sparsified gradients setting, and demonstrate the attack's effectiveness on real world dataset. Secondly, we propose a fully-oblivious but efficient algorithm that keeps the memory access patterns completely uniform and secure to protect privacy against the designed attack. We also demonstrate that our method works practically by various empirical experiments. Our experimental results show our proposed algorithm is more efficient compared to state-of-the-art general-purpose Oblivious RAM, and can be a practical method in the real-world scales.  ( 2 min )
    Unsupervised word-level prosody tagging for controllable speech synthesis. (arXiv:2202.07200v1 [eess.AS])
    Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to lack of word-level prosody tags. In this work, we propose a novel approach for unsupervised word-level prosody tagging with two stages, where we first group the words into different types with a decision tree according to their phonetic content and then cluster the prosodies using GMM within each type of words separately. This design is based on the assumption that the prosodies of different type of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system with the derived word-level prosody tags is trained for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody.  ( 2 min )
    On the Origins of the Block Structure Phenomenon in Neural Network Representations. (arXiv:2202.07184v1 [cs.LG])
    Recent work has uncovered a striking phenomenon in large-capacity neural networks: they contain blocks of contiguous hidden layers with highly similar representations. This block structure has two seemingly contradictory properties: on the one hand, its constituent layers exhibit highly similar dominant first principal components (PCs), but on the other hand, their representations, and their common first PC, are highly dissimilar across different random seeds. Our work seeks to reconcile these discrepant properties by investigating the origin of the block structure in relation to the data and training methods. By analyzing properties of the dominant PCs, we find that the block structure arises from dominant datapoints - a small group of examples that share similar image statistics (e.g. background color). However, the set of dominant datapoints, and the precise shared image statistic, can vary across random seeds. Thus, the block structure reflects meaningful dataset statistics, but is simultaneously unique to each model. Through studying hidden layer activations and creating synthetic datapoints, we demonstrate that these simple image statistics dominate the representational geometry of the layers inside the block structure. We explore how the phenomenon evolves through training, finding that the block structure takes shape early in training, but the underlying representations and the corresponding dominant datapoints continue to change substantially. Finally, we study the interplay between the block structure and different training mechanisms, introducing a targeted intervention to eliminate the block structure, as well as examining the effects of pretraining and Shake-Shake regularization.  ( 2 min )
    Robust Policy Learning over Multiple Uncertainty Sets. (arXiv:2202.07013v1 [cs.LG])
    Reinforcement learning (RL) agents need to be robust to variations in safety-critical environments. While system identification methods provide a way to infer the variation from online experience, they can fail in settings where fast identification is not possible. Another dominant approach is robust RL which produces a policy that can handle worst-case scenarios, but these methods are generally designed to achieve robustness to a single uncertainty set that must be specified at train time. Towards a more general solution, we formulate the multi-set robustness problem to learn a policy robust to different perturbation sets. We then design an algorithm that enjoys the benefits of both system identification and robust RL: it reduces uncertainty where possible given a few interactions, but can still act robustly with respect to the remaining uncertainty. On a diverse set of control tasks, our approach demonstrates improved worst-case performance on new environments compared to prior methods based on system identification and on robust RL alone.  ( 2 min )
    Interpretable Reinforcement Learning with Multilevel Subgoal Discovery. (arXiv:2202.07414v1 [cs.AI])
    We propose a novel Reinforcement Learning model for discrete environments, which is inherently interpretable and supports the discovery of deep subgoal hierarchies. In the model, an agent learns information about environment in the form of probabilistic rules, while policies for (sub)goals are learned as combinations thereof. No reward function is required for learning; an agent only needs to be given a primary goal to achieve. Subgoals of a goal G from the hierarchy are computed as descriptions of states, which if previously achieved increase the total efficiency of the available policies for G. These state descriptions are introduced as new sensor predicates into the rule language of the agent, which allows for sensing important intermediate states and for updating environment rules and policies accordingly.  ( 2 min )
    Memory via Temporal Delays in weightless Spiking Neural Network. (arXiv:2202.07132v1 [cs.NE])
    A common view in the neuroscience community is that memory is encoded in the connection strength between neurons. This perception led artificial neural network models to focus on connection weights as the key variables to modulate learning. In this paper, we present a prototype for weightless spiking neural networks that can perform a simple classification task. The memory in this network is stored in the timing between neurons, rather than the strength of the connection, and is trained using a Hebbian Spike Timing Dependent Plasticity (STDP), which modulates the delays of the connection.  ( 2 min )
    Federated Contrastive Learning for Dermatological Disease Diagnosis via On-device Learning. (arXiv:2202.07470v1 [cs.LG])
    Deep learning models have been deployed in an increasing number of edge and mobile devices to provide healthcare. These models rely on training with a tremendous amount of labeled data to achieve high accuracy. However, for medical applications such as dermatological disease diagnosis, the private data collected by mobile dermatology assistants exist on distributed mobile devices of patients, and each device only has a limited amount of data. Directly learning from limited data greatly deteriorates the performance of learned models. Federated learning (FL) can train models by using data distributed on devices while keeping the data local for privacy. Existing works on FL assume all the data have ground-truth labels. However, medical data often comes without any accompanying labels since labeling requires expertise and results in prohibitively high labor costs. The recently developed self-supervised learning approach, contrastive learning (CL), can leverage the unlabeled data to pre-train a model, after which the model is fine-tuned on limited labeled data for dermatological disease diagnosis. However, simply combining CL with FL as federated contrastive learning (FCL) will result in ineffective learning since CL requires diverse data for learning but each device only has limited data. In this work, we propose an on-device FCL framework for dermatological disease diagnosis with limited labels. Features are shared in the FCL pre-training process to provide diverse and accurate contrastive information. After that, the pre-trained model is fine-tuned with local labeled data independently on each device or collaboratively with supervised federated learning on all devices. Experiments on dermatological disease datasets show that the proposed framework effectively improves the recall and precision of dermatological disease diagnosis compared with state-of-the-art methods.  ( 2 min )
    Transformers in Time Series: A Survey. (arXiv:2202.07125v1 [cs.LG])
    Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also intrigues great interests in the time series community. Among multiple advantages of transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications. In this paper, we systematically review transformer schemes for time series modeling by highlighting their strengths as well as limitations through a new taxonomy to summarize existing time series transformers in two perspectives. From the perspective of network modifications, we summarize the adaptations of module level and architecture level of the time series transformers. From the perspective of applications, we categorize time series transformers based on common tasks including forecasting, anomaly detection, and classification. Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how Transformers perform in time series. Finally, we discuss and suggest future directions to provide useful research guidance. To the best of our knowledge, this paper is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data. We hope this survey will ignite further research interests in time series Transformers.  ( 2 min )
    Between Stochastic and Adversarial Online Convex Optimization: Improved Regret Bounds via Smoothness. (arXiv:2202.07554v1 [cs.LG])
    Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing adversarially poisoned rounds or shifts in the data distribution. To accomplish this goal, we introduce two key quantities associated with the loss sequence, that we call the cumulative stochastic variance and the adversarial variation. Our upper bounds are attained by instances of optimistic follow the regularized leader, and we design adaptive learning rates that automatically adapt to the cumulative stochastic variance and adversarial variation. In the fully i.i.d. case, our bounds match the rates one would expect from results in stochastic acceleration, and in the fully adversarial case they gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes for the cumulative stochastic variance and the adversarial variation.  ( 2 min )
    Neural Architecture Search for Dense Prediction Tasks in Computer Vision. (arXiv:2202.07242v1 [cs.CV])
    The success of deep learning in recent years has lead to a rising demand for neural network architecture engineering. As a consequence, neural architecture search (NAS), which aims at automatically designing neural network architectures in a data-driven manner rather than manually, has evolved as a popular field of research. With the advent of weight sharing strategies across architectures, NAS has become applicable to a much wider range of problems. In particular, there are now many publications for dense prediction tasks in computer vision that require pixel-level predictions, such as semantic segmentation or object detection. These tasks come with novel challenges, such as higher memory footprints due to high-resolution data, learning multi-scale representations, longer training times, and more complex and larger neural architectures. In this manuscript, we provide an overview of NAS for dense prediction tasks by elaborating on these novel challenges and surveying ways to address them to ease future research and application of existing methods to novel problems.  ( 2 min )
    Identifying equivalent Calabi--Yau topologies: A discrete challenge from math and physics for machine learning. (arXiv:2202.07590v1 [hep-th])
    We review briefly the characteristic topological data of Calabi--Yau threefolds and focus on the question of when two threefolds are equivalent through related topological data. This provides an interesting test case for machine learning methodology in discrete mathematics problems motivated by physics.  ( 2 min )
    Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods. (arXiv:2202.07262v1 [math.OC])
    Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. The success of the method led to several advanced extensions of the classical SGDA, including variants with arbitrary sampling, variance reduction, coordinate randomization, and distributed variants with compression, which were extensively studied in the literature, especially during the last few years. In this paper, we propose a unified convergence analysis that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications and have been developed separately in various communities. A key to our unified framework is a parametric assumption on the stochastic estimates. Via our general theoretical framework, we either recover the sharpest known rates for the known special cases or tighten them. Moreover, to illustrate the flexibility of our approach we develop several new variants of SGDA such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA). Although variants of the new methods are known for solving minimization problems, they were never considered or analyzed for solving min-max problems and VIPs. We also demonstrate the most important properties of the new methods through extensive numerical experiments.  ( 2 min )
    Graph Neural Networks for Graphs with Heterophily: A Survey. (arXiv:2202.07082v1 [cs.LG])
    Recent years have witnessed fast developments of graph neural networks (GNNs) that have benefited myriads of graph analytic tasks and applications. In general, most GNNs depend on the homophily assumption that nodes belonging to the same class are more likely to be connected. However, as a ubiquitous graph property in numerous real-world scenarios, heterophily, i.e., nodes with different labels tend to be linked, significantly limits the performance of tailor-made homophilic GNNs. Hence, \textit{GNNs for heterophilic graphs} are gaining increasing attention in this community. To the best of our knowledge, in this paper, we provide a comprehensive review of GNNs for heterophilic graphs for the first time. Specifically, we propose a systematic taxonomy that essentially governs existing heterophilic GNN models, along with a general summary and detailed analysis. Furthermore, we summarize the mainstream heterophilic graph benchmarks to facilitate robust and fair evaluations. In the end, we point out the potential directions to advance and stimulate future research and applications on heterophilic graphs.  ( 2 min )
    Federated Graph Neural Networks: Overview, Techniques and Challenges. (arXiv:2202.07256v1 [cs.DC])
    With its powerful capability to deal with graph data widely found in practical applications, graph neural networks (GNNs) have received significant research attention. However, as societies become increasingly concerned with data privacy, GNNs face the need to adapt to this new normal. This has led to the rapid development of federated graph neural networks (FedGNNs) research in recent years. Although promising, this interdisciplinary field is highly challenging for interested researchers to enter into. The lack of an insightful survey on this topic only exacerbates this problem. In this paper, we bridge this gap by offering a comprehensive survey of this emerging field. We propose a unique 3-tiered taxonomy of the FedGNNs literature to provide a clear view into how GNNs work in the context of Federated Learning (FL). It puts existing works into perspective by analyzing how graph data manifest themselves in FL settings, how GNN training is performed under different FL system architectures and degrees of graph data overlap across data silo, and how GNN aggregation is performed under various FL settings. Through discussions of the advantages and limitations of existing works, we envision future research directions that can help build more robust, dynamic, efficient, and interpretable FedGNNs.  ( 2 min )
    DeepONet-Grid-UQ: A Trustworthy Deep Operator Framework for Predicting the Power Grid's Post-Fault Trajectories. (arXiv:2202.07176v1 [math.NA])
    This paper proposes a new data-driven method for the reliable prediction of power system post-fault trajectories. The proposed method is based on the fundamentally new concept of Deep Operator Networks (DeepONets). Compared to traditional neural networks that learn to approximate functions, DeepONets are designed to approximate nonlinear operators. Under this operator framework, we design a DeepONet to (1) take as inputs the fault-on trajectories collected, for example, via simulation or phasor measurement units, and (2) provide as outputs the predicted post-fault trajectories. In addition, we endow our method with a much-needed ability to balance efficiency with reliable/trustworthy predictions via uncertainty quantification. To this end, we propose and compare two methods that enable quantifying the predictive uncertainty. First, we propose a \textit{Bayesian DeepONet} (B-DeepONet) that uses stochastic gradient Hamiltonian Monte-Carlo to sample from the posterior distribution of the DeepONet parameters. Then, we propose a \textit{Probabilistic DeepONet} (Prob-DeepONet) that uses a probabilistic training strategy to equip DeepONets with a form of automated uncertainty quantification, at virtually no extra computational cost. Finally, we validate the predictive power and uncertainty quantification capability of the proposed B-DeepONet and Prob-DeepONet using the IEEE 16-machine 68-bus system.  ( 2 min )
    Vau da muntanialas: Energy-efficient multi-die scalable acceleration of RNN inference. (arXiv:2202.07462v1 [cs.LG])
    Recurrent neural networks such as Long Short-Term Memories (LSTMs) learn temporal dependencies by keeping an internal state, making them ideal for time-series problems such as speech recognition. However, the output-to-input feedback creates distinctive memory bandwidth and scalability challenges in designing accelerators for RNNs. We present Muntaniala, an RNN accelerator architecture for LSTM inference with a silicon-measured energy-efficiency of 3.25$TOP/s/W$ and performance of 30.53$GOP/s$ in UMC 65 $nm$ technology. The scalable design of Muntaniala allows running large RNN models by combining multiple tiles in a systolic array. We keep all parameters stationary on every die in the array, drastically reducing the I/O communication to only loading new features and sharing partial results with other dies. For quantifying the overall system power, including I/O power, we built Vau da Muntanialas, to the best of our knowledge, the first demonstration of a systolic multi-chip-on-PCB array of RNN accelerator. Our multi-die prototype performs LSTM inference with 192 hidden states in 330$\mu s$ with a total system power of 9.0$mW$ at 10$MHz$ consuming 2.95$\mu J$. Targeting the 8/16-bit quantization implemented in Muntaniala, we show a phoneme error rate (PER) drop of approximately 3% with respect to floating-point (FP) on a 3L-384NH-123NI LSTM network on the TIMIT dataset.  ( 2 min )
    FILTRA: Rethinking Steerable CNN by Filter Transform. (arXiv:2105.11636v2 [cs.CV] UPDATED)
    Steerable CNN imposes the prior knowledge of transformation invariance or equivariance in the network architecture to enhance the the network robustness on geometry transformation of data and reduce overfitting. It has been an intuitive and widely used technique to construct a steerable filter by augmenting a filter with its transformed copies in the past decades, which is named as filter transform in this paper. Recently, the problem of steerable CNN has been studied from aspect of group representation theory, which reveals the function space structure of a steerable kernel function. However, it is not yet clear on how this theory is related to the filter transform technique. In this paper, we show that kernel constructed by filter transform can also be interpreted in the group representation theory. This interpretation help complete the puzzle of steerable CNN theory and provides a novel and simple approach to implement steerable convolution operators. Experiments are executed on multiple datasets to verify the feasibility of the proposed approach.  ( 2 min )
    Continuously Generalized Ordinal Regression for Linear and Deep Models. (arXiv:2202.07005v1 [cs.LG])
    Ordinal regression is a classification task where classes have an order and prediction error increases the further the predicted class is from the true class. The standard approach for modeling ordinal data involves fitting parallel separating hyperplanes that optimize a certain loss function. This assumption offers sample efficient learning via inductive bias, but is often too restrictive in real-world datasets where features may have varying effects across different categories. Allowing class-specific hyperplane slopes creates generalized logistic ordinal regression, increasing the flexibility of the model at a cost to sample efficiency. We explore an extension of the generalized model to the all-thresholds logistic loss and propose a regularization approach that interpolates between these two extremes. Our method, which we term continuously generalized ordinal logistic, significantly outperforms the standard ordinal logistic model over a thorough set of ordinal regression benchmark datasets. We further extend this method to deep learning and show that it achieves competitive or lower prediction error compared to previous models over a range of datasets and modalities. Furthermore, two primary alternative models for deep learning ordinal regression are shown to be special cases of our framework.  ( 2 min )
    A Theory of PAC Learnability under Transformation Invariances. (arXiv:2202.07552v1 [cs.LG])
    Transformation invariances are present in many real-world problems. For example, image classification is usually invariant to rotation and color transformation: a rotated car in a different color is still identified as a car. Data augmentation, which adds the transformed data into the training set and trains a model on the augmented data, is one commonly used technique to build these invariances into the learning process. However, it is unclear how data augmentation performs theoretically and what the optimal algorithm is in presence of transformation invariances. In this paper, we study PAC learnability under transformation invariances in three settings according to different levels of realizability: (i) A hypothesis fits the augmented data; (ii) A hypothesis fits only the original data and the transformed data lying in the support of the data distribution; (iii) Agnostic case. One interesting observation is that distinguishing between the original data and the transformed data is necessary to achieve optimal accuracy in setting (ii) and (iii), which implies that any algorithm not differentiating between the original and transformed data (including data augmentation) is not optimal. Furthermore, this type of algorithms can even "harm" the accuracy. In setting (i), although it is unnecessary to distinguish between the two data sets, data augmentation still does not perform optimally. Due to such a difference, we propose two combinatorial measures characterizing the optimal sample complexity in setting (i) and (ii)(iii) and provide the optimal algorithms.  ( 2 min )
    Accelerating Non-Negative and Bounded-Variable Linear Regression Algorithms with Safe Screening. (arXiv:2202.07258v1 [cs.LG])
    Non-negative and bounded-variable linear regression problems arise in a variety of applications in machine learning and signal processing. In this paper, we propose a technique to accelerate existing solvers for these problems by identifying saturated coordinates in the course of iterations. This is akin to safe screening techniques previously proposed for sparsity-regularized regression problems. The proposed strategy is provably safe as it provides theoretical guarantees that the identified coordinates are indeed saturated in the optimal solution. Experimental results on synthetic and real data show compelling accelerations for both non-negative and bounded-variable problems.  ( 2 min )
    Optimal Algorithms for Stochastic Multi-Level Compositional Optimization. (arXiv:2202.07530v1 [cs.LG])
    In this paper, we investigate the problem of stochastic multi-level compositional optimization, where the objective function is a composition of multiple smooth but possibly non-convex functions. Existing methods for solving this problem either suffer from sub-optimal sample complexities or need a huge batch size. To address this limitation, we propose a Stochastic Multi-level Variance Reduction method (SMVR), which achieves the optimal sample complexity of $\mathcal{O}\left(1 / \epsilon^{3}\right)$ to find an $\epsilon$-stationary point for non-convex objectives. Furthermore, when the objective function satisfies the convexity or Polyak-Lojasiewicz (PL) condition, we propose a stage-wise variant of SMVR and improve the sample complexity to $\mathcal{O}\left(1 / \epsilon^{2}\right)$ for convex functions or $\mathcal{O}\left(1 /(\mu\epsilon)\right)$ for non-convex functions satisfying the $\mu$-PL condition. The latter result implies the same complexity for $\mu$-strongly convex functions. To make use of adaptive learning rates, we also develop Adaptive SMVR, which achieves the same optimal complexities but converges faster in practice. All our complexities match the lower bounds not only in terms of $\epsilon$ but also in terms of $\mu$ (for PL or strongly convex functions), without using a large batch size in each iteration.  ( 2 min )
    StratDef: a strategic defense against adversarial attacks in malware detection. (arXiv:2202.07568v1 [cs.LG])
    Over the years, most research towards defenses against adversarial attacks on machine learning models has been in the image processing domain. The malware detection domain has received less attention despite its importance. Moreover, most work exploring defenses focuses on feature-based, gradient-based or randomized methods but with no strategy when applying them. In this paper, we introduce StratDef, which is a strategic defense system tailored for the malware detection domain based on a Moving Target Defense and Game Theory approach. We overcome challenges related to the systematic construction, selection and strategic use of models to maximize adversarial robustness. StratDef dynamically and strategically chooses the best models to increase the uncertainty for the attacker, whilst minimizing critical aspects in the adversarial ML domain like attack transferability. We provide the first comprehensive evaluation of defenses against adversarial attacks on machine learning for malware detection, where our threat model explores different levels of threat, attacker knowledge, capabilities, and attack intensities. We show that StratDef performs better than other defenses even when facing the peak adversarial threat. We also show that, from the existing defenses, only a few adversarially-trained models provide substantially better protection than just using vanilla models but are still outperformed by StratDef.  ( 2 min )
    Constructing coarse-scale bifurcation diagrams from spatio-temporal observations of microscopic simulations: A parsimonious machine learning approach. (arXiv:2201.13323v2 [math.DS] UPDATED)
    We address a three-tier data-driven approach to solve the inverse problem in complex systems modelling from spatio-temporal data produced by microscopic simulators using machine learning. In the first step, we exploit manifold learning and in particular parsimonious Diffusion Maps using leave-one-out cross-validation (LOOCV) to both identify the intrinsic dimension of the manifold where the emergent dynamics evolve and for feature selection over the parametric space. In the second step, based on the selected features, we learn the right-hand-side of the effective partial differential equations (PDEs) using two machine learning schemes, namely shallow Feedforward Neural Networks (FNNs) with two hidden layers and single-layer Random Projection Networks(RPNNs) which basis functions are constructed using an appropriate random sampling approach. Finally, based on the learned black-box PDE model, we construct the corresponding bifurcation diagram, thus exploiting the numerical bifurcation analysis toolkit. For our illustrations, we implemented the proposed method to construct the one-parameter bifurcation diagram of the 1D FitzHugh-Nagumo PDEs from data generated by $D1Q3$ Lattice Boltzmann simulations. The proposed method was quite effective in terms of numerical accuracy regarding the construction of the coarse-scale bifurcation diagram. Furthermore, the proposed RPNN scheme was $\sim$ 20 to 30 times less costly regarding the training phase than the traditional shallow FNNs, thus arising as a promising alternative to deep learning for solving the inverse problem for high-dimensional PDEs.  ( 2 min )
    STaR: Knowledge Graph Embedding by Scaling, Translation and Rotation. (arXiv:2202.07130v1 [cs.LG])
    The bilinear method is mainstream in Knowledge Graph Embedding (KGE), aiming to learn low-dimensional representations for entities and relations in Knowledge Graph (KG) and complete missing links. Most of the existing works are to find patterns between relationships and effectively model them to accomplish this task. Previous works have mainly discovered 6 important patterns like non-commutativity. Although some bilinear methods succeed in modeling these patterns, they neglect to handle 1-to-N, N-to-1, and N-to-N relations (or complex relations) concurrently, which hurts their expressiveness. To this end, we integrate scaling, the combination of translation and rotation that can solve complex relations and patterns, respectively, where scaling is a simplification of projection. Therefore, we propose a corresponding bilinear model Scaling Translation and Rotation (STaR) consisting of the above two parts. Besides, since translation cannot be incorporated into the bilinear model directly, we introduce translation matrix as the equivalent. Theoretical analysis proves that STaR is capable of modeling all patterns and handling complex relations simultaneously, and experiments demonstrate its effectiveness on commonly used benchmarks for link prediction.  ( 2 min )
    textless-lib: a Library for Textless Spoken Language Processing. (arXiv:2202.07359v1 [cs.CL])
    Textless spoken language processing research aims to extend the applicability of standard NLP toolset onto spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed to facilitate research in this research area. We describe the building blocks that the library provides and demonstrate its usability by discuss three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation. We believe that textless-lib substantially simplifies research the textless setting and will be handful not only for speech researchers but also for the NLP community at large. The code, documentation, and pre-trained models are available at https://github.com/facebookresearch/textlesslib/ .  ( 2 min )
    Adversarial Attacks and Defense Methods for Power Quality Recognition. (arXiv:2202.07421v1 [cs.CR])
    Vulnerability of various machine learning methods to adversarial examples has been recently explored in the literature. Power systems which use these vulnerable methods face a huge threat against adversarial examples. To this end, we first propose a signal-specific method and a universal signal-agnostic method to attack power systems using generated adversarial examples. Black-box attacks based on transferable characteristics and the above two methods are also proposed and evaluated. We then adopt adversarial training to defend systems against adversarial attacks. Experimental analyses demonstrate that our signal-specific attack method provides less perturbation compared to the FGSM (Fast Gradient Sign Method), and our signal-agnostic attack method can generate perturbations fooling most natural signals with high probability. What's more, the attack method based on the universal signal-agnostic algorithm has a higher transfer rate of black-box attacks than the attack method based on the signal-specific algorithm. In addition, the results show that the proposed adversarial training improves robustness of power systems to adversarial examples.  ( 2 min )
    Unlabeled Data Help: Minimax Analysis and Adversarial Robustness. (arXiv:2202.06996v1 [stat.ML])
    The recent proposed self-supervised learning (SSL) approaches successfully demonstrate the great potential of supplementing learning algorithms with additional unlabeled data. However, it is still unclear whether the existing SSL algorithms can fully utilize the information of both labelled and unlabeled data. This paper gives an affirmative answer for the reconstruction-based SSL algorithm \citep{lee2020predicting} under several statistical models. While existing literature only focuses on establishing the upper bound of the convergence rate, we provide a rigorous minimax analysis, and successfully justify the rate-optimality of the reconstruction-based SSL algorithm under different data generation models. Furthermore, we incorporate the reconstruction-based SSL into the existing adversarial training algorithms and show that learning from unlabeled data helps improve the robustness.  ( 2 min )
    Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms. (arXiv:2202.07496v1 [cs.LG])
    In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.  ( 2 min )
    Analysis of Neural Fragility: Bounding the Norm of a Rank-One Perturbation Matrix. (arXiv:2202.07026v1 [cs.LG])
    Over 15 million epilepsy patients worldwide do not respond to drugs and require surgical treatment. Successful surgical treatment requires complete removal, or disconnection of the epileptogenic zone (EZ), but without a prospective biomarker of the EZ, surgical success rates vary between 30%-70%. Neural fragility is a model recently proposed to localize the EZ. Neural fragility is computed as the l2 norm of a structured rank-one perturbation of an estimated linear dynamical system. However, an analysis of its numerical properties have not been explored. We show that neural fragility is a well-defined model given a good estimator of the linear dynamical system from data. Specifically, we provide bounds on neural fragility as a function of the underlying linear system and noise.  ( 2 min )
    DermX: an end-to-end framework for explainable automated dermatological diagnosis. (arXiv:2202.06956v1 [eess.IV])
    Dermatological diagnosis automation is essential in addressing the high prevalence of skin diseases and critical shortage of dermatologists. Despite approaching expert-level diagnosis performance, convolutional neural network (ConvNet) adoption in clinical practice is impeded by their limited explainability, and by subjective, expensive explainability validations. We introduce DermX and DermX+, an end-to-end framework for explainable automated dermatological diagnosis. DermX is a clinically-inspired explainable dermatological diagnosis ConvNet, trained using DermXDB, a 554 images dataset annotated by eight dermatologists with diagnoses and supporting explanations. DermX+ extends DermX with guided attention training for explanation attention maps. Both methods achieve near-expert diagnosis performance, with DermX, DermX+, and dermatologist F1 scores of 0.79, 0.79, and 0.87, respectively. We assess the explanation plausibility in terms of identification and localization, by comparing model-selected with dermatologist-selected explanations, and gradient-weighted class-activation maps with dermatologist explanation maps. Both DermX and DermX+ obtain an identification F1 score of 0.78. The localization F1 score is 0.39 for DermX and 0.35 for DermX+. Explanation faithfulness is assessed through contrasting samples, DermX obtaining 0.53 faithfulness and DermX+ 0.25. These results show that explainability does not necessarily come at the expense of predictive power, as our high-performance models provide both plausible and faithful explanations for their diagnoses.  ( 2 min )
  • Open

    Long-term Causal Inference Under Persistent Confounding via Data Combination. (arXiv:2202.07234v1 [stat.ME])
    We study the identification and estimation of long-term treatment effects when both experimental and observational data are available. Since the long-term outcome is observed only after a long delay, it is not measured in the experimental data, but only recorded in the observational data. However, both types of data include observations of some short-term outcomes. In this paper, we uniquely tackle the challenge of persistent unmeasured confounders, i.e., some unmeasured confounders that can simultaneously affect the treatment, short-term outcomes and the long-term outcome, noting that they invalidate identification strategies in previous literature. To address this challenge, we exploit the sequential structure of multiple short-term outcomes, and develop three novel identification strategies for the average long-term treatment effect. We further propose three corresponding estimators and prove their asymptotic consistency and asymptotic normality. We finally apply our methods to estimate the effect of a job training program on long-term employment using semi-synthetic data. We numerically show that our proposals outperform existing methods that fail to handle persistent confounders.
    Understanding DDPM Latent Codes Through Optimal Transport. (arXiv:2202.07477v1 [stat.ML])
    Diffusion models have recently outperformed alternative approaches to model the distribution of natural images, such as GANs. Such diffusion models allow for deterministic sampling via the probability flow ODE, giving rise to a latent space and an encoder map. While having important practical applications, such as estimation of the likelihood, the theoretical properties of this map are not yet fully understood. In the present work, we partially address this question for the popular case of the VP SDE (DDPM) approach. We show that, perhaps surprisingly, the DDPM encoder map coincides with the optimal transport map for common distributions; we support this claim theoretically and by extensive numerical experiments.  ( 2 min )
    TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm. (arXiv:2202.07172v1 [stat.ML])
    Approximating distributions from their samples is a canonical statistical-learning problem. One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial, where $t\ge1$ and $d\ge0$. Letting $c_{t,d}$ denote the smallest such factor, clearly $c_{1,0}=1$, and it can be shown that $c_{t,d}\ge 2$ for all other $t$ and $d$. Yet current computationally efficient algorithms show only $c_{t,1}\le 2.25$ and the bound rises quickly to $c_{t,d}\le 3$ for $d\ge 9$. We derive a near-linear-time and essentially sample-optimal estimator that establishes $c_{t,d}=2$ for all $(t,d)\ne(1,0)$. Additionally, for many practical distributions, the lowest approximation distance is achieved by polynomials with vastly varying number of pieces. We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation. Experiments combining the two techniques confirm improved performance over existing methodologies.  ( 2 min )
    XEM: An Explainable-by-Design Ensemble Method for Multivariate Time Series Classification. (arXiv:2005.03645v5 [cs.LG] UPDATED)
    We present XEM, an eXplainable-by-design Ensemble method for Multivariate time series classification. XEM relies on a new hybrid ensemble method that combines an explicit boosting-bagging approach to handle the bias-variance trade-off faced by machine learning models and an implicit divide-and-conquer approach to individualize classifier errors on different parts of the training data. Our evaluation shows that XEM outperforms the state-of-the-art MTS classifiers on the public UEA datasets. Furthermore, XEM provides faithful explainability-by-design and manifests robust performance when faced with challenges arising from continuous data collection (different MTS length, missing data and noise).  ( 2 min )
    Deep learning and differential equations for modeling changes in individual-level latent dynamics between observation periods. (arXiv:2202.07403v1 [stat.ML])
    When modeling longitudinal biomedical data, often dimensionality reduction as well as dynamic modeling in the resulting latent representation is needed. This can be achieved by artificial neural networks for dimension reduction, and differential equations for dynamic modeling of individual-level trajectories. However, such approaches so far assume that parameters of individual-level dynamics are constant throughout the observation period. Motivated by an application from psychological resilience research, we propose an extension where different sets of differential equation parameters are allowed for observation sub-periods. Still, estimation for intra-individual sub-periods is coupled for being able to fit the model also with a relatively small dataset. We subsequently derive prediction targets from individual dynamic models of resilience in the application. These serve as interpretable resilience-related outcomes, to be predicted from characteristics of individuals, measured at baseline and a follow-up time point, and selecting a small set of important predictors. Our approach is seen to successfully identify individual-level parameters of dynamic models that allows us to stably select predictors, i.e., resilience factors. Furthermore, we can identify those characteristics of individuals that are the most promising for updates at follow-up, which might inform future study design. This underlines the usefulness of our proposed deep dynamic modeling approach with changes in parameters between observation sub-periods.  ( 2 min )
    MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures. (arXiv:2006.07540v3 [cs.LG] UPDATED)
    Regularization and transfer learning are two popular techniques to enhance generalization on unseen data, which is a fundamental problem of machine learning. Regularization techniques are versatile, as they are task- and architecture-agnostic, but they do not exploit a large amount of data available. Transfer learning methods learn to transfer knowledge from one domain to another, but may not generalize across tasks and architectures, and may introduce new training cost for adapting to the target task. To bridge the gap between the two, we propose a transferable perturbation, MetaPerturb, which is meta-learned to improve generalization performance on unseen data. MetaPerturb is implemented as a set-based lightweight network that is agnostic to the size and the order of the input, which is shared across the layers. Then, we propose a meta-learning framework, to jointly train the perturbation function over heterogeneous tasks in parallel. As MetaPerturb is a set-function trained over diverse distributions across layers and tasks, it can generalize to heterogeneous tasks and architectures. We validate the efficacy and generality of MetaPerturb trained on a specific source domain and architecture, by applying it to the training of diverse neural architectures on heterogeneous target datasets against various regularizers and fine-tuning. The results show that the networks trained with MetaPerturb significantly outperform the baselines on most of the tasks and architectures, with a negligible increase in the parameter size and no hyperparameters to tune.  ( 2 min )
    Adaptive Conformal Predictions for Time Series. (arXiv:2202.07282v1 [stat.ML])
    Uncertainty quantification of predictive models is crucial in decision-making problems. Conformal prediction is a general and theoretically sound answer. However, it requires exchangeable data, excluding time series. While recent works tackled this issue, we argue that Adaptive Conformal Inference (ACI, Gibbs and Cand{\`e}s, 2021), developed for distribution-shift time series, is a good procedure for time series with general dependency. We theoretically analyse the impact of the learning rate on its efficiency in the exchangeable and auto-regressive case. We propose a parameter-free method, AgACI, that adaptively builds upon ACI based on online expert aggregation. We lead extensive fair simulations against competing methods that advocate for ACI's use in time series. We conduct a real case study: electricity price forecasting. The proposed aggregation algorithm provides efficient prediction intervals for day-ahead forecasting. All the code and data to reproduce the experiments is made available.  ( 2 min )
    Information-Theoretic Analysis of Minimax Excess Risk. (arXiv:2202.07537v1 [cs.LG])
    Two main concepts studied in machine learning theory are generalization gap (difference between train and test error) and excess risk (difference between test error and the minimum possible error). While information-theoretic tools have been used extensively to study the generalization gap of learning algorithms, the information-theoretic nature of excess risk has not yet been fully investigated. In this paper, some steps are taken toward this goal. We consider the frequentist problem of minimax excess risk as a zero-sum game between algorithm designer and the world. Then, we argue that it is desirable to modify this game in a way that the order of play can be swapped. We prove that, under some regularity conditions, if the world and designer can play randomly the duality gap is zero and the order of play can be changed. In this case, a Bayesian problem surfaces in the dual representation. This makes it possible to utilize recent information-theoretic results on minimum excess risk in Bayesian learning to provide bounds on the minimax excess risk. We demonstrate the applicability of the results by providing information theoretic insight on two important classes of problems: classification when the hypothesis space has finite VC-dimension, and regularized least squares.  ( 2 min )
    DeepPAMM: Deep Piecewise Exponential Additive Mixed Models for Complex Hazard Structures in Survival Analysis. (arXiv:2202.07423v1 [stat.ML])
    Survival analysis (SA) is an active field of research that is concerned with time-to-event outcomes and is prevalent in many domains, particularly biomedical applications. Despite its importance, SA remains challenging due to small-scale data sets and complex outcome distributions, concealed by truncation and censoring processes. The piecewise exponential additive mixed model (PAMM) is a model class addressing many of these challenges, yet PAMMs are not applicable in high-dimensional feature settings or in the case of unstructured or multimodal data. We unify existing approaches by proposing DeepPAMM, a versatile deep learning framework that is well-founded from a statistical point of view, yet with enough flexibility for modeling complex hazard structures. We illustrate that DeepPAMM is competitive with other machine learning approaches with respect to predictive performance while maintaining interpretability through benchmark experiments and an extended case study.  ( 2 min )
    Robust Multi-Objective Bayesian Optimization Under Input Noise. (arXiv:2202.07549v1 [cs.LG])
    Bayesian optimization (BO) is a sample-efficient approach for tuning design parameters to optimize expensive-to-evaluate, black-box performance metrics. In many manufacturing processes, the design parameters are subject to random input noise, resulting in a product that is often less performant than expected. Although BO methods have been proposed for optimizing a single objective under input noise, no existing method addresses the practical scenario where there are multiple objectives that are sensitive to input perturbations. In this work, we propose the first multi-objective BO method that is robust to input noise. We formalize our goal as optimizing the multivariate value-at-risk (MVaR), a risk measure of the uncertain objectives. Since directly optimizing MVaR is computationally infeasible in many settings, we propose a scalable, theoretically-grounded approach for optimizing MVaR using random scalarizations. Empirically, we find that our approach significantly outperforms alternative methods and efficiently identifies optimal robust designs that will satisfy specifications across multiple metrics with high probability.  ( 2 min )
    One-bit Submission for Locally Private Quasi-MLE: Its Asymptotic Normality and Limitation. (arXiv:2202.07194v1 [stat.ML])
    Local differential privacy~(LDP) is an information-theoretic privacy definition suitable for statistical surveys that involve an untrusted data curator. An LDP version of quasi-maximum likelihood estimator~(QMLE) has been developed, but the existing method to build LDP QMLE is difficult to implement for a large-scale survey system in the real world due to long waiting time, expensive communication cost, and the boundedness assumption of derivative of a log-likelihood function. We provided an alternative LDP protocol without those issues, which is potentially much easily deployable to a large-scale survey. We also provided sufficient conditions for the consistency and asymptotic normality and limitations of our protocol. Our protocol is less burdensome for the users, and the theoretical guarantees cover more realistic cases than those for the existing method.  ( 2 min )
    NeuPL: Neural Population Learning. (arXiv:2202.07415v1 [cs.AI])
    Learning in strategy games (e.g. StarCraft, poker) requires the discovery of diverse policies. This is often achieved by iteratively training new policies against existing ones, growing a policy population that is robust to exploit. This iterative approach suffers from two issues in real-world games: a) under finite budget, approximate best-response operators at each iteration needs truncating, resulting in under-trained good-responses populating the population; b) repeated learning of basic skills at each iteration is wasteful and becomes intractable in the presence of increasingly strong opponents. In this work, we propose Neural Population Learning (NeuPL) as a solution to both issues. NeuPL offers convergence guarantees to a population of best-responses under mild assumptions. By representing a population of policies within a single conditional model, NeuPL enables transfer learning across policies. Empirically, we show the generality, improved performance and efficiency of NeuPL across several test domains. Most interestingly, we show that novel strategies become more accessible, not less, as the neural population expands.  ( 2 min )
    Defending against Reconstruction Attacks with R\'enyi Differential Privacy. (arXiv:2202.07623v1 [cs.LG])
    Reconstruction attacks allow an adversary to regenerate data samples of the training set using access to only a trained model. It has been recently shown that simple heuristics can reconstruct data samples from language models, making this threat scenario an important aspect of model release. Differential privacy is a known solution to such attacks, but is often used with a relatively large privacy budget (epsilon > 8) which does not translate to meaningful guarantees. In this paper we show that, for a same mechanism, we can derive privacy guarantees for reconstruction attacks that are better than the traditional ones from the literature. In particular, we show that larger privacy budgets do not protect against membership inference, but can still protect extraction of rare secrets. We show experimentally that our guarantees hold against various language models, including GPT-2 finetuned on Wikitext-103.  ( 2 min )
    Stein Latent Optimization for Generative Adversarial Networks. (arXiv:2106.05319v5 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) with clustered latent spaces can perform conditional generation in a completely unsupervised manner. In the real world, the salient attributes of unlabeled data can be imbalanced. However, most of existing unsupervised conditional GANs cannot cluster attributes of these data in their latent spaces properly because they assume uniform distributions of the attributes. To address this problem, we theoretically derive Stein latent optimization that provides reparameterizable gradient estimations of the latent distribution parameters assuming a Gaussian mixture prior in a continuous latent space. Structurally, we introduce an encoder network and novel unsupervised conditional contrastive loss to ensure that data generated from a single mixture component represent a single attribute. We confirm that the proposed method, named Stein Latent Optimization for GANs (SLOGAN), successfully learns balanced or imbalanced attributes and achieves state-of-the-art unsupervised conditional generation performance even in the absence of attribute information (e.g., the imbalance ratio). Moreover, we demonstrate that the attributes to be learned can be manipulated using a small amount of probe data.  ( 2 min )
    A Generalized Hierarchical Nonnegative Tensor Decomposition. (arXiv:2109.14820v2 [cs.LG] UPDATED)
    Nonnegative matrix factorization (NMF) has found many applications including topic modeling and document analysis. Hierarchical NMF (HNMF) variants are able to learn topics at various levels of granularity and illustrate their hierarchical relationship. Recently, nonnegative tensor factorization (NTF) methods have been applied in a similar fashion in order to handle data sets with complex, multi-modal structure. Hierarchical NTF (HNTF) methods have been proposed, however these methods do not naturally generalize their matrix-based counterparts. Here, we propose a new HNTF model which directly generalizes a HNMF model special case, and provide a supervised extension. We also provide a multiplicative updates training method for this model. Our experimental results show that this model more naturally illuminates the topic hierarchy than previous HNMF and HNTF methods.  ( 2 min )
    A Non-parametric View of FedAvg and FedProx: Beyond Stationary Points. (arXiv:2106.15216v2 [stat.ML] UPDATED)
    Federated Learning (FL) is a promising decentralized learning framework and has great potentials in privacy preservation and in lowering the computation load at the cloud. Recent work showed that FedAvg and FedProx - the two widely-adopted FL algorithms - fail to reach the stationary points of the global optimization objective even for homogeneous linear regression problems. Further, it is concerned that the common model learned might not generalize well locally at all in the presence of heterogeneity. In this paper, we analyze the convergence and statistical efficiency of FedAvg and FedProx, addressing the above two concerns. Our analysis is based on the standard non-parametric regression in a reproducing kernel Hilbert space (RKHS), and allows for heterogeneous local data distributions and unbalanced local datasets. We prove that the estimation errors, measured in either the empirical norm or the RKHS norm, decay with a rate of 1/t in general and exponentially for finite-rank kernels. In certain heterogeneous settings, these upper bounds also imply that both FedAvg and FedProx achieve the optimal error rate. To further analytically quantify the impact of the heterogeneity at each client, we propose and characterize a novel notion-federation gain, defined as the reduction of the estimation error for a client to join the FL. We discover that when the data heterogeneity is moderate, a client with limited local data can benefit from a common model with a large federation gain. Numerical experiments further corroborate our theoretical findings.  ( 2 min )
    Algebraic function based Banach space valued ordinary and fractional neural network approximations. (arXiv:2202.07425v1 [stat.ML])
    Here we research the univariate quantitative approximation, ordinary and fractional, of Banach space valued continuous functions on a compact interval or all the real line by quasi-interpolation Banach space valued neural network operators. These approximations are derived by establishing Jackson type inequalities involving the modulus of continuity of the engaged function or its Banach space valued high order derivative of fractional derivatives. Our operators are defined by using a density function generated by an algebraic sigmoid function. The approximations are pointwise and of the uniform norm. The related Banach space valued feed-forward neural networks are with one hidden layer.  ( 2 min )
    Robust Estimation for Random Graphs. (arXiv:2111.05320v2 [cs.DS] UPDATED)
    We study the problem of robustly estimating the parameter $p$ of an Erd\H{o}s-R\'enyi random graph on $n$ nodes, where a $\gamma$ fraction of nodes may be adversarially corrupted. After showing the deficiencies of canonical estimators, we design a computationally-efficient spectral algorithm which estimates $p$ up to accuracy $\tilde O(\sqrt{p(1-p)}/n + \gamma\sqrt{p(1-p)} /\sqrt{n}+ \gamma/n)$ for $\gamma < 1/60$. Furthermore, we give an inefficient algorithm with similar accuracy for all $\gamma <1/2$, the information-theoretic limit. Finally, we prove a nearly-matching statistical lower bound, showing that the error of our algorithms is optimal up to logarithmic factors.  ( 2 min )
    Estimating risks of option books using neural-SDE market models. (arXiv:2202.07148v1 [q-fin.CP])
    In this paper, we examine the capacity of an arbitrage-free neural-SDE market model to produce realistic scenarios for the joint dynamics of multiple European options on a single underlying. We subsequently demonstrate its use as a risk simulation engine for option portfolios. Through backtesting analysis, we show that our models are more computationally efficient and accurate for evaluating the Value-at-Risk (VaR) of option portfolios, with better coverage performance and less procyclicality than standard filtered historical simulation approaches.  ( 2 min )
    Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. (arXiv:2202.07511v1 [cs.LG])
    We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.  ( 2 min )
    Network Comparison with Interpretable Contrastive Network Representation Learning. (arXiv:2005.12419v2 [cs.LG] UPDATED)
    Identifying unique characteristics in a network through comparison with another network is an essential network analysis task. For example, with networks of protein interactions obtained from normal and cancer tissues, we can discover unique types of interactions in cancer tissues. This analysis task could be greatly assisted by contrastive learning, which is an emerging analysis approach to discover salient patterns in one dataset relative to another. However, existing contrastive learning methods cannot be directly applied to networks as they are designed only for high-dimensional data analysis. To address this problem, we introduce a new analysis approach called contrastive network representation learning (cNRL). By integrating two machine learning schemes, network representation learning and contrastive learning, cNRL enables embedding of network nodes into a low-dimensional representation that reveals the uniqueness of one network compared to another. Within this approach, we also design a method, named i-cNRL, which offers interpretability in the learned results, allowing for understanding which specific patterns are only found in one network. We demonstrate the effectiveness of i-cNRL for network comparison with multiple network models and real-world datasets. Furthermore, we compare i-cNRL and other potential cNRL algorithm designs through quantitative and qualitative evaluations.  ( 2 min )
    Transformers in Time Series: A Survey. (arXiv:2202.07125v1 [cs.LG])
    Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also intrigues great interests in the time series community. Among multiple advantages of transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications. In this paper, we systematically review transformer schemes for time series modeling by highlighting their strengths as well as limitations through a new taxonomy to summarize existing time series transformers in two perspectives. From the perspective of network modifications, we summarize the adaptations of module level and architecture level of the time series transformers. From the perspective of applications, we categorize time series transformers based on common tasks including forecasting, anomaly detection, and classification. Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how Transformers perform in time series. Finally, we discuss and suggest future directions to provide useful research guidance. To the best of our knowledge, this paper is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data. We hope this survey will ignite further research interests in time series Transformers.  ( 2 min )
    REPID: Regional Effect Plots with implicit Interaction Detection. (arXiv:2202.07254v1 [stat.ML])
    Machine learning models can automatically learn complex relationships, such as non-linear and interaction effects. Interpretable machine learning methods such as partial dependence plots visualize marginal feature effects but may lead to misleading interpretations when feature interactions are present. Hence, employing additional methods that can detect and measure the strength of interactions is paramount to better understand the inner workings of machine learning models. We demonstrate several drawbacks of existing global interaction detection approaches, characterize them theoretically, and evaluate them empirically. Furthermore, we introduce regional effect plots with implicit interaction detection, a novel framework to detect interactions between a feature of interest and other features. The framework also quantifies the strength of interactions and provides interpretable and distinct regions in which feature effects can be interpreted more reliably, as they are less confounded by interactions. We prove the theoretical eligibility of our method and show its applicability on various simulation and real-world examples.  ( 2 min )
    On Incorporating Inductive Biases into VAEs. (arXiv:2106.13746v2 [stat.ML] UPDATED)
    We explain why directly changing the prior can be a surprisingly ineffective mechanism for incorporating inductive biases into VAEs, and introduce a simple and effective alternative approach: Intermediary Latent Space VAEs(InteL-VAEs). InteL-VAEs use an intermediary set of latent variables to control the stochasticity of the encoding process, before mapping these in turn to the latent representation using a parametric function that encapsulates our desired inductive bias(es). This allows us to impose properties like sparsity or clustering on learned representations, and incorporate human knowledge into the generative model. Whereas changing the prior only indirectly encourages behavior through regularizing the encoder, InteL-VAEs are able to directly enforce desired characteristics. Moreover, they bypass the computation and encoder design issues caused by non-Gaussian priors, while allowing for additional flexibility through training of the parametric mapping function. We show that these advantages, in turn, lead to both better generative models and better representations being learned.  ( 2 min )
    Random Feature Amplification: Feature Learning and Generalization in Neural Networks. (arXiv:2202.07626v1 [cs.LG])
    In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate, refuting the conjecture of Malach and Shalev-Shwartz that 'deeper is better only when shallow is good'. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features.  ( 2 min )
    Between Stochastic and Adversarial Online Convex Optimization: Improved Regret Bounds via Smoothness. (arXiv:2202.07554v1 [cs.LG])
    Stochastic and adversarial data are two widely studied settings in online learning. But many optimization tasks are neither i.i.d. nor fully adversarial, which makes it of fundamental interest to get a better theoretical understanding of the world between these extremes. In this work we establish novel regret bounds for online convex optimization in a setting that interpolates between stochastic i.i.d. and fully adversarial losses. By exploiting smoothness of the expected losses, these bounds replace a dependence on the maximum gradient length by the variance of the gradients, which was previously known only for linear losses. In addition, they weaken the i.i.d. assumption by allowing adversarially poisoned rounds or shifts in the data distribution. To accomplish this goal, we introduce two key quantities associated with the loss sequence, that we call the cumulative stochastic variance and the adversarial variation. Our upper bounds are attained by instances of optimistic follow the regularized leader, and we design adaptive learning rates that automatically adapt to the cumulative stochastic variance and adversarial variation. In the fully i.i.d. case, our bounds match the rates one would expect from results in stochastic acceleration, and in the fully adversarial case they gracefully deteriorate to match the minimax regret. We further provide lower bounds showing that our regret upper bounds are tight for all intermediate regimes for the cumulative stochastic variance and the adversarial variation.  ( 2 min )
    Self-Training of Halfspaces with Generalization Guarantees under Massart Mislabeling Noise Model. (arXiv:2111.14427v3 [cs.LG] UPDATED)
    We investigate the generalization properties of a self-training algorithm with halfspaces. The approach learns a list of halfspaces iteratively from labeled and unlabeled training data, in which each iteration consists of two steps: exploration and pruning. In the exploration phase, the halfspace is found sequentially by maximizing the unsigned-margin among unlabeled examples and then assigning pseudo-labels to those that have a distance higher than the current threshold. The pseudo-labeled examples are then added to the training set, and a new classifier is learned. This process is repeated until no more unlabeled examples remain for pseudo-labeling. In the pruning phase, pseudo-labeled samples that have a distance to the last halfspace greater than the associated unsigned-margin are then discarded. We prove that the misclassification error of the resulting sequence of classifiers is bounded and show that the resulting semi-supervised approach never degrades performance compared to the classifier learned using only the initial labeled training set. Experiments carried out on a variety of benchmarks demonstrate the efficiency of the proposed approach compared to state-of-the-art methods.  ( 2 min )
    Realistic Counterfactual Explanations by Learned Relations. (arXiv:2202.07356v1 [stat.ML])
    Many existing methods of counterfactual explanations ignore the intrinsic relationships between data attributes and thus fail to generate realistic counterfactuals. Moreover, the existing methods that account for relationships between data attributes require domain knowledge, which limits their applicability in complex real-world applications. In this paper, we propose a novel approach to realistic counterfactual explanations that preserve relationships between data attributes. The model directly learns the relationships by a variational auto-encoder without domain knowledge and then learns to disturb the latent space accordingly. We conduct extensive experiments on both synthetic and real-world datasets. The results demonstrate that the proposed method learns relationships from the data and preserves these relationships in generated counterfactuals.  ( 2 min )
    Optimal scaling of random-walk Metropolis algorithms using Bayesian large-sample asymptotics. (arXiv:2104.06384v3 [stat.ME] UPDATED)
    High-dimensional limit theorems have been shown useful to derive tuning rules for finding the optimal scaling in random-walk Metropolis algorithms. The assumptions under which weak convergence results are proved are however restrictive: the target density is typically assumed to be of a product form. Users may thus doubt the validity of such tuning rules in practical applications. In this paper, we shed some light on optimal-scaling problems from a different perspective, namely a large-sample one. This allows to prove weak convergence results under realistic assumptions and to propose novel parameter-dimension-dependent tuning guidelines. The proposed guidelines are consistent with previous ones when the target density is close to having a product form, and the results highlight that the correlation structure has to be accounted for to avoid performance deterioration if that is not the case, while justifying the use of a natural (asymptotically exact) approximation to the correlation matrix that can be employed for the very first algorithm run.  ( 2 min )
    A Statistical Learning View of Simple Kriging. (arXiv:2202.07365v1 [stat.ML])
    In the Big Data era, with the ubiquity of geolocation sensors in particular, massive datasets exhibiting a possibly complex spatial dependence structure are becoming increasingly available. In this context, the standard probabilistic theory of statistical learning does not apply directly and guarantees of the generalization capacity of predictive rules learned from such data are left to establish. We analyze here the simple Kriging task, the flagship problem in Geostatistics: the values of a square integrable random field $X=\{X_s\}_{s\in S}$, $S\subset \mathbb{R}^2$, with unknown covariance structure are to be predicted with minimum quadratic risk, based upon observing a single realization of the spatial process at a finite number of locations $s_1,\; \ldots,\; s_n$ in $S$. Despite the connection of this minimization problem with kernel ridge regression, establishing the generalization capacity of empirical risk minimizers is far from straightforward, due to the non i.i.d. nature of the spatial data $X_{s_1},\; \ldots,\; X_{s_n}$ involved. In this article, nonasymptotic bounds of order $O_{\mathbb{P}}(1/n)$ are proved for the excess risk of a plug-in predictive rule mimicking the true minimizer in the case of isotropic stationary Gaussian processes observed at locations forming a regular grid. These theoretical results, as well as the role played by the technical conditions required to establish them, are illustrated by various numerical experiments and hopefully pave the way for further developments in statistical learning based on spatial data.  ( 2 min )
    Identifying strongly correlated groups of sections in a large motorway network. (arXiv:2202.07644v1 [physics.soc-ph])
    In a motorway network, correlations between the different links, i.e. between the parts of (different) motorways, are of considerable interest. Knowledge of fluxes and velocities on individual motorways is not sufficient, rather, their correlations determine or reflect, respectively, the functionality of and the dynamics on the network as a whole. These correlations are time dependent as the dynamics on the network is highly non-stationary, as it strongly varies during the day and over the week. Correlations are indispensable to detect risks of failure in a traffic network. Discovery of alternative routes less correlated with the vulnerable ones helps to make the traffic network robust and to avoid a collapse. Hence, the identification of, especially, groups of strongly correlated road sections is needed. To this end, we employ an optimized $k$-means clustering method. A major ingredient is the spectral information of certain correlation matrices in which the leading collective motion of the network has been removed. We identify strongly correlated groups of sections in the large motorway network of North Rhine-Westphalia (NRW), Germany. The groups classify the motorway sections in terms of spectral and geographic features as well as of traffic phases during different time periods. The representation and visualization of the groups on the real topology, i.e. on the road map, provides new results on the dynamics on the motorway network. Our approach is very general and can also be applied to other correlated complex systems.  ( 2 min )
    Predicting on the Edge: Identifying Where a Larger Model Does Better. (arXiv:2202.07652v1 [cs.LG])
    Much effort has been devoted to making large and more accurate models, but relatively little has been put into understanding which examples are benefiting from the added complexity. In this paper, we demonstrate and analyze the surprisingly tight link between a model's predictive uncertainty on individual examples and the likelihood that larger models will improve prediction on them. Through extensive numerical studies on the T5 encoder-decoder architecture, we show that large models have the largest improvement on examples where the small model is most uncertain. On more certain examples, even those where the small model is not particularly accurate, large models are often unable to improve at all, and can even perform worse than the smaller model. Based on these findings, we show that a switcher model which defers examples to a larger model when a small model is uncertain can achieve striking improvements in performance and resource usage. We also explore committee-based uncertainty metrics that can be more effective but less practical.  ( 2 min )
    Synthetically Controlled Bandits. (arXiv:2202.07079v1 [stat.ML])
    This paper presents a new dynamic approach to experiment design in settings where, due to interference or other concerns, experimental units are coarse. `Region-split' experiments on online platforms are one example of such a setting. The cost, or regret, of experimentation is a natural concern here. Our new design, dubbed Synthetically Controlled Thompson Sampling (SCTS), minimizes the regret associated with experimentation at no practically meaningful loss to inferential ability. We provide theoretical guarantees characterizing the near-optimal regret of our approach, and the error rates achieved by the corresponding treatment effect estimator. Experiments on synthetic and real world data highlight the merits of our approach relative to both fixed and `switchback' designs common to such experimental settings.  ( 2 min )
    Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms. (arXiv:2202.07496v1 [cs.LG])
    In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.  ( 2 min )
    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression. (arXiv:2109.00046v2 [stat.ML] UPDATED)
    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to the high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.  ( 2 min )
    Accelerating Non-Negative and Bounded-Variable Linear Regression Algorithms with Safe Screening. (arXiv:2202.07258v1 [cs.LG])
    Non-negative and bounded-variable linear regression problems arise in a variety of applications in machine learning and signal processing. In this paper, we propose a technique to accelerate existing solvers for these problems by identifying saturated coordinates in the course of iterations. This is akin to safe screening techniques previously proposed for sparsity-regularized regression problems. The proposed strategy is provably safe as it provides theoretical guarantees that the identified coordinates are indeed saturated in the optimal solution. Experimental results on synthetic and real data show compelling accelerations for both non-negative and bounded-variable problems.  ( 2 min )
    Dual Training of Energy-Based Models with Overparametrized Shallow Neural Networks. (arXiv:2107.05134v2 [cs.LG] UPDATED)
    Energy-based models (EBMs) are generative models that are usually trained via maximum likelihood estimation. This approach becomes challenging in generic situations where the trained energy is non-convex, due to the need to sample the Gibbs distribution associated with this energy. Using general Fenchel duality results, we derive variational principles dual to maximum likelihood EBMs with shallow overparametrized neural network energies, both in the feature-learning and lazy linearized regimes. In the feature-learning regime, this dual formulation justifies using a two time-scale gradient ascent-descent (GDA) training algorithm in which one updates concurrently the particles in the sample space and the neurons in the parameter space of the energy. We also consider a variant of this algorithm in which the particles are sometimes restarted at random samples drawn from the data set, and show that performing these restarts at every iteration step corresponds to score matching training. These results are illustrated in simple numerical experiments, which indicates that GDA performs best when features and particles are updated using similar time scales.  ( 2 min )
    Minimax in Geodesic Metric Spaces: Sion's Theorem and Algorithms. (arXiv:2202.06950v1 [math.OC])
    Determining whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable. We take a step towards understanding certain nonconvex-nonconcave minimax problems that do remain tractable. Specifically, we study minimax problems cast in geodesic metric spaces, which provide a vast generalization of the usual convex-concave saddle point problems. The first main result of the paper is a geodesic metric space version of Sion's minimax theorem; we believe our proof is novel and transparent, as it relies on Helly's theorem only. In our second main result, we specialize to geodesically complete Riemannian manifolds: we devise and analyze the complexity of first-order methods for smooth minimax problems.  ( 2 min )
    Deep Ensembles Work, But Are They Necessary?. (arXiv:2202.06985v1 [cs.LG])
    Ensembling neural networks is an effective way to increase accuracy, and can often match the performance of larger models. This observation poses a natural question: given the choice between a deep ensemble and a single neural network with similar accuracy, is one preferable over the other? Recent work suggests that deep ensembles may offer benefits beyond predictive power: namely, uncertainty quantification and robustness to dataset shift. In this work, we demonstrate limitations to these purported benefits, and show that a single (but larger) neural network can replicate these qualities. First, we show that ensemble diversity, by any metric, does not meaningfully contribute to an ensemble's ability to detect out-of-distribution (OOD) data, and that one can estimate ensemble diversity by measuring the relative improvement of a single larger model. Second, we show that the OOD performance afforded by ensembles is strongly determined by their in-distribution (InD) performance, and -- in this sense -- is not indicative of any "effective robustness". While deep ensembles are a practical way to achieve performance improvement (in agreement with prior work), our results show that they may be a tool of convenience rather than a fundamentally better model class.  ( 2 min )
    Principal Manifold Flows. (arXiv:2202.07037v1 [stat.ML])
    Normalizing flows map an independent set of latent variables to their samples using a bijective transformation. Despite the exact correspondence between samples and latent variables, their high level relationship is not well understood. In this paper we characterize the geometric structure of flows using principal manifolds and understand the relationship between latent variables and samples using contours. We introduce a novel class of normalizing flows, called principal manifold flows (PF), whose contours are its principal manifolds, and a variant for injective flows (iPF) that is more efficient to train than regular injective flows. PFs can be constructed using any flow architecture, are trained with a regularized maximum likelihood objective and can perform density estimation on all of their principal manifolds. In our experiments we show that PFs and iPFs are able to learn the principal manifolds over a variety of datasets. Additionally, we show that PFs can perform density estimation on data that lie on a manifold with variable dimensionality, which is not possible with existing normalizing flows.  ( 2 min )
    Unlabeled Data Help: Minimax Analysis and Adversarial Robustness. (arXiv:2202.06996v1 [stat.ML])
    The recent proposed self-supervised learning (SSL) approaches successfully demonstrate the great potential of supplementing learning algorithms with additional unlabeled data. However, it is still unclear whether the existing SSL algorithms can fully utilize the information of both labelled and unlabeled data. This paper gives an affirmative answer for the reconstruction-based SSL algorithm \citep{lee2020predicting} under several statistical models. While existing literature only focuses on establishing the upper bound of the convergence rate, we provide a rigorous minimax analysis, and successfully justify the rate-optimality of the reconstruction-based SSL algorithm under different data generation models. Furthermore, we incorporate the reconstruction-based SSL into the existing adversarial training algorithms and show that learning from unlabeled data helps improve the robustness.  ( 2 min )

  • Open

    DeepMind Has Trained an AI to Control Nuclear Fusion
    submitted by /u/nick7566 [link] [comments]
    AI Training Tips and Comparisons
    submitted by /u/Kagermanov [link] [comments]
    AI inputs and processing
    First of all, I hope everyone reading this is having an awesome day so far! Okay so, I'm (fairly) new to AI but I have a question that was bothering me for a long time. If for example i train an AI on trading BTC for example, is the AI going to learn the relation between the actual price or between the relation of the price. Example: Is the AI going to learn if BTC drops below 30k the its good to buy Or: Is the AI going to learn if BC drops 5% buy Or none of these? Or a mix of both? Or does it depend on how I train it? What happens if I let the on BTC trained AI trade on ETH? Will it never trade because ETH is not at 30k-60k? My next guess was: what if I generalize price into something like a sigmoid function? But then I realized well 30k in the sigmoid function is still more than 3k so that might not work either. Fell free to ask me any questions regarding my question. submitted by /u/RenkeLudwig [link] [comments]  ( 1 min )
    any way to synthesize tweet like text from a database of over 1million tweets? I want to do this on a commercial level for a tweetbot. (Python preferred)
    any help is highly appreciaited :) submitted by /u/saucymacaroni [link] [comments]  ( 1 min )
    Bootstrapping Automation with Teleoperation and Data-Driven Reinforcement Learning
    submitted by /u/lorepieri [link] [comments]
    The 10 most exciting computer vision research applications in 2021! Perfect resource if you're wondering what happened in 2021 in AI/CV!
    submitted by /u/OnlyProggingForFun [link] [comments]
    Artificial Intelligence Newsletter
    The newsletter with a roundup of in-depth articles and news about artificial intelligence and machine learning that will provide you with recent AI technology trends and help you think about how AI can shape your company for the future. https://info.exadel.com/en/exadel-artificial-intelligence-newsletter submitted by /u/lklimusheuskaja [link] [comments]
    Day 2 of ai generated motivational quotes.
    submitted by /u/ScottBakerJr [link] [comments]  ( 1 min )
    AstraZeneca Researchers Explain the Concept and Applications of the Shapley Value in Machine Learning
    In many practical areas of machine learning, such as explainability, feature selection, data valuation, ensemble pruning, and federated learning, measuring relevance and attribution of various gains is a crucial topic. For example, one may wonder: How important is a certain feature in a machine learning model’s decisions? What is the value of a single data point? Which models in an ensemble are the most valuable? Specific ways have been used to solve these concerns in various sectors. Continue Reading Paper: https://arxiv.org/pdf/2202.05594v1.pdf https://preview.redd.it/992wcgjmo5i81.png?width=1876&format=png&auto=webp&s=309d64e304d52baec19e8ae4c834d1727691c8a1 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    How Artificial Intelligence Impacts the Future of Work?
    Artificial intelligence, or AI, seems to be on everyone's lips these days. While I've known about this key trend in tech growth for a while, I've witnessed AI become one of the most sought-after areas of knowledge for job seekers. While no one can foresee exactly how AI will progress in the future , current trends and advancements offer a very different picture of how AI will become a part of our life. In reality, AI is already at work all around us, influencing everything from our search results to our chances of finding love online to the way we purchase. The use of AI in several commercial areas has increased by 270 percent in the last four years, according to data. What, on the other hand, will AI entail for the future of work? AI will probably not make human workers obsolete, at least not for a long time . AI is going to become a standard in all businesses, not just in the world of tech. AI can have a big impact on the job search industry. AI can help in inventing new type of jobs opportunities, empowering existing ones. Lastly, AI has the potential to transform businesses and contribute to economic growth. submitted by /u/ai-bees [link] [comments]  ( 1 min )
    hopfield network, what is the question trying to say?
    Consider the following problem. We are required to create Discrete Hopfield Network with bipolar representation of input vector as [1 1 1 -1] or [1 1 1 0] (in case of binary representation) is stored in the network. Test the hopfield network with missing entries in the first and second component of the stored vector (i.e. [0 0 1 0]). source-: https://www.geeksforgeeks.org/hopfield-neural-network/ What do I need to test here? Can you just tell me?Do I need to test if it equals to input vector? Or what? submitted by /u/kalokeshmarelimai [link] [comments]  ( 1 min )
    What's the best sentiment analysis platform?
    I know everyone has their own preferences, but I was wondering if people had used any services for this before? I'm looking to use a platform for a side project that can hopefully turn into a business and was wondering what's easiest to use and is accurate. I have looked at Amazon's, Google's, and Monkey Learn's sentiment analysis products. If it comes down to purely price, I would be saving around $100 a month using Amazon or Google. I'm open to suggestions or if you guys have opinions on what I gave that would be great! submitted by /u/Mikel123123123 [link] [comments]  ( 1 min )
    Love you, bro
    submitted by /u/Bozzooo [link] [comments]
    The “Magic” Behind the AI Revolution - Explaining the basics of Artificial Neural Networks
    submitted by /u/InLamestTerms [link] [comments]
  • Open

    "Cerebro-cerebellar networks facilitate learning through feedback decoupling", Boven et al 2022 (cerebellum as synthetic gradients / critic)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    First time I got an RL policy on hardware!!
    submitted by /u/wtfbbq121 [link] [comments]  ( 1 min )
    "The Elusive Hunt for a Robot That Can Pick a Ripe Strawberry: It's a tricky, delicate task that combines machine vision and robotics. Progress has been slow, but entrepreneurs and farmers continue to invest"
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Magnetic control of tokamak plasmas through deep reinforcement learning", Degrave et al 2022 {DM}
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Bootstrapping Automation with Teleoperation and Data-Driven Reinforcement Learning
    submitted by /u/lorepieri [link] [comments]
  • Open

    [R] Cityscapes - foggy cityscapes Domain Adaptation
    Hi, i am happy to share with you the edited version of cityscapes to foggy cityscapes dataset for unsupervised domain adaptation ready to be used =) https://github.com/fpv-iplab/Cityscapes-FoggyCityscapes submitted by /u/CapitalShake3085 [link] [comments]
    [N] DeepMind is tackling controlled fusion through deep reinforcement learning
    Yesss.... A first paper in Nature today: Magnetic control of tokamak plasmas through deep reinforcement learning. After the proteins folding breakthrough, Deepmind is tackling controlled fusion through deep reinforcement learning (DRL). With the long-term promise of abundant energy without greenhouse gas emissions. What a challenge! But Deemind's Google's folks, you are our heros! Do it again! A Wired popular article. submitted by /u/ClaudeCoulombe [link] [comments]  ( 4 min )
    [P] Easy Labeling & Unsupervised Model Training using Scale Rapid
    Scale released an interesting blogpost recently, looks like anyone can just log in and start using Scale Rapid to get data labels. Also thought the use case with the anime CycleGAN was pretty cool and had some interesting results! submitted by /u/DragonfruitSalad1 [link] [comments]
    [Discussion] Fragility of AlphaZero vs Stockfish in chess-Domain adaptation.
    In terms of transfer learning/domain adaptation for these models I was wondering how you think minor changes to the game might affect the performance? Stockfish is a simulation based engine, while AlphaZero is the reinforcement learning engine. In a trained state, AlphaZero completely dominates Stockfish. Assuming both are trained on a regular chess set, suppose you change the starting squares of bishops and knights, would this completely mess up AlphaZero? I don't see it causing a huge problem for Stockfish due to the similarity in piece movements and method in evaluating a position. For AlphaZero, switching bishops and knights is not a possible starting position. It is close to a real position, and likely after a couple of moves it becomes a viable position. I understand if you train it on chess and apply it to checkers, it won't perform. But this is still "chess". As an extrapolation, what if it became other variations of chess such as Fischer random chess (with a random but mirrored back rank) or Plunder chess. submitted by /u/CrazySheepherder1339 [link] [comments]  ( 1 min )
    [D] Any good place to find TV scripts in text form for use as training data?
    I saw one of those fake "I forced an AI to watch 10000 hours of the office" posts float by my feed again and I want to make something like that real. I'm specifically looking for The Office and Breaking Bad. I can do any cleaning or refining of the data myself. submitted by /u/CosmicMemer [link] [comments]  ( 1 min )
    [D] Which platform for chatbot in a game? Huggingface vs. Rasa?
    I want to create a chatbot for a game I am working on. I'm currently using Rasa, but just saw that I can already access GPT-3 and did some testing on Playground and am amazed at how easy it is. After further googling I came across Huggingface which seems to be a cheaper option and I have access to other simpler models. I also like the Spaces option, which allows me to create some sort of test setup for my chatbot. I have several questions that are somehow related to my main question: It would be nice if I could use Huggingface's Spaces to create training data that I can use to further train my model. Is that even possible? Also, I can later host the chatbot on Huggingface and access it via an API, right? Does anyone here have some experience and can point me in the right direction? Which option would be the best? Thanks in advance! submitted by /u/KaputteNutte [link] [comments]  ( 2 min )
    [D] AI against Censorship: Genetic Algorithms, The Geneva Project, ML in Security, and more! (Video Interview)
    https://youtu.be/zcGOPqFZ4Tk Most of us conceive the internet as a free and open space where we are able to send traffic between any two nodes, but for large parts of the world this is not the case. Entire nations have large machinery in place to survey all internet traffic and automated procedures to block any undesirable connections. Evading such censorship has been largely a cat-and-mouse game between security researchers and government actors. A new system, called Geneva, uses a Genetic Algorithm in combination with Evolutionary Search in order to dynamically evade such censorship and adjust itself in real-time to any potential response by its adversaries. In this video, I talk to Security researcher Kevin Bock, who is one of Geneva's main contributors and member of the Breakerspace project. We talk about the evolution of internet censorship, how to evade it, how to mess with the censors' infrastructure, as well as the broader emerging connections between AI and Security. ​ OUTLINE: 0:00 - Intro 3:30 - What is automated censorship in networks? 7:20 - The evolution of censorship vs evasion 12:40 - Why do we need a dynamic, evolving system? 16:30 - The building blocks of Geneva 23:15 - Introducing evolution 28:30 - What's the censors' response? 31:45 - How was Geneva's media reception? 33:15 - Where do we go from here? 37:30 - Can we deliberately attack the censors? 47:00 - On responsible disclosure 49:40 - Breakerspace: Security research for undergrads 50:40 - How often do you get into trouble? 52:10 - How can I get started in security? ​ Learn more at: - Geneva (& more) project page: https://censorship.ai - Open Observatory of Network Interference: https://ooni.org - Censored Planet: https://censoredplanet.org - Breakerspace: https://breakerspace.cs.umd.edu submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Combining DVC and MLflow tools
    Do you use DVC and MLflow tools on your projects? I just wrote a blogpost about how to combine their power to manage your experiments and I would love to hear your feedback! Thanks! submitted by /u/__yannickw__ [link] [comments]  ( 2 min )
    [D] Is there any dataset designed to detect oilseed rape diseases using machine learning?
    I am working on a mobile application supporting the identification of rape diseases and I am looking for a suitable dataset, but so far I have not been able to find any that would be dedicated to the detection of this plant and its diseases. I was counting that I could find something on PlantVillage, however I was wrong. I will be very grateful for any help. Greetings. submitted by /u/2biczT [link] [comments]  ( 1 min )
    [D] TorchScript model vs ONNX
    What is your take on the comparison from the title? I was completing some course where one of the topics was deployment and an inference model was created which leveraged to_torchscript model. Until now I thought that ONNX was a way to go. My take until now is: Set up: may be a bit harder for ONNX since you need to write the code from translating the model from PyTorch to ONNX and the inference code (I may be wrong on this, please express your opinion) Speed: ??? This is what interests me the most submitted by /u/Icy_Fisherman7187 [link] [comments]  ( 1 min )
    [R] Federated Learning and data processing research papers online
    Hi ML community, we recently released all of our papers regarding ML / AI federation, scaling and multi-processing over our website: https://www.databloom.ai/science Its free, and we are happy to answer questions. We are also the team behind Apache Wayang, if you want to contribute, we are also happy! submitted by /u/2pk03 [link] [comments]
    [R] What are some up-to-date references to the evaluation of model fit across various types of generative models (GANs, flows, VAEs, diffusion, etc.)?
    Hi all, I am interested in finding references to up-to-date evaluation methods for generative modeling in unsupervised tasks across different model types. I am familiar with papers such as https://arxiv.org/abs/2002.09797, https://arxiv.org/abs/2102.08921, https://arxiv.org/abs/1806.00035, which are motivated by being unable to estimate the likelihood or a lower-bound on the likelihood when using GANs. Suppose one hoped to compare the performance across GANs, flows, and VAEs in a particular scenario. Would one of the above references, or something similar, be an approach you would consider? Also, say you ignore GANs and compare models such as flows, VAEs, and other latent variable models. Would you consider similar approaches? I understand these models involve likelihood estimation or estimating a lower bound on the likelihood, but considering it is a lower bound for some of these models, comparing these seems off. I appreciate any help you can provide. submitted by /u/hasco641 [link] [comments]  ( 1 min )
    [Discussion] Multi-target Domain Adaptation vs Fine-tuning/Re-training of networks
    I'm currently doing some research in multi-target domain adaptation from one dataset to multiple datasets. I was wondering what the industry application of domain adaptation is. Big companies like Google/FB have the resources to manually label datasets for a specific domain and hence train a model to be very good at that specific domain. On the flip side, when using a multi-target domain adaptation method, you get a more generalised model, however, the performance on that specific domain may suffer. If the companies' primary goal is to maximise performance, is there much of a benefit of having a generalised model over a specialised one? Then the follow-up question is: Is research in domain adaptation beneficial to companies that have the resources to collect enough data? If so, how? submitted by /u/remmsta [link] [comments]  ( 1 min )
    [R] Introducing the ICBe Dataset: Very High Recall and Precision Event Extraction from Narratives about International Crises
    Link: https://arxiv.org/abs/2202.07081 Abstract: "How do international crises unfold? We conceive of international affairs as a strategic chess game between adversaries, necessitating a systematic way to measure pieces, moves, and gambits accurately and consistently over different contexts and periods. We develop such a measurement strategy with an ontology of crisis actions and interactions and apply it to a high-quality corpus of crisis narratives recorded by the International Crisis Behavior (ICB) Project. We demonstrate that the ontology has high coverage over most of the thoughts, speech, and actions contained in these narratives and produces high inter-coder agreement when applied by human coders. We introduce a new crisis event dataset ICB Events (ICBe). We find that ICBe captures the process of a crisis with greater accuracy and granularity than other well-regarded events or crisis datasets. We make the data, replication material, and additional visualizations available at a companion website (http://www.crisisevents.org/)." A cool looking paper, with a dataset going back 200 years. submitted by /u/bikeskata [link] [comments]  ( 1 min )
    [P] Probabilistic Machine Learning: An Introduction (New Textbook by Kevin Murphy)
    The textbook is published in print format, but a pdf version (recent draft) is available as a pdf. Link: https://probml.github.io/pml-book/book1.html submitted by /u/hardmaru [link] [comments]  ( 1 min )
    [R]: Compute Trends Across Three Eras of Machine Learning
    submitted by /u/giugiacaglia [link] [comments]  ( 2 min )
    [D] Paper Explained - HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning (video analysis w/ first author interview)
    https://youtu.be/D6osiiEoV0w This video contains a paper explanation and an interview with author Andrey Zhmoginov! Few-shot learning is an interesting sub-field in meta-learning, with wide applications, such as creating personalized models based on just a handful of data points. Traditionally, approaches have followed the BERT approach where a large model is pre-trained and then fine-tuned. However, this couples the size of the final model to the size of the model that has been pre-trained. Similar problems exist with "true" meta-learners, such as MaML. HyperTransformer fundamentally decouples the meta-learner from the size of the final model by directly predicting the weights of the final model. The HyperTransformer takes the few-shot dataset as a whole into its context and predicts either one or multiple layers of a (small) ConvNet, meaning its output are the weights of the convolution filters. Interestingly, and with the correct engineering care, this actually appears to deliver promising results and can be extended in many ways. ​ OUTLINE: 0:00 - Intro & Overview 3:05 - Weight-generation vs Fine-tuning for few-shot learning 10:10 - HyperTransformer model architecture overview 22:30 - Why the self-attention mechanism is useful here 34:45 - Start of Interview 39:45 - Can neural networks even produce weights of other networks? 47:00 - How complex does the computational graph get? 49:45 - Why are transformers particularly good here? 58:30 - What can the attention maps tell us about the algorithm? 1:07:00 - How could we produce larger weights? 1:09:30 - Diving into experimental results 1:14:30 - What questions remain open? ​ Paper: https://arxiv.org/abs/2201.04182 submitted by /u/ykilcher [link] [comments]  ( 1 min )
  • Open

    How SIGNAL IDUNA operationalizes machine learning projects on AWS
    This post is co-authored with Jan Paul Assendorp, Thomas Lietzow, Christopher Masch, Alexander Meinert, Dr. Lars Palzer, Jan Schillemans of SIGNAL IDUNA. At SIGNAL IDUNA, a large German insurer, we are currently reinventing ourselves with our transformation program VISION2023 to become even more customer oriented. Two aspects are central to this transformation: the reorganization of […]  ( 8 min )
    Bongo Learn provides real-time feedback to improve learning outcomes with Amazon Transcribe
    Real-time feedback helps drive learning. This is especially important for designing presentations, learning new languages, and strengthening other essential skills that are critical to succeed in today’s workplace. However, many students and lifelong learners lack access to effective face-to-face instruction to hone these skills. In addition, with the rapid adoption of remote learning, educators are […]  ( 4 min )
    Prepare time series data with Amazon SageMaker Data Wrangler
    Time series data is widely present in our lives. Stock prices, house prices, weather information, and sales data captured over time are just a few examples. As businesses increasingly look for new ways to gain meaningful insights from time-series data, the ability to visualize data and apply desired transformations are fundamental steps. However, time-series data […]  ( 12 min )
  • Open

    Almost periodic functions
    When you see the word “almost” in a mathematical context, it might be used informally, but often it has a precise meaning. I wrote about this before in the post Common words that have a technical meaning. Often the technical meaning of “almost” is “within any finite tolerance.” That’s how it is used in the […] Almost periodic functions first appeared on John D. Cook.  ( 2 min )
    2s, 5s, and decibels
    Last night I ran across a delightful Twitter thread by Nathan Mishra-Linger that starts out by saying It’s nice that 2^7 ≈ 5^3, because that lets you approximate any integer valued decibel quantity as a product of powers of 2 and 5. The observation that 27 ≈ 53, is more familiar if you multiply both […] 2s, 5s, and decibels first appeared on John D. Cook.  ( 3 min )
  • Open

    What are the best resources to learn to use Keras and Tensorflow Online?
    I am currently enrolled in an AI class at my college where we are learning Keras. I, for a different class, and planning on building a project using neural nets. I was talking with my professor and he said some of the stuff I want to do isn't going to get covered or isn't going to be covered in time for my project, so I'm venturing out on my own. What're some good tools to help get me through the basics and into the weeds quickly? submitted by /u/HoneyBunchsOGoats [link] [comments]  ( 1 min )
    Trained the second model. how do you like the result?
    It remains to write ui. But this will have to wait, as I am currently busy developing the Silent Install Builder. before after Silent Install Builder is a project for installing applications that do not have automatic installation. With UI Recorder you can record your installer. Silent Install Builde Hope you don't mind if I mention other non-AI related software. I am just want to finish the software. submitted by /u/vlad_ma [link] [comments]  ( 1 min )
    hopfield network, what is the question trying to say?
    Consider the following problem. We are required to create Discrete Hopfield Network with bipolar representation of input vector as [1 1 1 -1] or [1 1 1 0] (in case of binary representation) is stored in the network. Test the hopfield network with missing entries in the first and second component of the stored vector (i.e. [0 0 1 0]). source-: https://www.geeksforgeeks.org/hopfield-neural-network/ What do I need to test here? Can you just tell me?Do I need to test if it equals to input vector? Or what? submitted by /u/kalokeshmarelimai [link] [comments]  ( 1 min )
    Order of images during ANN training
    How does changing the order of images fed to an ANN during training change the end model? Let's say I have 1000 images of cats and 1000 images of dogs. I want to build an ANN for classification into cats,dogs. If I feed the cats first then dogs how will the resulting model be different than if I feed dogs first and then cats? submitted by /u/rand3289 [link] [comments]  ( 1 min )
  • Open

    How to Use Blockchain Technology In Human Resource
    Easily one of the most revolutionary technologies in recent times — Blockchain is slated to disrupt so many industries that its…  ( 4 min )
    Why High-Quality Annotated Data is Required for Machine Learning?
    We are surrounded by artificial intelligence (AI) and machine learning (ML). Both AI and machine learning have altered the way we live and…  ( 4 min )
  • Open

    Reimagining Modern Luxury: NVIDIA Announces Partnership with Jaguar Land Rover
    Jaguar Land Rover and NVIDIA are redefining modern luxury, infusing intelligence into the customer experience. As part of its Reimagine strategy, Jaguar Land Rover announced today that it will develop its upcoming vehicles on the full-stack NVIDIA DRIVE Hyperion 8 platform, with DRIVE Orin delivering a wide spectrum of active safety, automated driving and parking Read article > The post Reimagining Modern Luxury: NVIDIA Announces Partnership with Jaguar Land Rover appeared first on The Official NVIDIA Blog.  ( 3 min )
    The Greatest Podcast Ever Recorded
    Is this the best podcast ever recorded? Let’s just say you don’t need a GPU to know that’s a stretch. But it’s pretty great if you’re a fan of tall tales. And better still if you’re not a fan of stretching the truth at all. That’s because detecting hyperbole may one day get more manageable, Read article > The post The Greatest Podcast Ever Recorded appeared first on The Official NVIDIA Blog.  ( 3 min )
    Atos Previews Energy-Efficient, AI-Augmented Hybrid Supercomputer
    Stepping deeper into the era of exascale AI, Atos gave the first look at its next-generation high-performance computer. The BullSequana XH3000 combines Atos’ patented fourth-generation liquid-cooled HPC design with NVIDIA technologies to deliver both more performance and energy efficiency. Giving users a choice of Arm or x86 computing architectures, it will come in versions using Read article > The post Atos Previews Energy-Efficient, AI-Augmented Hybrid Supercomputer appeared first on The Official NVIDIA Blog.  ( 3 min )
  • Open

    Feature Construction and Selection for PV Solar Power Modeling. (arXiv:2202.06226v1 [eess.SY])
    Using solar power in the process industry can reduce greenhouse gas emissions and make the production process more sustainable. However, the intermittent nature of solar power renders its usage challenging. Building a model to predict photovoltaic (PV) power generation allows decision-makers to hedge energy shortages and further design proper operations. The solar power output is time-series data dependent on many factors, such as irradiance and weather. A machine learning framework for 1-hour ahead solar power prediction is developed in this paper based on the historical data. Our method extends the input dataset into higher dimensional Chebyshev polynomial space. Then, a feature selection scheme is developed with constrained linear regression to construct the predictor for different weather types. Several tests show that the proposed approach yields lower mean squared error than classical machine learning methods, such as support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT).  ( 2 min )
    Statistical limits of dictionary learning: random matrix theory and the spectral replica method. (arXiv:2109.06610v3 [cs.IT] UPDATED)
    We consider increasingly complex models of matrix denoising and dictionary learning in the Bayes-optimal setting, in the challenging regime where the matrices to infer have a rank growing linearly with the system size. This is in contrast with most existing literature concerned with the low-rank (i.e., constant-rank) regime. We first consider a class of rotationally invariant matrix denoising problems whose mutual information and minimum mean-square error are computable using techniques from random matrix theory. Next, we analyze the more challenging models of dictionary learning. To do so we introduce a novel combination of the replica method from statistical mechanics together with random matrix theory, coined spectral replica method. This allows us to derive variational formulas for the mutual information between hidden representations and the noisy data of the dictionary learning problem, as well as for the overlaps quantifying the optimal reconstruction error. The proposed method reduces the number of degrees of freedom from $\Theta(N^2)$ matrix entries to $\Theta(N)$ eigenvalues (or singular values), and yields Coulomb gas representations of the mutual information which are reminiscent of matrix models in physics. The main ingredients are a combination of large deviation results for random matrices together with a new replica symmetric decoupling ansatz at the level of the probability distributions of eigenvalues (or singular values) of certain overlap matrices and the use of HarishChandra-Itzykson-Zuber spherical integrals.  ( 2 min )
    Emotion Based Hate Speech Detection using Multimodal Learning. (arXiv:2202.06218v1 [cs.LG])
    In recent years, monitoring hate speech and offensive language on social media platforms has become paramount due to its widespread usage among all age groups, races, and ethnicities. Consequently, there have been substantial research efforts towards automated detection of such content using Natural Language Processing (NLP). While successfully filtering textual data, no research has focused on detecting hateful content in multimedia data. With increased ease of data storage and the exponential growth of social media platforms, multimedia content proliferates the internet as much as text data. Nevertheless, it escapes the automatic filtering systems. Hate speech and offensiveness can be detected in multimedia primarily via three modalities, i.e., visual, acoustic, and verbal. Our preliminary study concluded that the most essential features in classifying hate speech would be the speaker's emotional state and its influence on the spoken words, therefore limiting our current research to these modalities. This paper proposes the first multimodal deep learning framework to combine the auditory features representing emotion and the semantic features to detect hateful content. Our results demonstrate that incorporating emotional attributes leads to significant improvement over text-based models in detecting hateful multimedia content. This paper also presents a new Hate Speech Detection Video Dataset (HSDVD) collected for the purpose of multimodal learning as no such dataset exists today.  ( 2 min )
    Towards Understanding and Defending Input Space Trojans. (arXiv:2202.06382v1 [cs.LG])
    Deep Neural Networks (DNNs) can learn Trojans (or backdoors) from benign or poisoned data, which raises security concerns of using them. By exploiting such Trojans, the adversary can add a fixed input space perturbation to any given input to mislead the model predicting certain outputs (i.e., target labels). In this paper, we analyze such input space Trojans in DNNs, and propose a theory to explain the relationship of a model's decision regions and Trojans: a complete and accurate Trojan corresponds to a hyperplane decision region in the input domain. We provide a formal proof of this theory, and provide empirical evidence to support the theory and its relaxations. Based on our analysis, we design a novel training method that removes Trojans during training even on poisoned datasets, and evaluate our prototype on five datasets and five different attacks. Results show that our method outperforms existing solutions. Code: \url{https://anonymous.4open.science/r/NOLE-84C3}.  ( 2 min )
    Multi network InfoMax: A pre-training method involving graph convolutional networks. (arXiv:2111.01276v2 [cs.LG] UPDATED)
    Discovering distinct features and their relations from data can help us uncover valuable knowledge crucial for various tasks, e.g., classification. In neuroimaging, these features could help to understand, classify, and possibly prevent brain disorders. Model introspection of highly performant overparameterized deep learning (DL) models could help find these features and relations. However, to achieve high-performance level DL models require numerous labeled training samples ($n$) rarely available in many fields. This paper presents a pre-training method involving graph convolutional/neural networks (GCNs/GNNs), based on maximizing mutual information between two high-level embeddings of an input sample. Many of the recently proposed pre-training methods pre-train one of many possible networks of an architecture. Since almost every DL model is an ensemble of multiple networks, we take our high-level embeddings from two different networks of a model --a convolutional and a graph network--. The learned high-level graph latent representations help increase performance for downstream graph classification tasks and bypass the need for a high number of labeled data samples. We apply our method to a neuroimaging dataset for classifying subjects into healthy control (HC) and schizophrenia (SZ) groups. Our experiments show that the pre-trained model significantly outperforms the non-pre-trained model and requires $50\%$ less data for similar performance.  ( 2 min )
    LEAF: Navigating Concept Drift in Cellular Networks. (arXiv:2109.03011v2 [cs.NI] UPDATED)
    Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Yet, unfortunately, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target prediction changes due to reasons ranging from software upgrades to seasonality to changes in user behavior. Mitigating concept drift is thus an essential part of operationalizing machine learning models, and yet despite its importance, concept drift has not been extensively explored in the context of networking -- or regression models in general. Thus, it is not well-understood how to detect or mitigate it for many common network management tasks that currently rely on machine learning models. As we show, concept drift cannot always be mitigated by periodic retraining models using newly available data, and doing so can even degrade model accuracy. In this paper, we characterize concept drift in a large cellular network for a metropolitan area in the United States. We find that concept drift occurs across key performance indicators (KPIs), regardless of model, training set size, and time interval -- thus necessitating practical approaches to detect, explain, and mitigate it. To do so, we develop Local Error Approximation of Features (LEAF). LEAF detects drift; explains features and time intervals that most contribute to drift; and mitigates drift using resampling, augmentation, or ensembling. We evaluate LEAF against industry-standard mitigations (i.e., periodic retraining) with more than three years of cellular data from Verizon. LEAF consistently outperforms periodic retraining on a variety of KPIs and models, while reducing costly retrains by an order of magnitude. Due to its effectiveness, a major cellular carrier is now integrating LEAF into its forecasting and provisioning processes.  ( 2 min )
    Sharp Bounds for Federated Averaging (Local SGD) and Continuous Perspective. (arXiv:2111.03741v2 [cs.LG] UPDATED)
    Federated Averaging (FedAvg), also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL). Despite its simplicity and popularity, the convergence rate of FedAvg has thus far been undetermined. Even under the simplest assumptions (convex, smooth, homogeneous, and bounded covariance), the best-known upper and lower bounds do not match, and it is not clear whether the existing analysis captures the capacity of the algorithm. In this work, we first resolve this question by providing a lower bound for FedAvg that matches the existing upper bound, which shows the existing FedAvg upper bound analysis is not improvable. Additionally, we establish a lower bound in a heterogeneous setting that nearly matches the existing upper bound. While our lower bounds show the limitations of FedAvg, under an additional assumption of third-order smoothness, we prove more optimistic state-of-the-art convergence results in both convex and non-convex settings. Our analysis stems from a notion we call iterate bias, which is defined by the deviation of the expectation of the SGD trajectory from the noiseless gradient descent trajectory with the same initialization. We prove novel sharp bounds on this quantity, and show intuitively how to analyze this quantity from a Stochastic Differential Equation (SDE) perspective.  ( 2 min )
    Skeletal Feature Compensation for Imitation Learning with Embodiment Mismatch. (arXiv:2104.07810v2 [cs.LG] UPDATED)
    Learning from demonstrations in the wild (e.g. YouTube videos) is a tantalizing goal in imitation learning. However, for this goal to be achieved, imitation learning algorithms must deal with the fact that the demonstrators and learners may have bodies that differ from one another. This condition -- "embodiment mismatch" -- is ignored by many recent imitation learning algorithms. Our proposed imitation learning technique, SILEM (\textbf{S}keletal feature compensation for \textbf{I}mitation \textbf{L}earning with \textbf{E}mbodiment \textbf{M}ismatch), addresses a particular type of embodiment mismatch by introducing a learned affine transform to compensate for differences in the skeletal features obtained from the learner and expert. We create toy domains based on PyBullet's HalfCheetah and Ant to assess SILEM's benefits for this type of embodiment mismatch. We also provide qualitative and quantitative results on more realistic problems -- teaching simulated humanoid agents, including Atlas from Boston Dynamics, to walk by observing human demonstrations.  ( 2 min )
    STOPS: Short-Term-based Volatility-controlled Policy Search and its Global Convergence. (arXiv:2201.09857v2 [cs.LG] UPDATED)
    It remains challenging to deploy existing risk-averse approaches to real-world applications. The reasons are multi-fold, including the lack of global optimality guarantee and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to involving visiting hazardous states, which is a major concern in the risk-averse setting. This paper proposes \textbf{\ul{S}}hort-\textbf{\ul{T}}erm V\textbf{\ul{O}}latility-controlled \textbf{\ul{P}}olicy \textbf{\ul{S}}earch (STOPS), a novel algorithm that solves risk-averse problems by learning from short-term trajectories instead of long-term trajectories. Short-term trajectories are more flexible to generate, and can avoid the danger of hazardous state visitations. By using an actor-critic scheme with an overparameterized two-layer neural network, our algorithm finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient, with effectiveness comparable to the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on challenging Mujoco robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate a state-of-the-art level of STOPS' performance among existing risk-averse policy search methods.  ( 2 min )
    TCCT: Tightly-Coupled Convolutional Transformer on Time Series Forecasting. (arXiv:2108.12784v2 [cs.LG] UPDATED)
    Time series forecasting is essential for a wide range of real-world applications. Recent studies have shown the superiority of Transformer in dealing with such problems, especially long sequence time series input(LSTI) and long sequence time series forecasting(LSTF) problems. To improve the efficiency and enhance the locality of Transformer, these studies combine Transformer with CNN in varying degrees. However, their combinations are loosely-coupled and do not make full use of CNN. To address this issue, we propose the concept of tightly-coupled convolutional Transformer(TCCT) and three TCCT architectures which apply transformed CNN architectures into Transformer: (1) CSPAttention: through fusing CSPNet with self-attention mechanism, the computation cost of self-attention mechanism is reduced by 30% and the memory usage is reduced by 50% while achieving equivalent or beyond prediction accuracy. (2) Dilated causal convolution: this method is to modify the distilling operation proposed by Informer through replacing canonical convolutional layers with dilated causal convolutional layers to gain exponentially receptive field growth. (3) Passthrough mechanism: the application of passthrough mechanism to stack of self-attention blocks helps Transformer-like models get more fine-grained information with negligible extra computation costs. Our experiments on real-world datasets show that our TCCT architectures could greatly improve the performance of existing state-of-art Transformer models on time series forecasting with much lower computation and memory costs, including canonical Transformer, LogTrans and Informer.  ( 2 min )
    Generalization of GANs and overparameterized models under Lipschitz continuity. (arXiv:2104.02388v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) are so complex that the existing learning theories do not provide a satisfactory explanation for why GANs have great success in practice. The same situation also remains largely open for deep neural networks. To fill this gap, we introduce a Lipschitz theory to analyze generalization. We demonstrate its simplicity by analyzing generalization and consistency of overparameterized neural networks. We then use this theory to derive Lipschitz-based generalization bounds for GANs. Our bounds show that penalizing the Lipschitz constant of the GAN loss can improve generalization. This result answers the long mystery of why the popular use of Lipschitz constraint for GANs often leads to great success, empirically without a solid theory. Finally but surprisingly, we show that, when using Dropout or spectral normalization, both \emph{truly deep} neural networks and GANs can generalize well without the curse of dimensionality.  ( 2 min )
    Learning Disentangled Representation by Exploiting Pretrained Generative Models: A Contrastive Learning View. (arXiv:2102.10543v2 [cs.CV] UPDATED)
    From the intuitive notion of disentanglement, the image variations corresponding to different factors should be distinct from each other, and the disentangled representation should reflect those variations with separate dimensions. To discover the factors and learn disentangled representation, previous methods typically leverage an extra regularization term when learning to generate realistic images. However, the term usually results in a trade-off between disentanglement and generation quality. For the generative models pretrained without any disentanglement term, the generated images show semantically meaningful variations when traversing along different directions in the latent space. Based on this observation, we argue that it is possible to mitigate the trade-off by $(i)$ leveraging the pretrained generative models with high generation quality, $(ii)$ focusing on discovering the traversal directions as factors for disentangled representation learning. To achieve this, we propose Disentaglement via Contrast (DisCo) as a framework to model the variations based on the target disentangled representations, and contrast the variations to jointly discover disentangled directions and learn disentangled representations. DisCo achieves the state-of-the-art disentangled representation learning and distinct direction discovering, given pretrained non-disentangled generative models including GAN, VAE, and Flow. Source code is at https://github.com/xrenaa/DisCo.  ( 2 min )
    A Unified Analysis Method for Online Optimization in Normed Vector Space. (arXiv:2112.12134v4 [cs.LG] UPDATED)
    This paper studies online optimization from a high-level unified theoretical perspective. We not only generalize both Optimistic-DA and Optimistic-MD in normed vector space, but also unify their analysis methods for dynamic regret. Regret bounds are the tightest possible due to the introduction of $\phi$-convex. As instantiations, regret bounds of normalized exponentiated subgradient and greedy/lazy projection are better than the currently known optimal results. By replacing losses of online game with monotone operators, and extending the definition of regret, namely regret$^n$, we extend online convex optimization to online monotone optimization.  ( 2 min )
    Vector Quantized Bayesian Neural Network Inference for Data Streams. (arXiv:1907.05911v3 [cs.LG] CROSS LISTED)
    Bayesian neural networks (BNN) can estimate the uncertainty in predictions, as opposed to non-Bayesian neural networks (NNs). However, BNNs have been far less widely used than non-Bayesian NNs in practice since they need iterative NN executions to predict a result for one data, and it gives rise to prohibitive computational cost. This computational burden is a critical problem when processing data streams with low-latency. To address this problem, we propose a novel model VQ-BNN, which approximates BNN inference for data streams. In order to reduce the computational burden, VQ-BNN inference predicts NN only once and compensates the result with previously memorized predictions. To be specific, VQ-BNN inference for data streams is given by temporal exponential smoothing of recent predictions. The computational cost of this model is almost the same as that of non-Bayesian NNs. Experiments including semantic segmentation on real-world data show that this model performs significantly faster than BNNs while estimating predictive results comparable to or superior to the results of BNNs.  ( 2 min )
    Supported Policy Optimization for Offline Reinforcement Learning. (arXiv:2202.06239v1 [cs.LG])
    Policy constraint methods to offline reinforcement learning (RL) typically utilize parameterization or regularization that constrains the policy to perform actions within the support set of the behavior policy. The elaborative designs of parameterization methods usually intrude into the policy networks, which may bring extra inference cost and cannot take full advantage of well-established online methods. Regularization methods reduce the divergence between the learned policy and the behavior policy, which may mismatch the inherent density-based definition of support set thereby failing to avoid the out-of-distribution actions effectively. This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint. SPOT adopts a VAE-based density estimator to explicitly model the support set of behavior policy and presents a simple but effective density-based regularization term, which can be plugged non-intrusively into off-the-shelf off-policy RL algorithms. On the standard benchmarks for offline RL, SPOT substantially outperforms state-of-the-art offline RL methods. Benefiting from the pluggable design, the offline pretrained models from SPOT can also be applied to perform online fine-tuning seamlessly.  ( 2 min )
    On the Convergence of Clustered Federated Learning. (arXiv:2202.06187v1 [cs.LG])
    In a federated learning system, the clients, e.g. mobile devices and organization participants, usually have different personal preferences or behavior patterns, namely Non-IID data problems across clients. Clustered federated learning is to group users into different clusters that the clients in the same group will share the same or similar behavior patterns that are to satisfy the IID data assumption for most traditional machine learning algorithms. Most of the existing clustering methods in FL treat every client equally that ignores the different importance contributions among clients. This paper proposes a novel weighted client-based clustered FL algorithm to leverage the client's group and each client in a unified optimization framework. Moreover, the paper proposes convergence analysis to the proposed clustered FL method. The experimental analysis has demonstrated the effectiveness of the proposed method.  ( 2 min )
    Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. (arXiv:2110.02642v4 [cs.LG] UPDATED)
    Unsupervised detection of anomaly points in time series is a challenging problem, which requires the model to derive a distinguishable criterion. Previous methods tackle the problem mainly through learning pointwise representation or pairwise association, however, neither is sufficient to reason about the intricate dynamics. Recently, Transformers have shown great power in unified modeling of pointwise representation and pairwise association, and we find that the self-attention weight distribution of each time point can embody rich association with the whole series. Our key observation is that due to the rarity of anomalies, it is extremely difficult to build nontrivial associations from abnormal points to the whole series, thereby, the anomalies' associations shall mainly concentrate on their adjacent time points. This adjacent-concentration bias implies an association-based criterion inherently distinguishable between normal and abnormal points, which we highlight through the \emph{Association Discrepancy}. Technically, we propose the \emph{Anomaly Transformer} with a new \emph{Anomaly-Attention} mechanism to compute the association discrepancy. A minimax strategy is devised to amplify the normal-abnormal distinguishability of the association discrepancy. The Anomaly Transformer achieves state-of-the-art results on six unsupervised time series anomaly detection benchmarks of three applications: service monitoring, space & earth exploration, and water treatment.  ( 2 min )
    Robust and Information-theoretically Safe Bias Classifier against Adversarial Attacks. (arXiv:2111.04404v2 [cs.LG] UPDATED)
    In this paper, the bias classifier is introduced, that is, the bias part of a DNN with Relu as the activation function is used as a classifier. The work is motivated by the fact that the bias part is a piecewise constant function with zero gradient and hence cannot be directly attacked by gradient-based methods to generate adversaries, such as FGSM. The existence of the bias classifier is proved and an effective training method for the bias classifier is given. It is proved that by adding a proper random first-degree part to the bias classifier, an information-theoretically safe classifier against the original-model gradient attack is obtained in the sense that the attack will generate a totally random attacking direction. This seems to be the first time that the concept of information-theoretically safe classifier is proposed. Several attack methods for the bias classifier are proposed and numerical experiments are used to show that the bias classifier is more robust than DNNs with similar size against these attacks in most cases.  ( 2 min )
    Blurs Behave Like Ensembles: Spatial Smoothings to Improve Accuracy, Uncertainty, and Robustness. (arXiv:2105.12639v2 [cs.LG] UPDATED)
    Neural network ensembles, such as Bayesian neural networks (BNNs), have shown success in the areas of uncertainty estimation and robustness. However, a crucial challenge prohibits their use in practice. BNNs require a large number of predictions to produce reliable results, leading to a significant increase in computational cost. To alleviate this issue, we propose spatial smoothing, a method that spatially ensembles neighboring feature map points of convolutional neural networks. By simply adding a few blur layers to the models, we empirically show that spatial smoothing improves accuracy, uncertainty estimation, and robustness of BNNs across a whole range of ensemble sizes. In particular, BNNs incorporating spatial smoothing achieve high predictive performance merely with a handful of ensembles. Moreover, this method also can be applied to canonical deterministic neural networks to improve the performances. A number of evidences suggest that the improvements can be attributed to the stabilized feature maps and the smoothing of the loss landscape. In addition, we provide a fundamental explanation for prior works - namely, global average pooling, pre-activation, and ReLU6 - by addressing them as special cases of spatial smoothing. These not only enhance accuracy, but also improve uncertainty estimation and robustness by making the loss landscape smoother in the same manner as spatial smoothing. The code is available at https://github.com/xxxnell/spatial-smoothing.  ( 2 min )
    Robust Learning from Observation with Model Misspecification. (arXiv:2202.06003v1 [cs.LG])
    Imitation learning (IL) is a popular paradigm for training policies in robotic systems when specifying the reward function is difficult. However, despite the success of IL algorithms, they impose the somewhat unrealistic requirement that the expert demonstrations must come from the same domain in which a new imitator policy is to be learned. We consider a practical setting, where (i) state-only expert demonstrations from the real (deployment) environment are given to the learner, (ii) the imitation learner policy is trained in a simulation (training) environment whose transition dynamics is slightly different from the real environment, and (iii) the learner does not have any access to the real environment during the training phase beyond the batch of demonstrations given. Most of the current IL methods, such as generative adversarial imitation learning and its state-only variants, fail to imitate the optimal expert behavior under the above setting. By leveraging insights from the Robust reinforcement learning (RL) literature and building on recent adversarial imitation approaches, we propose a robust IL algorithm to learn policies that can effectively transfer to the real environment without fine-tuning. Furthermore, we empirically demonstrate on continuous-control benchmarks that our method outperforms the state-of-the-art state-only IL method in terms of the zero-shot transfer performance in the real environment and robust performance under different testing conditions.  ( 2 min )
    Online V2X Scheduling for Raw-Level Cooperative Perception. (arXiv:2202.06085v1 [cs.IT])
    Cooperative perception of connected vehicles comes to the rescue when the field of view restricts stand-alone intelligence. While raw-level cooperative perception preserves most information to guarantee accuracy, it is demanding in communication bandwidth and computation power. Therefore, it is important to schedule the most beneficial vehicle to share its sensor in terms of supplementary view and stable network connection. In this paper, we present a model of raw-level cooperative perception and formulate the energy minimization problem of sensor sharing scheduling as a variant of the Multi-Armed Bandit (MAB) problem. Specifically, volatility of the neighboring vehicles, heterogeneity of V2X channels, and the time-varying traffic context are taken into consideration. Then we propose an online learning-based algorithm with logarithmic performance loss, achieving a decent trade-off between exploration and exploitation. Simulation results under different scenarios indicate that the proposed algorithm quickly learns to schedule the optimal cooperative vehicle and saves more energy as compared to baseline algorithms.  ( 2 min )
    On the complexity of All $\varepsilon$-Best Arms Identification. (arXiv:2202.06280v1 [stat.ML])
    We consider the problem introduced by \cite{Mason2020} of identifying all the $\varepsilon$-optimal arms in a finite stochastic multi-armed bandit with Gaussian rewards. In the fixed confidence setting, we give a lower bound on the number of samples required by any algorithm that returns the set of $\varepsilon$-good arms with a failure probability less than some risk level $\delta$. This bound writes as $T_{\varepsilon}^*(\mu)\log(1/\delta)$, where $T_{\varepsilon}^*(\mu)$ is a characteristic time that depends on the vector of mean rewards $\mu$ and the accuracy parameter $\varepsilon$. We also provide an efficient numerical method to solve the convex max-min program that defines the characteristic time. Our method is based on a complete characterization of the alternative bandit instances that the optimal sampling strategy needs to rule out, thus making our bound tighter than the one provided by \cite{Mason2020}. Using this method, we propose a Track-and-Stop algorithm that identifies the set of $\varepsilon$-good arms w.h.p and enjoys asymptotic optimality (when $\delta$ goes to zero) in terms of the expected sample complexity. Finally, using numerical simulations, we demonstrate our algorithm's advantage over state-of-the-art methods, even for moderate values of the risk parameter.  ( 2 min )
    Towards Automated Error Analysis: Learning to Characterize Errors. (arXiv:2201.05017v3 [cs.CL] UPDATED)
    Characterizing the patterns of errors that a system makes helps researchers focus future development on increasing its accuracy and robustness. We propose a novel form of "meta learning" that automatically learns interpretable rules that characterize the types of errors that a system makes, and demonstrate these rules' ability to help understand and improve two NLP systems. Our approach works by collecting error cases on validation data, extracting meta-features describing these samples, and finally learning rules that characterize errors using these features. We apply our approach to VilBERT, for Visual Question Answering, and RoBERTa, for Common Sense Question Answering. Our system learns interpretable rules that provide insights into systemic errors these systems make on the given tasks. Using these insights, we are also able to "close the loop" and modestly improve performance of these systems.  ( 2 min )
    Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost. (arXiv:2202.06385v1 [cs.LG])
    We study the problem of reinforcement learning (RL) with low (policy) switching cost - a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S,A$ denotes the number of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. We also prove an information-theoretical lower bound which says that a switching cost of $\Omega(HSA)$ is required for any no-regret algorithm. As a byproduct, our new algorithmic techniques allow us to derive a \emph{reward-free} exploration algorithm with an optimal switching cost of $O(HSA)$.  ( 2 min )
    An Evaluation of Change Point Detection Algorithms. (arXiv:2003.06222v3 [stat.ML] UPDATED)
    Change point detection is an important part of time series analysis, as the presence of a change point indicates an abrupt and significant change in the data generating process. While many algorithms for change point detection have been proposed, comparatively little attention has been paid to evaluating their performance on real-world time series. Algorithms are typically evaluated on simulated data and a small number of commonly-used series with unreliable ground truth. Clearly this does not provide sufficient insight into the comparative performance of these algorithms. Therefore, instead of developing yet another change point detection method, we consider it vastly more important to properly evaluate existing algorithms on real-world data. To achieve this, we present a data set specifically designed for the evaluation of change point detection algorithms that consists of 37 time series from various application domains. Each series was annotated by five human annotators to provide ground truth on the presence and location of change points. We analyze the consistency of the human annotators, and describe evaluation metrics that can be used to measure algorithm performance in the presence of multiple ground truth annotations. Next, we present a benchmark study where 14 algorithms are evaluated on each of the time series in the data set. Our aim is that this data set will serve as a proving ground in the development of novel change point detection algorithms.  ( 2 min )
    PQuAD: A Persian Question Answering Dataset. (arXiv:2202.06219v1 [cs.CL])
    We present Persian Question Answering Dataset (PQuAD), a crowdsourced reading comprehension dataset on Persian Wikipedia articles. It includes 80,000 questions along with their answers, with 25% of the questions being adversarially unanswerable. We examine various properties of the dataset to show the diversity and the level of its difficulty as an MRC benchmark. By releasing this dataset, we aim to ease research on Persian reading comprehension and development of Persian question answering systems. Our experiments on different state-of-the-art pre-trained contextualized language models show 74.8% Exact Match (EM) and 87.6% F1-score that can be used as the baseline results for further research on Persian QA.  ( 2 min )
    Vital Node Identification in Complex Networks Using a Machine Learning-Based Approach. (arXiv:2202.06229v1 [cs.SI])
    Vital node identification is the problem of finding nodes of highest importance in complex networks. This problem has crucial applications in various contexts such as viral marketing or controlling the propagation of virus or rumours in real-world networks. Existing approaches for vital node identification mainly focus on capturing the importance of a node through a mathematical expression which directly relates structural properties of the node to its vitality. Although these heuristic approaches have achieved good performance in practice, they have weak adaptability, and their performance is limited to specific settings and certain dynamics. Inspired by the power of machine learning models for efficiently capturing different types of patterns and relations, we propose a machine learning-based, data driven approach for vital node identification. The main idea is to train the model with a small portion of the graph, say 0.5% of the nodes, and do the prediction on the rest of the nodes. The ground-truth vitality for the train data is computed by simulating the SIR diffusion method starting from the train nodes. We use collective feature engineering where each node in the network is represented by incorporating elements of its connectivity, degree and extended coreness. Several machine learning models are trained on the node representations, but the best results are achieved by a Support Vector Regression machine with RBF kernel. The empirical results confirms that the proposed model outperforms state-of-the-art models on a selection of datasets, while it also shows more adaptability to changes in the dynamics parameters.  ( 2 min )
    Optimal sizing of a holdout set for safe predictive model updating. (arXiv:2202.06374v1 [stat.ML])
    Risk models are becoming ubiquitous in healthcare and may guide intervention by providing practitioners with insights from patient data. Should a model be updated after a guided intervention, it may lead to its own failure at making predictions. The use of a `holdout set' -- a subset of the population that does not receive interventions guided by the model -- has been proposed to prevent this. Since patients in the holdout set do not benefit from risk predictions, the chosen size must trade off maximising model performance whilst minimising the number of held out patients. By defining a general loss function, we prove the existence and uniqueness of an optimal holdout set size, and introduce parametric and semi-parametric algorithms for its estimation. We demonstrate their use on a recent risk score for pre-eclampsia. Based on these results, we argue that a holdout set is a safe, viable and easily implemented solution to the model update problem.  ( 2 min )
    Active feature selection discovers minimal gene sets for classifying cell types and disease states with single-cell mRNA-seq data. (arXiv:2106.08317v2 [q-bio.GN] UPDATED)
    Sequencing costs currently prohibit the application of single-cell mRNA-seq to many biological and clinical analyses. Targeted single-cell mRNA-sequencing reduces sequencing costs by profiling reduced gene sets that capture biological information with a minimal number of genes. Here, we introduce an active learning method (ActiveSVM) that identifies minimal but highly-informative gene sets that enable the identification of cell-types, physiological states, and genetic perturbations in single-cell data using a small number of genes. Our active feature selection procedure generates minimal gene sets from single-cell data through an iterative cell-type classification task where misclassified cells are examined at each round of analysis to identify maximally informative genes through an `active' support vector machine (ActiveSVM) classifier. By focusing computational resources on misclassified cells, ActiveSVM scales to analyze data sets with over a million single cells. We demonstrate that ActiveSVM feature selection identifies gene sets that enable ~90% cell-type classification accuracy across a variety of data sets including cell atlas and disease characterization data sets. The method generalizes to reveal genes that respond to genetic perturbations and to identify region specific gene expression patterns in spatial transcriptomics data. The discovery of small but highly informative gene sets should enable substantial reductions in the number of measurements necessary for application of single-cell mRNA-seq to clinical tests, therapeutic discovery, and genetic screens.  ( 2 min )
    Robust Deep Semi-Supervised Learning: A Brief Introduction. (arXiv:2202.05975v1 [cs.LG])
    Semi-supervised learning (SSL) is the branch of machine learning that aims to improve learning performance by leveraging unlabeled data when labels are insufficient. Recently, SSL with deep models has proven to be successful on standard benchmark tasks. However, they are still vulnerable to various robustness threats in real-world applications as these benchmarks provide perfect unlabeled data, while in realistic scenarios, unlabeled data could be corrupted. Many researchers have pointed out that after exploiting corrupted unlabeled data, SSL suffers severe performance degradation problems. Thus, there is an urgent need to develop SSL algorithms that could work robustly with corrupted unlabeled data. To fully understand robust SSL, we conduct a survey study. We first clarify a formal definition of robust SSL from the perspective of machine learning. Then, we classify the robustness threats into three categories: i) distribution corruption, i.e., unlabeled data distribution is mismatched with labeled data; ii) feature corruption, i.e., the features of unlabeled examples are adversarially attacked; and iii) label corruption, i.e., the label distribution of unlabeled data is imbalanced. Under this unified taxonomy, we provide a thorough review and discussion of recent works that focus on these issues. Finally, we propose possible promising directions within robust SSL to provide insights for future research.  ( 2 min )
    Adaptive Regret for Control of Time-Varying Dynamics. (arXiv:2007.04393v3 [cs.LG] UPDATED)
    We consider the problem of online control of systems with time-varying linear dynamics. This is a general formulation that is motivated by the use of local linearization in control of nonlinear dynamical systems. To state meaningful guarantees over changing environments, we introduce the metric of {\it adaptive regret} to the field of control. This metric, originally studied in online learning, measures performance in terms of regret against the best policy in hindsight on {\it any interval in time}, and thus captures the adaptation of the controller to changing dynamics. Our main contribution is a novel efficient meta-algorithm: it converts a controller with sublinear regret bounds into one with sublinear {\it adaptive regret} bounds in the setting of time-varying linear dynamical systems. The main technical innovation is the first adaptive regret bound for the more general framework of online convex optimization with memory. Furthermore, we give a lower bound showing that our attained adaptive regret bound is nearly tight for this general framework.  ( 2 min )
    Efficient Natural Gradient Descent Methods for Large-Scale Optimization Problems. (arXiv:2202.06236v1 [math.OC])
    We propose an efficient numerical method for computing natural gradient descent directions with respect to a generic metric in the state space. Our technique relies on representing the natural gradient direction as a solution to a standard least-squares problem. Hence, instead of calculating, storing, or inverting the information matrix directly, we apply efficient methods from numerical linear algebra to solve this least-squares problem. We treat both scenarios where the derivative of the state variable with respect to the parameter is either explicitly known or implicitly given through constraints. We apply the QR decomposition to solve the least-squares problem in the former case and utilize the adjoint-state method to compute the natural gradient descent direction in the latter case. As a result, we can reliably compute several natural gradient descents, including the Wasserstein natural gradient, for a large-scale parameter space with thousands of dimensions, which was believed to be out of reach. Finally, our numerical results shed light on the qualitative differences among the standard gradient descent method and various natural gradient descent methods based on different metric spaces in large-scale nonconvex optimization problems.  ( 2 min )
    StoryBuddy: A Human-AI Collaborative Chatbot for Parent-Child Interactive Storytelling with Flexible Parental Involvement. (arXiv:2202.06205v1 [cs.HC])
    Despite its benefits for children's skill development and parent-child bonding, many parents do not often engage in interactive storytelling by having story-related dialogues with their child due to limited availability or challenges in coming up with appropriate questions. While recent advances made AI generation of questions from stories possible, the fully-automated approach excludes parent involvement, disregards educational goals, and underoptimizes for child engagement. Informed by need-finding interviews and participatory design (PD) results, we developed StoryBuddy, an AI-enabled system for parents to create interactive storytelling experiences. StoryBuddy's design highlighted the need for accommodating dynamic user needs between the desire for parent involvement and parent-child bonding and the goal of minimizing parent intervention when busy. The PD revealed varied assessment and educational goals of parents, which StoryBuddy addressed by supporting configuring question types and tracking child progress. A user study validated StoryBuddy's usability and suggested design insights for future parent-AI collaboration systems.  ( 2 min )
    The Sample Complexity of One-Hidden-Layer Neural Networks. (arXiv:2202.06233v1 [cs.LG])
    We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees (independent of the network width), while a stronger Frobenius norm control is sufficient, extending and improving on previous work. Motivated by the proof constructions, we identify and analyze two important settings where a mere spectral norm control turns out to be sufficient: First, when the network's activation functions are sufficiently smooth (with the result extending to deeper networks); and second, for certain types of convolutional networks. In the latter setting, we study how the sample complexity is additionally affected by parameters such as the amount of overlap between patches and the overall number of patches.  ( 2 min )
    A Tech Hybrid-Recommendation Engine and Personalized Notification: An integrated tool to assist users through Recommendations (Project ATHENA). (arXiv:2202.06248v1 [cs.IR])
    Project ATHENA aims to develop an application to address information overload, primarily focused on Recommendation Systems (RSs) with the personalization and user experience design of a modern system. Two machine learning (ML) algorithms were used: (1) TF-IDF for Content-based filtering (CBF); (2) Classification with Matrix Factorization- Singular Value Decomposition(SVD) applied with Collaborative filtering (CF) and mean (normalization) for prediction accuracy of the CF. Data sampling in academic Research and Development of Philippine Council for Agriculture, Aquatic, and Natural Resources Research and Development (PCAARRD) e-Library and Project SARAI publications plus simulated data used as training sets to generate a recommendation of items that uses the three RS filtering (CF, CBF, and personalized version of item recommendations). Series of Testing and TAM performed and discussed. Findings allow users to engage in online information and quickly evaluate retrieved items produced by the application. Compatibility-testing (CoT) shows the application is compatible with all major browsers and mobile-friendly. Performance-testing (PT) recommended v-parameter specs and TAM evaluations results indicate strongly associated with overall positive feedback, thoroughly enough to address the information-overload problem as the core of the paper. A modular architecture presented addressing the information overload, primarily focused on RSs with the personalization and design of modern systems. Developers utilized Two ML algorithms and prototyped a simplified version of the architecture. Series of testing (CoT and PT) and evaluations with TAM were performed and discussed. Project ATHENA added a UX feature design of a modern system.  ( 2 min )
    FairStyle: Debiasing StyleGAN2 with Style Channel Manipulations. (arXiv:2202.06240v1 [cs.CV])
    Recent advances in generative adversarial networks have shown that it is possible to generate high-resolution and hyperrealistic images. However, the images produced by GANs are only as fair and representative as the datasets on which they are trained. In this paper, we propose a method for directly modifying a pre-trained StyleGAN2 model that can be used to generate a balanced set of images with respect to one (e.g., eyeglasses) or more attributes (e.g., gender and eyeglasses). Our method takes advantage of the style space of the StyleGAN2 model to perform disentangled control of the target attributes to be debiased. Our method does not require training additional models and directly debiases the GAN model, paving the way for its use in various downstream applications. Our experiments show that our method successfully debiases the GAN model within a few minutes without compromising the quality of the generated images. To promote fair generative models, we share the code and debiased models at this http URL  ( 2 min )
    Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks. (arXiv:1905.12917v3 [cs.LG] UPDATED)
    While tasks could come with varying the number of instances and classes in realistic settings, the existing meta-learning approaches for few-shot classification assume that the number of instances per task and class is fixed. Due to such restriction, they learn to equally utilize the meta-knowledge across all the tasks, even when the number of instances per task and class largely varies. Moreover, they do not consider distributional difference in unseen tasks, on which the meta-knowledge may have less usefulness depending on the task relatedness. To overcome these limitations, we propose a novel meta-learning model that adaptively balances the effect of the meta-learning and task-specific learning within each task. Through the learning of the balancing variables, we can decide whether to obtain a solution by relying on the meta-knowledge or task-specific learning. We formulate this objective into a Bayesian inference framework and tackle it using variational inference. We validate our Bayesian Task-Adaptive Meta-Learning (Bayesian TAML) on multiple realistic task- and class-imbalanced datasets, on which it significantly outperforms existing meta-learning approaches. Further ablation study confirms the effectiveness of each balancing component and the Bayesian learning framework.  ( 2 min )
    Near-optimal Local Convergence of Alternating Gradient Descent-Ascent for Minimax Optimization. (arXiv:2102.09468v3 [cs.LG] UPDATED)
    Smooth minimax games often proceed by simultaneous or alternating gradient updates. Although algorithms with alternating updates are commonly used in practice, the majority of existing theoretical analyses focus on simultaneous algorithms for convenience of analysis. In this paper, we study alternating gradient descent-ascent (Alt-GDA) in minimax games and show that Alt-GDA is superior to its simultaneous counterpart~(Sim-GDA) in many settings. We prove that Alt-GDA achieves a near-optimal local convergence rate for strongly convex-strongly concave (SCSC) problems while Sim-GDA converges at a much slower rate. To our knowledge, this is the \emph{first} result of any setting showing that Alt-GDA converges faster than Sim-GDA by more than a constant. We further adapt the theory of integral quadratic constraints (IQC) and show that Alt-GDA attains the same rate \emph{globally} for a subclass of SCSC minimax problems. Empirically, we demonstrate that alternating updates speed up GAN training significantly and the use of optimism only helps for simultaneous algorithms.  ( 2 min )
    Goal Recognition as Reinforcement Learning. (arXiv:2202.06356v1 [cs.AI])
    Most approaches for goal recognition rely on specifications of the possible dynamics of the actor in the environment when pursuing a goal. These specifications suffer from two key issues. First, encoding these dynamics requires careful design by a domain expert, which is often not robust to noise at recognition time. Second, existing approaches often need costly real-time computations to reason about the likelihood of each potential goal. In this paper, we develop a framework that combines model-free reinforcement learning and goal recognition to alleviate the need for careful, manual domain design, and the need for costly online executions. This framework consists of two main stages: Offline learning of policies or utility functions for each potential goal, and online inference. We provide a first instance of this framework using tabular Q-learning for the learning stage, as well as three measures that can be used to perform the inference stage. The resulting instantiation achieves state-of-the-art performance against goal recognizers on standard evaluation domains and superior performance in noisy environments.  ( 2 min )
    Generalization Bounds and Representation Learning for Estimation of Potential Outcomes and Causal Effects. (arXiv:2001.07426v3 [cs.LG] UPDATED)
    Practitioners in diverse fields such as healthcare, economics and education are eager to apply machine learning to improve decision making. The cost and impracticality of performing experiments and a recent monumental increase in electronic record keeping has brought attention to the problem of evaluating decisions based on non-experimental observational data. This is the setting of this work. In particular, we study estimation of individual-level causal effects, such as a single patient's response to alternative medication, from recorded contexts, decisions and outcomes. We give generalization bounds on the error in estimated effects based on distance measures between groups receiving different treatments, allowing for sample re-weighting. We provide conditions under which our bound is tight and show how it relates to results for unsupervised domain adaptation. Led by our theoretical results, we devise representation learning algorithms that minimize our bound, by regularizing the representation's induced treatment group distance, and encourage sharing of information between treatment groups. We extend these algorithms to simultaneously learn a weighted representation to further reduce treatment group distances. Finally, an experimental evaluation on real and synthetic data shows the value of our proposed representation architecture and regularization scheme.  ( 2 min )
    Geometric Graph Representation Learning via Maximizing Rate Reduction. (arXiv:2202.06241v1 [cs.LG])
    Learning discriminative node representations benefits various downstream tasks in graph analysis such as community detection and node classification. Existing graph representation learning methods (e.g., based on random walk and contrastive learning) are limited to maximizing the local similarity of connected nodes. Such pair-wise learning schemes could fail to capture the global distribution of representations, since it has no explicit constraints on the global geometric properties of representation space. To this end, we propose Geometric Graph Representation Learning (G2R) to learn node representations in an unsupervised manner via maximizing rate reduction. In this way, G2R maps nodes in distinct groups (implicitly stored in the adjacency matrix) into different subspaces, while each subspace is compact and different subspaces are dispersedly distributed. G2R adopts a graph neural network as the encoder and maximizes the rate reduction with the adjacency matrix. Furthermore, we theoretically and empirically demonstrate that rate reduction maximization is equivalent to maximizing the principal angles between different subspaces. Experiments on real-world datasets show that G2R outperforms various baselines on node classification and community detection tasks.  ( 2 min )
    Stochastic Client Selection for Federated Learning with Volatile Clients. (arXiv:2011.08756v3 [cs.LG] UPDATED)
    Federated Learning (FL), arising as a privacy-preserving machine learning paradigm, has received notable attention from the public. In each round of synchronous FL training, only a fraction of available clients are chosen to participate, and the selection decision might have a significant effect on the training efficiency, as well as the final model performance. In this paper, we investigate the client selection problem under a volatile context, in which the local training of heterogeneous clients is likely to fail due to various kinds of reasons and in different levels of frequency. {\color{black}Intuitively, too much training failure might potentially reduce the training efficiency, while too much selection on clients with greater stability might introduce bias, thereby resulting in degradation of the training effectiveness. To tackle this tradeoff, we in this paper formulate the client selection problem under joint consideration of effective participation and fairness.} Further, we propose E3CS, a stochastic client selection scheme to solve the problem, and we corroborate its effectiveness by conducting real data-based experiments. According to our experimental results, the proposed selection scheme is able to achieve up to 2x faster convergence to a fixed model accuracy while maintaining the same level of final model accuracy, compared with the state-of-the-art selection schemes.  ( 2 min )
    Meta Dropout: Learning to Perturb Features for Generalization. (arXiv:1905.12914v3 [cs.LG] UPDATED)
    A machine learning model that generalizes well should obtain low errors on unseen test examples. Thus, if we know how to optimally perturb training examples to account for test examples, we may achieve better generalization performance. However, obtaining such perturbation is not possible in standard machine learning frameworks as the distribution of the test data is unknown. To tackle this challenge, we propose a novel regularization method, meta-dropout, which learns to perturb the latent features of training examples for generalization in a meta-learning framework. Specifically, we meta-learn a noise generator which outputs a multiplicative noise distribution for latent features, to obtain low errors on the test instances in an input-dependent manner. Then, the learned noise generator can perturb the training examples of unseen tasks at the meta-test time for improved generalization. We validate our method on few-shot classification datasets, whose results show that it significantly improves the generalization performance of the base model, and largely outperforms existing regularization methods such as information bottleneck, manifold mixup, and information dropout.  ( 2 min )
    Local approximation of operators. (arXiv:2202.06392v1 [math.NA])
    Many applications, such as system identification, classification of time series, direct and inverse problems in partial differential equations, and uncertainty quantification lead to the question of approximation of a non-linear operator between metric spaces $\mathfrak{X}$ and $\mathfrak{Y}$. We study the problem of determining the degree of approximation of a such operators on a compact subset $K_\mathfrak{X}\subset \mathfrak{X}$ using a finite amount of information. If $\mathcal{F}: K_\mathfrak{X}\to K_\mathfrak{Y}$, a well established strategy to approximate $\mathcal{F}(F)$ for some $F\in K_\mathfrak{X}$ is to encode $F$ (respectively, $\mathcal{F}(F)$) in terms of a finite number $d$ (repectively $m$) of real numbers. Together with appropriate reconstruction algorithms (decoders), the problem reduces to the approximation of $m$ functions on a compact subset of a high dimensional Euclidean space $\mathbb{R}^d$, equivalently, the unit sphere $\mathbb{S}^d$ embedded in $\mathbb{R}^{d+1}$. The problem is challenging because $d$, $m$, as well as the complexity of the approximation on $\mathbb{S}^d$ are all large, and it is necessary to estimate the accuracy keeping track of the inter-dependence of all the approximations involved. In this paper, we establish constructive methods to do this efficiently; i.e., with the constants involved in the estimates on the approximation on $\\mathbb{S}^d$ being $\mathcal{O}(d^{1/6})$. We study different smoothness classes for the operators, and also propose a method for approximation of $\mathcal{F}(F)$ using only information in a small neighborhood of $F$, resulting in an effective reduction in the number of parameters involved. To further mitigate the problem of large number of parameters, we propose prefabricated networks, resulting in a substantially smaller number of effective parameters.  ( 2 min )
    Individual-Level Inverse Reinforcement Learning for Mean Field Games. (arXiv:2202.06401v1 [cs.LG])
    The recent mean field game (MFG) formalism has enabled the application of inverse reinforcement learning (IRL) methods in large-scale multi-agent systems, with the goal of inferring reward signals that can explain demonstrated behaviours of large populations. The existing IRL methods for MFGs are built upon reducing an MFG to a Markov decision process (MDP) defined on the collective behaviours and average rewards of the population. However, this paper reveals that the reduction from MFG to MDP holds only for the fully cooperative setting. This limitation invalidates existing IRL methods on MFGs with non-cooperative environments. To measure more general behaviours in large populations, we study the use of individual behaviours to infer ground-truth reward functions for MFGs. We propose Mean Field IRL (MFIRL), the first dedicated IRL framework for MFGs that can handle both cooperative and non-cooperative environments. Based on this theoretically justified framework, we develop a practical algorithm effective for MFGs with unknown dynamics. We evaluate MFIRL on both cooperative and mixed cooperative-competitive scenarios with many agents. Results demonstrate that MFIRL excels in reward recovery, sample efficiency and robustness in the face of changing dynamics.
    Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data. (arXiv:2202.05928v1 [cs.LG])
    Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We assume the data comes from well-separated class-conditional log-concave distributions and allow for a constant fraction of the training labels to be corrupted by an adversary. We show that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve test error close to the Bayes-optimal error. In contrast to previous work on benign overfitting that require linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.  ( 2 min )
    Learning long-term music representations via hierarchical contextual constraints. (arXiv:2202.06180v1 [cs.SD])
    Learning symbolic music representations, especially disentangled representations with probabilistic interpretations, has been shown to benefit both music understanding and generation. However, most models are only applicable to short-term music, while learning long-term music representations remains a challenging task. We have seen several studies attempting to learn hierarchical representations directly in an end-to-end manner, but these models have not been able to achieve the desired results and the training process is not stable. In this paper, we propose a novel approach to learn long-term symbolic music representations through contextual constraints. First, we use contrastive learning to pre-train a long-term representation by constraining its difference from the short-term representation (extracted by an off-the-shelf model). Then, we fine-tune the long-term representation by a hierarchical prediction model such that a good long-term representation (e.g., an 8-bar representation) can reconstruct the corresponding short-term ones (e.g., the 2-bar representations within the 8-bar range). Experiments show that our method stabilizes the training and the fine-tuning steps. In addition, the designed contextual constraints benefit both reconstruction and disentanglement, significantly outperforming the baselines.  ( 2 min )
    Stochastic Multi-level Composition Optimization Algorithms with Level-Independent Convergence Rates. (arXiv:2008.10526v5 [math.OC] UPDATED)
    In this paper, we study smooth stochastic multi-level composition optimization problems, where the objective function is a nested composition of $T$ functions. We assume access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle. For solving this class of problems, we propose two algorithms using moving-average stochastic estimates, and analyze their convergence to an $\epsilon$-stationary point of the problem. We show that the first algorithm, which is a generalization of \cite{GhaRuswan20} to the $T$ level case, can achieve a sample complexity of $\mathcal{O}(1/\epsilon^6)$ by using mini-batches of samples in each iteration. By modifying this algorithm using linearized stochastic estimates of the function values, we improve the sample complexity to $\mathcal{O}(1/\epsilon^4)$. {\color{black}This modification not only removes the requirement of having a mini-batch of samples in each iteration, but also makes the algorithm parameter-free and easy to implement}. To the best of our knowledge, this is the first time that such an online algorithm designed for the (un)constrained multi-level setting, obtains the same sample complexity of the smooth single-level setting, under standard assumptions (unbiasedness and boundedness of the second moments) on the stochastic first-order oracle.  ( 2 min )
    A Deep Learning approach to Reduced Order Modelling of Parameter Dependent Partial Differential Equations. (arXiv:2103.06183v2 [math.NA] UPDATED)
    Within the framework of parameter dependent PDEs, we develop a constructive approach based on Deep Neural Networks for the efficient approximation of the parameter-to-solution map. The research is motivated by the limitations and drawbacks of state-of-the-art algorithms, such as the Reduced Basis method, when addressing problems that show a slow decay in the Kolmogorov n-width. Our work is based on the use of deep autoencoders, which we employ for encoding and decoding a high fidelity approximation of the solution manifold. To provide guidelines for the design of deep autoencoders, we consider a nonlinear version of the Kolmogorov n-width over which we base the concept of a minimal latent dimension. We show that the latter is intimately related to the topological properties of the solution manifold, and we provide theoretical results with particular emphasis on second order elliptic PDEs, characterizing the minimal dimension and the approximation errors of the proposed approach. The theory presented is further supported by numerical experiments, where we compare the proposed approach with classical POD-Galerkin reduced order models. In particular, we consider parametrized advection-diffusion PDEs, and we test the methodology in the presence of strong transport fields, singular terms and stochastic coefficients.  ( 2 min )
    Fine-Grained Population Mobility Data-Based Community-Level COVID-19 Prediction Model. (arXiv:2202.06257v1 [cs.LG])
    Predicting the number of infections in the anti-epidemic process is extremely beneficial to the government in developing anti-epidemic strategies, especially in fine-grained geographic units. Previous works focus on low spatial resolution prediction, e.g., county-level, and preprocess data to the same geographic level, which loses some useful information. In this paper, we propose a fine-grained population mobility data-based model (FGC-COVID) utilizing data of two geographic levels for community-level COVID-19 prediction. We use the population mobility data between Census Block Groups (CBGs), which is a finer-grained geographic level than community, to build the graph and capture the dependencies between CBGs using graph neural networks (GNNs). To mine as finer-grained patterns as possible for prediction, a spatial weighted aggregation module is introduced to aggregate the embeddings of CBGs to community level based on their geographic affiliation and spatial autocorrelation. Extensive experiments on 300 days LA city COVID-19 data indicate our model outperforms existing forecasting models on community-level COVID-19 prediction.  ( 2 min )
    fairmodels: A Flexible Tool For Bias Detection, Visualization, And Mitigation. (arXiv:2104.00507v2 [stat.ML] UPDATED)
    Machine learning decision systems are getting omnipresent in our lives. From dating apps to rating loan seekers, algorithms affect both our well-being and future. Typically, however, these systems are not infallible. Moreover, complex predictive models are really eager to learn social biases present in historical data that can lead to increasing discrimination. If we want to create models responsibly then we need tools for in-depth validation of models also from the perspective of potential discrimination. This article introduces an R package fairmodels that helps to validate fairness and eliminate bias in classification models in an easy and flexible fashion. The fairmodels package offers a model-agnostic approach to bias detection, visualization and mitigation. The implemented set of functions and fairness metrics enables model fairness validation from different perspectives. The package includes a series of methods for bias mitigation that aim to diminish the discrimination in the model. The package is designed not only to examine a single model, but also to facilitate comparisons between multiple models.  ( 2 min )
    Graph-adaptive Rectified Linear Unit for Graph Neural Networks. (arXiv:2202.06281v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved remarkable success by extending traditional convolution to learning on non-Euclidean data. The key to the GNNs is adopting the neural message-passing paradigm with two stages: aggregation and update. The current design of GNNs considers the topology information in the aggregation stage. However, in the updating stage, all nodes share the same updating function. The identical updating function treats each node embedding as i.i.d. random variables and thus ignores the implicit relationships between neighborhoods, which limits the capacity of the GNNs. The updating function is usually implemented with a linear transformation followed by a non-linear activation function. To make the updating function topology-aware, we inject the topological information into the non-linear activation function and propose Graph-adaptive Rectified Linear Unit (GReLU), which is a new parametric activation function incorporating the neighborhood information in a novel and efficient way. The parameters of GReLU are obtained from a hyperfunction based on both node features and the corresponding adjacent matrix. To reduce the risk of overfitting and the computational cost, we decompose the hyperfunction as two independent components for nodes and features respectively. We conduct comprehensive experiments to show that our plug-and-play GReLU method is efficient and effective given different GNN backbones and various downstream tasks.  ( 2 min )
    Information Bottleneck: Exact Analysis of (Quantized) Neural Networks. (arXiv:2106.12912v2 [cs.LG] UPDATED)
    The information bottleneck (IB) principle has been suggested as a way to analyze deep neural networks. The learning dynamics are studied by inspecting the mutual information (MI) between the hidden layers and the input and output. Notably, separate fitting and compression phases during training have been reported. This led to some controversy including claims that the observations are not reproducible and strongly dependent on the type of activation function used as well as on the way the MI is estimated. Our study confirms that different ways of binning when computing the MI lead to qualitatively different results, either supporting or refusing IB conjectures. To resolve the controversy, we study the IB principle in settings where MI is non-trivial and can be computed exactly. We monitor the dynamics of quantized neural networks, that is, we discretize the whole deep learning system so that no approximation is required when computing the MI. This allows us to quantify the information flow without measurement errors. In this setting, we observed a fitting phase for all layers and a compression phase for the output layer in all experiments; the compression in the hidden layers was dependent on the type of activation function. Our study shows that the initial IB results were not artifacts of binning when computing the MI. However, the critical claim that the compression phase may not be observed for some networks also holds true.
    From Online Optimization to PID Controllers: Mirror Descent with Momentum. (arXiv:2202.06152v1 [math.OC])
    We study a family of first-order methods with momentum based on mirror descent for online convex optimization, which we dub online mirror descent with momentum (OMDM). Our algorithms include as special cases gradient descent and exponential weights update with momentum. We provide a new and simple analysis of momentum-based methods in a stochastic setting that yields a regret bound that decreases as momentum increases. This immediately establishes that momentum can help in the convergence of stochastic subgradient descent in convex nonsmooth optimization. We showcase the robustness of our algorithm by also providing an analysis in an adversarial setting that gives the first non-trivial regret bounds for OMDM. Our work aims to provide a better understanding of the benefits of momentum-based methods, which despite their recent empirical success, is incomplete. Finally, we discuss how OMDM can be applied to stochastic online allocation problems, which are central problems in computer science and operations research. In doing so, we establish an important connection between OMDM and popular approaches from optimal control such as PID controllers, thereby providing regret bounds on the performance of PID controllers. The improvements of momentum are most pronounced when the step-size is large, thereby indicating that momentum provides a robustness to misspecification of tuning parameters. We provide a numerical evaluation that verifies the robustness of our algorithms.
    Online Adaptation of Neural Network Models by Modified Extended Kalman Filter for Customizable and Transferable Driving Behavior Prediction. (arXiv:2112.06129v2 [cs.LG] UPDATED)
    High fidelity behavior prediction of human drivers is crucial for efficient and safe deployment of autonomous vehicles, which is challenging due to the stochasticity, heterogeneity, and time-varying nature of human behaviors. On one hand, the trained prediction model can only capture the motion pattern in an average sense, while the nuances among individuals can hardly be reflected. On the other hand, the prediction model trained on the training set may not generalize to the testing set which may be in a different scenario or data distribution, resulting in low transferability and generalizability. In this paper, we applied a $\tau$-step modified Extended Kalman Filter parameter adaptation algorithm (MEKF$_\lambda$) to the driving behavior prediction task, which has not been studied before in literature. With the feedback of the observed trajectory, the algorithm is applied to neural-network-based models to improve the performance of driving behavior predictions across different human subjects and scenarios. A new set of metrics is proposed for systematic evaluation of online adaptation performance in reducing the prediction error for different individuals and scenarios. Empirical studies on the best layer in the model and steps of observation to adapt are also provided.
    The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers. (arXiv:2108.12284v4 [cs.LG] UPDATED)
    Recently, many datasets have been proposed to test the systematic generalization ability of neural networks. The companion baseline Transformers, typically trained with default hyper-parameters from standard tasks, are shown to fail dramatically. Here we demonstrate that by revisiting model configurations as basic as scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants, we can drastically improve the performance of Transformers on systematic generalization. We report improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics dataset. Our models improve accuracy from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS. On SCAN, relative positional embedding largely mitigates the EOS decision problem (Newman et al., 2020), yielding 100% accuracy on the length split with a cutoff at 26. Importantly, performance differences between these models are typically invisible on the IID data split. This calls for proper generalization validation sets for developing neural networks that generalize systematically. We publicly release the code to reproduce our results.
    How and Why to Manipulate Your Own Agent. (arXiv:2112.07640v2 [cs.GT] UPDATED)
    We consider strategic settings where several users engage in a repeated online interaction, assisted by regret-minimizing learning agents that repeatedly play a "game" on their behalf. We study the dynamics and average outcomes of the repeated game of the agents, and propose to view it as inducing a "meta-game" between the users. Our main focus is on whether users can benefit in this meta-game from "manipulating" their own agents by misreporting their parameters to them. We formally define the model of these meta-games between the users for general games and analyze the equilibria induced on the users in two classes of games in which the time-average of all regret-minimizing dynamics converge to a single equilibrium.
    Transfer RL across Observation Feature Spaces via Model-Based Regularization. (arXiv:2201.00248v2 [cs.LG] UPDATED)
    In many reinforcement learning (RL) applications, the observation space is specified by human developers and restricted by physical realizations, and may thus be subject to dramatic changes over time (e.g. increased number of observable features). However, when the observation space changes, the previous policy will likely fail due to the mismatch of input features, and another policy must be trained from scratch, which is inefficient in terms of computation and sample complexity. Following theoretical insights, we propose a novel algorithm which extracts the latent-space dynamics in the source task, and transfers the dynamics model to the target task to use as a model-based regularizer. Our algorithm works for drastic changes of observation space (e.g. from vector-based observation to image-based observation), without any inter-task mapping or any prior knowledge of the target task. Empirical results show that our algorithm significantly improves the efficiency and stability of learning in the target task.
    Mirror Learning: A Unifying Framework of Policy Optimisation. (arXiv:2201.02373v8 [cs.LG] UPDATED)
    Modern deep reinforcement learning (RL) algorithms are motivated by either the general policy improvement (GPI) or trust-region learning (TRL) frameworks. However, algorithms that strictly respect these theoretical frameworks have proven unscalable. Surprisingly, the only known scalable algorithms violate the GPI/TRL assumptions, e.g. due to required regularisation or other heuristics. The current explanation of their empirical success is essentially "by analogy": they are deemed approximate adaptations of theoretically sound methods. Unfortunately, studies have shown that in practice these algorithms differ greatly from their conceptual ancestors. In contrast, in this paper we introduce a novel theoretical framework, named Mirror Learning, which provides theoretical guarantees to a large class of algorithms, including TRPO and PPO. While the latter two exploit the flexibility of our framework, GPI and TRL fit in merely as pathologically restrictive corner cases thereof. This suggests that the empirical performance of state-of-the-art methods is a direct consequence of their theoretical properties, rather than of aforementioned approximate analogies. Mirror learning sets us free to boldly explore novel, theoretically sound RL algorithms, a thus far uncharted wonderland.
    Few-shot Quality-Diversity Optimization. (arXiv:2109.06826v2 [cs.LG] UPDATED)
    In the past few years, a considerable amount of research has been dedicated to the exploitation of previous learning experiences and the design of Few-shot and Meta Learning approaches, in problem domains ranging from Computer Vision to Reinforcement Learning based control. A notable exception, where to the best of our knowledge, little to no effort has been made in this direction is Quality-Diversity (QD) optimization. QD methods have been shown to be effective tools in dealing with deceptive minima and sparse rewards in Reinforcement Learning. However, they remain costly due to their reliance on inherently sample inefficient evolutionary processes. We show that, given examples from a task distribution, information about the paths taken by optimization in parameter space can be leveraged to build a prior population, which when used to initialize QD methods in unseen environments, allows for few-shot adaptation. Our proposed method does not require backpropagation. It is simple to implement and scale, and furthermore, it is agnostic to the underlying models that are being trained. Experiments carried in both sparse and dense reward settings using robotic manipulation and navigation benchmarks show that it considerably reduces the number of generations that are required for QD optimization in these environments.
    Privacy-Aware Communication Over a Wiretap Channel with Generative Networks. (arXiv:2110.04094v2 [cs.IT] UPDATED)
    We study privacy-aware communication over a wiretap channel using end-to-end learning. Alice wants to transmit a source signal to Bob over a binary symmetric channel, while passive eavesdropper Eve tries to infer some sensitive attribute of Alice's source based on its overheard signal. Since we usually do not have access to true distributions, we propose a data-driven approach using variational autoencoder (VAE)-based joint source channel coding (JSCC). We show through simulations with the colored MNIST dataset that our approach provides high reconstruction quality at the receiver while confusing the eavesdropper about the latent sensitive attribute, which consists of the color and thickness of the digits. Finally, we consider a parallel-channel scenario, and show that our approach arranges the information transmission such that the channels with higher noise levels at the eavesdropper carry the sensitive information, while the non-sensitive information is transmitted over more vulnerable channels.
    Deep Integrated Pipeline of Segmentation Guided Classification of Breast Cancer from Ultrasound Images. (arXiv:2110.14013v2 [eess.IV] UPDATED)
    Breast cancer has become a symbol of tremendous concern in the modern world, as it is one of the major causes of cancer mortality worldwide. In this regard, breast ultrasonography images are frequently utilized by doctors to diagnose breast cancer at an early stage. However, the complex artifacts and heavily noised breast ultrasonography images make diagnosis a great challenge. Furthermore, the ever-increasing number of patients being screened for breast cancer necessitates the use of automated end-to-end technology for highly accurate diagnosis at a low cost and in a short time. In this concern, to develop an end-to-end integrated pipeline for breast ultrasonography image classification, we conducted an exhaustive analysis of image preprocessing methods such as K Means++ and SLIC, as well as four transfer learning models such as VGG16, VGG19, DenseNet121, and ResNet50. With a Dice-coefficient score of 63.4 in the segmentation stage and accuracy and an F1-Score (Benign) of 73.72 percent and 78.92 percent in the classification stage, the combination of SLIC, UNET, and VGG16 outperformed all other integrated combinations. Finally, we have proposed an end to end integrated automated pipelining framework which includes preprocessing with SLIC to capture super-pixel features from the complex artifact of ultrasonography images, complementing semantic segmentation with modified U-Net, leading to breast tumor classification using a transfer learning approach with a pre-trained VGG16 and a densely connected neural network. The proposed automated pipeline can be effectively implemented to assist medical practitioners in making more accurate and timely diagnoses of breast cancer.
    Model Free Barrier Functions via Implicit Evading Maneuvers. (arXiv:2107.12871v2 [cs.LG] UPDATED)
    This paper demonstrates that the safety override arising from the use of a barrier function can in some cases be needlessly restrictive. In particular, we examine the case of fixed wing collision avoidance and show that when using a barrier function, there are cases where two fixed wing aircraft can come closer to colliding than if there were no barrier function at all. In addition, we construct cases where the barrier function labels the system as unsafe even when the vehicles start arbitrarily far apart. In other words, the barrier function ensures safety but with unnecessary costs to performance. We therefore introduce model free barrier functions which take a data driven approach to creating a barrier function. We demonstrate the effectiveness of model free barrier functions in a collision avoidance simulation of two fixed-wing aircraft.
    Relaxing the Feature Covariance Assumption: Time-Variant Bounds for Benign Overfitting in Linear Regression. (arXiv:2202.06054v1 [cs.LG])
    Benign overfitting demonstrates that overparameterized models can perform well on test data while fitting noisy training data. However, it only considers the final min-norm solution in linear regression, which ignores the algorithm information and the corresponding training procedure. In this paper, we generalize the idea of benign overfitting to the whole training trajectory instead of the min-norm solution and derive a time-variant bound based on the trajectory analysis. Starting from the time-variant bound, we further derive a time interval that suffices to guarantee a consistent generalization error for a given feature covariance. Unlike existing approaches, the newly proposed generalization bound is characterized by a time-variant effective dimension of feature covariance. By introducing the time factor, we relax the strict assumption on the feature covariance matrix required in previous benign overfitting under the regimes of overparameterized linear regression with gradient descent. This paper extends the scope of benign overfitting, and experiment results indicate that the proposed bound accords better with empirical evidence.  ( 2 min )
    Exploring Beyond-Demonstrator via Meta Learning-Based Reward Extrapolation. (arXiv:2102.02454v11 [cs.LG] UPDATED)
    Extrapolating beyond-demonstrator (BD) performance through the imitation learning (IL) algorithm aims to learn from and subsequently outperform the demonstrator. To that end, a representative approach is to leverage inverse reinforcement learning (IRL) to infer a reward function from demonstrations before performing RL on the learned reward function. However, most existing reward extrapolation methods require massive demonstrations, making it difficult to be applied in tasks of limited training data. To address this problem, one simple solution is to perform data augmentation to artificially generate more training data, which may incur severe inductive bias and policy performance loss. In this paper, we propose a novel meta learning-based reward extrapolation (MLRE) algorithm, which can effectively approximate the ground-truth rewards using limited demonstrations. More specifically, MLRE first learns an initial reward function from a set of tasks that have abundant training data. Then the learned reward function will be fine-tuned using data of the target task. Extensive simulation results demonstrated that the proposed MLRE can achieve impressive performance improvement as compared to other similar BDIL algorithms.
    Compositional Attention: Disentangling Search and Retrieval. (arXiv:2110.09419v2 [cs.LG] UPDATED)
    Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interactions, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible and context-dependent manner through an additional soft competition stage between the query-key combination and value pairing. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily be implemented in lieu of standard attention heads in any network architecture.
    Compressed Gradient Tracking for Decentralized Optimization Over General Directed Networks. (arXiv:2106.07243v2 [math.OC] UPDATED)
    In this paper, we propose two communication efficient decentralized optimization algorithms over a general directed multi-agent network. The first algorithm, termed Compressed Push-Pull (CPP), combines the gradient tracking Push-Pull method with communication compression. We show that CPP is applicable to a general class of unbiased compression operators and achieves linear convergence rate for strongly convex and smooth objective functions. The second algorithm is a broadcast-like version of CPP (B-CPP), and it also achieves linear convergence rate under the same conditions on the objective functions. B-CPP can be applied in an asynchronous broadcast setting and further reduce communication costs compared to CPP. Numerical experiments complement the theoretical analysis and confirm the effectiveness of the proposed methods.
    Scheduling Techniques for Liver Segmentation: ReduceLRonPlateau Vs OneCycleLR. (arXiv:2202.06373v1 [cs.CV])
    Machine learning and computer vision techniques have influenced many fields including the biomedical one. The aim of this paper is to investigate the important concept of schedulers in manipulating the learning rate (LR), for the liver segmentation task, throughout the training process, focusing on the newly devised OneCycleLR against the ReduceLRonPlateau. A dataset, published in 2018 and produced by the Medical Segmentation Decathlon Challenge organizers, called Task 8 Hepatic Vessel (MSDC-T8) has been used for testing and validation. The reported results that have the same number of maximum epochs (75), and are the average of 5-fold cross-validation, indicate that ReduceLRonPlateau converges faster while maintaining a similar or even better loss score on the validation set when compared to OneCycleLR. The epoch at which the peak LR occurs perhaps should be made early for the OneCycleLR such that the super-convergence feature can be observed. Moreover, the overall results outperform the state-of-the-art results from the researchers who published the liver masks for this dataset. To conclude, both schedulers are suitable for medical segmentation challenges, especially the MSDC-T8 dataset, and can be used confidently in rapidly converging the validation loss with a minimal number of epochs.  ( 2 min )
    Graph Neural Networks in Recommender Systems: A Survey. (arXiv:2011.02260v3 [cs.IR] UPDATED)
    With the explosive growth of online information, recommender systems play a key role to alleviate such information overload. Due to the important application value of recommender systems, there have always been emerging works in this field. In recommender systems, the main challenge is to learn the effective user/item representations from their interactions and side information (if any). Recently, graph neural network (GNN) techniques have been widely utilized in recommender systems since most of the information in recommender systems essentially has graph structure and GNN has superiority in graph representation learning. This article aims to provide a comprehensive review of recent research efforts on GNN-based recommender systems. Specifically, we provide a taxonomy of GNN-based recommendation models according to the types of information used and recommendation tasks. Moreover, we systematically analyze the challenges of applying GNN on different types of data and discuss how existing works in this field address these challenges. Furthermore, we state new perspectives pertaining to the development of this field. We collect the representative papers along with their open-source implementations in https://github.com/wusw14/GNN-in-RS.
    Evolving Neural Networks with Optimal Balance between Information Flow and Connections Cost. (arXiv:2202.06163v1 [cs.NE])
    Evolving Neural Networks (NNs) has recently seen an increasing interest as an alternative path that might be more successful. It has many advantages compared to other approaches, such as learning the architecture of the NNs. However, the extremely large search space and the existence of many complex interacting parts still represent a major obstacle. Many criteria were recently investigated to help guide the algorithm and to cut down the large search space. Recently there has been growing research bringing insights from network science to improve the design of NNs. In this paper, we investigate evolving NNs architectures that have one of the most fundamental characteristics of real-world networks, namely the optimal balance between connections cost and information flow. The performance of different metrics that represent this balance is evaluated and the improvement in the accuracy of putting more selection pressure toward this balance is demonstrated on three datasets.
    Metric Learning-enhanced Optimal Transport for Biochemical Regression Domain Adaptation. (arXiv:2202.06208v1 [cs.LG])
    Generalizing knowledge beyond source domains is a crucial prerequisite for many biomedical applications such as drug design and molecular property prediction. To meet this challenge, researchers have used optimal transport (OT) to perform representation alignment between the source and target domains. Yet existing OT algorithms are mainly designed for classification tasks. Accordingly, we consider regression tasks in the unsupervised and semi-supervised settings in this paper. To exploit continuous labels, we propose novel metrics to measure domain distances and introduce a posterior variance regularizer on the transport plan. Further, while computationally appealing, OT suffers from ambiguous decision boundaries and biased local data distributions brought by the mini-batch training. To address those issues, we propose to couple OT with metric learning to yield more robust boundaries and reduce bias. Specifically, we present a dynamic hierarchical triplet loss to describe the global data distribution, where the cluster centroids are progressively adjusted among consecutive iterations. We evaluate our method on both unsupervised and semi-supervised learning tasks in biochemistry. Experiments show the proposed method significantly outperforms state-of-the-art baselines across various benchmark datasets of small molecules and material crystals.
    Incorporating Kinematic Wave Theory into a Deep Learning Method for High-Resolution Traffic Speed Estimation. (arXiv:2102.02906v2 [cs.LG] UPDATED)
    We propose a kinematic wave-based Deep Convolutional Neural Network (Deep CNN) to estimate high-resolution traffic speed fields from sparse probe vehicle trajectories. We introduce two key approaches that allow us to incorporate kinematic wave theory principles to improve the robustness of existing learning-based estimation methods. First, we propose an anisotropic traffic kernel for the Deep CNN. The anisotropic kernel explicitly accounts for space-time correlations in macroscopic traffic and effectively reduces the number of trainable parameters in the Deep CNN model. Second, we propose to use simulated data for training the Deep CNN. Using a targeted simulated data for training provides an implicit way to impose desirable traffic physical features on the learning model. In the experiments, we highlight the benefits of using anisotropic kernels and evaluate the transferability of the trained model to real-world traffic using the Next Generation Simulation (NGSIM) and the German Highway Drone (HighD) datasets. The results demonstrate that anisotropic kernels significantly reduce model complexity and model over-fitting, and improve the physical correctness of the estimated speed fields. We find that model complexity scales linearly with problem size for anisotropic kernels compared to quadratic scaling for isotropic kernels. Furthermore, evaluation on real-world datasets shows acceptable performance, which establishes that simulation-based training is a viable surrogate to learning from real-world data. Finally, a comparison with standard estimation techniques shows the superior estimation accuracy of the proposed method.
    Off-Policy Evaluation for Large Action Spaces via Embeddings. (arXiv:2202.06317v1 [cs.LG])
    Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.
    Handwritten Mathematical Expression Recognition via Attention Aggregation based Bi-directional Mutual Learning. (arXiv:2112.03603v2 [cs.CV] UPDATED)
    Handwritten mathematical expression recognition aims to automatically generate LaTeX sequences from given images. Currently, attention-based encoder-decoder models are widely used in this task. They typically generate target sequences in a left-to-right (L2R) manner, leaving the right-to-left (R2L) contexts unexploited. In this paper, we propose an Attention aggregation based Bi-directional Mutual learning Network (ABM) which consists of one shared encoder and two parallel inverse decoders (L2R and R2L). The two decoders are enhanced via mutual distillation, which involves one-to-one knowledge transfer at each training step, making full use of the complementary information from two inverse directions. Moreover, in order to deal with mathematical symbols in diverse scales, an Attention Aggregation Module (AAM) is proposed to effectively integrate multi-scale coverage attentions. Notably, in the inference phase, given that the model already learns knowledge from two inverse directions, we only use the L2R branch for inference, keeping the original parameter size and inference speed. Extensive experiments demonstrate that our proposed approach achieves the recognition accuracy of 56.85 % on CROHME 2014, 52.92 % on CROHME 2016, and 53.96 % on CROHME 2019 without data augmentation and model ensembling, substantially outperforming the state-of-the-art methods. The source code is available in https://github.com/XH-B/ABM.
    Adaptive Bandit Convex Optimization with Heterogeneous Curvature. (arXiv:2202.06150v1 [cs.LG])
    We consider the problem of adversarial bandit convex optimization, that is, online learning over a sequence of arbitrary convex loss functions with only one function evaluation for each of them. While all previous works assume known and homogeneous curvature on these loss functions, we study a heterogeneous setting where each function has its own curvature that is only revealed after the learner makes a decision. We develop an efficient algorithm that is able to adapt to the curvature on the fly. Specifically, our algorithm not only recovers or \emph{even improves} existing results for several homogeneous settings, but also leads to surprising results for some heterogeneous settings -- for example, while Hazan and Levy (2014) showed that $\widetilde{O}(d^{3/2}\sqrt{T})$ regret is achievable for a sequence of $T$ smooth and strongly convex $d$-dimensional functions, our algorithm reveals that the same is achievable even if $T^{3/4}$ of them are not strongly convex, and sometimes even if a constant fraction of them are not strongly convex. Our approach is inspired by the framework of Bartlett et al. (2007) who studied a similar heterogeneous setting but with stronger gradient feedback. Extending their framework to the bandit feedback setting requires novel ideas such as lifting the feasible domain and using a logarithmically homogeneous self-concordant barrier regularizer.  ( 2 min )
    Combining Fast and Slow Thinking for Human-like and Efficient Navigation in Constrained Environments. (arXiv:2201.07050v2 [cs.AI] UPDATED)
    Current AI systems lack several important human capabilities, such as adaptability, generalizability, self-control, consistency, common sense, and causal reasoning. We believe that existing cognitive theories of human decision making, such as the thinking fast and slow theory, can provide insights on how to advance AI systems towards some of these capabilities. In this paper, we propose a general architecture that is based on fast/slow solvers and a metacognitive component. We then present experimental results on the behavior of an instance of this architecture, for AI systems that make decisions about navigating in a constrained environment. We show how combining the fast and slow decision modalities allows the system to evolve over time and gradually pass from slow to fast thinking with enough experience, and that this greatly helps in decision quality, resource consumption, and efficiency.
    Topologically Regularized Data Embeddings. (arXiv:2110.09193v3 [cs.LG] UPDATED)
    Unsupervised feature learning often finds low-dimensional embeddings that capture the structure of complex data. For tasks for which prior expert topological knowledge is available, incorporating this into the learned representation may lead to higher quality embeddings. For example, this may help one to embed the data into a given number of clusters, or to accommodate for noise that prevents one from deriving the distribution of the data over the model directly, which can then be learned more effectively. However, a general tool for integrating different prior topological knowledge into embeddings is lacking. Although differentiable topology layers have been recently developed that can (re)shape embeddings into prespecified topological models, they have two important limitations for representation learning, which we address in this paper. First, the currently suggested topological losses fail to represent simple models such as clusters and flares in a natural manner. Second, these losses neglect all original structural (such as neighborhood) information in the data that is useful for learning. We overcome these limitations by introducing a new set of topological losses, and proposing their usage as a way for \emph{topologically regularizing} data embeddings to naturally represent a prespecified model. We include thorough experiments on synthetic and real data that highlight the usefulness and versatility of this approach, with applications ranging from modeling high-dimensional single-cell data, to graph embedding.
    Robust alignment of cross-session recordings of neural population activity by behaviour via unsupervised domain adaptation. (arXiv:2202.06159v1 [q-bio.NC])
    Neural population activity relating to behaviour is assumed to be inherently low-dimensional despite the observed high dimensionality of data recorded using multi-electrode arrays. Therefore, predicting behaviour from neural population recordings has been shown to be most effective when using latent variable models. Over time however, the activity of single neurons can drift, and different neurons will be recorded due to movement of implanted neural probes. This means that a decoder trained to predict behaviour on one day performs worse when tested on a different day. On the other hand, evidence suggests that the latent dynamics underlying behaviour may be stable even over months and years. Based on this idea, we introduce a model capable of inferring behaviourally relevant latent dynamics from previously unseen data recorded from the same animal, without any need for decoder recalibration. We show that unsupervised domain adaptation combined with a sequential variational autoencoder, trained on several sessions, can achieve good generalisation to unseen data and correctly predict behaviour where conventional methods fail. Our results further support the hypothesis that behaviour-related neural dynamics are low-dimensional and stable over time, and will enable more effective and flexible use of brain computer interface technologies.
    Flowformer: Linearizing Transformers with Conservation Flows. (arXiv:2202.06258v1 [cs.LG])
    Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has a quadratic complexity, significantly impeding Transformers from dealing with numerous tokens and scaling up to bigger models. Previous methods mainly utilize the similarity decomposition and the associativity of matrix multiplication to devise linear-time attention mechanisms. They avoid degeneration of attention to a trivial distribution by reintroducing inductive biases such as the locality, thereby at the expense of model generality and expressiveness. In this paper, we linearize Transformers free from specific inductive biases based on the flow network theory. We cast attention as the information flow aggregated from the sources (values) to the sinks (results) through the learned flow capacities (attentions). Within this framework, we apply the property of flow conservation with attention and propose the Flow-Attention mechanism of linear complexity. By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions without using specific inductive biases. Empowered by the Flow-Attention, Flowformer yields strong performance in linear time for wide areas, including long sequence, time series, vision, natural language, and reinforcement learning.
    Distribution augmentation for low-resource expressive text-to-speech. (arXiv:2202.06409v1 [eess.AS])
    This paper presents a novel data augmentation technique for text-to-speech (TTS), that allows to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. The perceptual evaluations show that our method improves speech quality over a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves robustness of attention-based TTS models.
    Models of fairness in federated learning. (arXiv:2112.00818v2 [cs.CY] UPDATED)
    In many real-world situations, data is distributed across multiple locations and can't be combined for training. Federated learning is a novel distributed learning approach that allows multiple federating agents to jointly learn a model. While this approach might reduce the error each agent experiences, it also raises questions of fairness: to what extent can the error experienced by one agent be significantly lower than the error experienced by another agent? In this work, we consider two notions of fairness that each may be appropriate in different circumstances: egalitarian fairness (which aims to bound how dissimilar error rates can be) and proportional fairness (which aims to reward players for contributing more data). For egalitarian fairness, we obtain a tight multiplicative bound on how widely error rates can diverge between agents federating together. For proportional fairness, we show that sub-proportional error (relative to the number of data points contributed) is guaranteed for any individually rational federating coalition.
    Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD. (arXiv:2010.12998v3 [cs.LG] UPDATED)
    Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregations. Despite recent research efforts, the effect of local aggregation on global convergence still lacks theoretical understanding. In this work, we first introduce a new notion of "upward" and "downward" divergences. We then use it to conduct a novel analysis to obtain a worst-case convergence upper bound for two-level H-SGD with non-IID data, non-convex objective function, and stochastic gradient. By extending this result to the case with random grouping, we observe that this convergence upper bound of H-SGD is between the upper bounds of two single-level local SGD settings, with the number of local iterations equal to the local and global update periods in H-SGD, respectively. We refer to this as the "sandwich behavior". Furthermore, we extend our analytical approach based on "upward" and "downward" divergences to study the convergence for the general case of H-SGD with more than two levels, where the "sandwich behavior" still holds. Our theoretical results provide key insights of why local aggregation can be beneficial in improving the convergence of H-SGD.
    Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits. (arXiv:2202.06151v1 [cs.LG])
    We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is poly$(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\widetilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.
    Incremental user embedding modeling for personalized text classification. (arXiv:2202.06369v1 [cs.LG])
    Individual user profiles and interaction histories play a significant role in providing customized experiences in real-world applications such as chatbots, social media, retail, and education. Adaptive user representation learning by utilizing user personalized information has become increasingly challenging due to ever-growing history data. In this work, we propose an incremental user embedding modeling approach, in which embeddings of user's recent interaction histories are dynamically integrated into the accumulated history vectors via a transformer encoder. This modeling paradigm allows us to create generalized user representations in a consecutive manner and also alleviate the challenges of data management. We demonstrate the effectiveness of this approach by applying it to a personalized multi-class classification task based on the Reddit dataset, and achieve 9% and 30% relative improvement on prediction accuracy over a baseline system for two experiment settings through appropriate comment history encoding and task modeling.
    Measuring the Contribution of Multiple Model Representations in Detecting Adversarial Instances. (arXiv:2111.07035v2 [cs.LG] UPDATED)
    Deep learning models have been used for a wide variety of tasks. They are prevalent in computer vision, natural language processing, speech recognition, and other areas. While these models have worked well under many scenarios, it has been shown that they are vulnerable to adversarial attacks. This has led to a proliferation of research into ways that such attacks could be identified and/or defended against. Our goal is to explore the contribution that can be attributed to using multiple underlying models for the purpose of adversarial instance detection. Our paper describes two approaches that incorporate representations from multiple models for detecting adversarial examples. We devise controlled experiments for measuring the detection impact of incrementally utilizing additional models. For many of the scenarios we consider, the results show that performance increases with the number of underlying models used for extracting representations.
    From the String Landscape to the Mathematical Landscape: a Machine-Learning Outlook. (arXiv:2202.06086v1 [hep-th])
    We review the recent programme of using machine-learning to explore the landscape of mathematical problems. With this paradigm as a model for human intuition - complementary to and in contrast with the more formalistic approach of automated theorem proving - we highlight some experiments on how AI helps with conjecture formulation, pattern recognition and computation.
    Blessings and curse of smoothness and phase transitions in nonparametric regressions: a nonasymptotic perspective. (arXiv:2112.03626v2 [math.ST] UPDATED)
    When the regression function belongs to the standard smooth classes consisting of univariate functions with derivatives up to the $(\gamma+1)$th order bounded in absolute values by a common constant everywhere or a.e., it is well known that the minimax optimal rate of convergence in mean squared error (MSE) is $\left(\frac{\sigma^{2}}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ when $\gamma$ is finite and the sample size $n\rightarrow\infty$. From a nonasymptotic viewpoint that does not take $n$ to infinity, this paper shows that: for the standard H\"older and Sobolev classes, the minimax optimal rate is $\frac{\sigma^{2}\left(\gamma+1\right)}{n}$ ($\succsim\left(\frac{\sigma^{2}}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$) when $\frac{n}{\sigma^{2}}\precsim\left(\gamma+1\right)^{2\gamma+3}$ and $\left(\frac{\sigma^{2}}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ ($\succsim\frac{\sigma^{2}\left(\gamma+1\right)}{n}$) when $\frac{n}{\sigma^{2}}\succsim\left(\gamma+1\right)^{2\gamma+3}$. To establish these results, we derive upper and lower bounds on the covering and packing numbers for the generalized H\"older class where the absolute value of the $k$th ($k=0,...,\gamma$) derivative is bounded by a parameter $R_{k}$ and the $\gamma$th derivative is $R_{\gamma+1}-$Lipschitz (and also for the generalized ellipsoid class of smooth functions). Our bounds sharpen the classical metric entropy results for the standard classes, and give the general dependence on $\gamma$ and $R_{k}$. By deriving the minimax optimal MSE rates under various (well motivated) $R_{k}$s for the smooth classes with the help of our new entropy bounds, we show several interesting results that cannot be shown with the existing entropy bounds in the literature.
    FOLD-R++: A Scalable Toolset for Automated Inductive Learning of Default Theories from Mixed Data. (arXiv:2110.07843v3 [cs.LG] UPDATED)
    FOLD-R is an automated inductive learning algorithm for learning default rules for mixed (numerical and categorical) data. It generates an (explainable) answer set programming (ASP) rule set for classification tasks. We present an improved FOLD-R algorithm, called FOLD-R++, that significantly increases the efficiency and scalability of FOLD-R by orders of magnitude. FOLD-R++ improves upon FOLD-R without compromising or losing information in the input training data during the encoding or feature selection phase. The FOLD-R++ algorithm is competitive in performance with the widely-used XGBoost algorithm, however, unlike XGBoost, the FOLD-R++ algorithm produces an explainable model. FOLD-R++ is also competitive in performance with the RIPPER system, however, on large datasets FOLD-R++ outperforms RIPPER. We also create a powerful tool-set by combining FOLD-R++ with s(CASP)-a goal-directed ASP execution engine-to make predictions on new data samples using the answer set program generated by FOLD-R++. The s(CASP) system also produces a justification for the prediction. Experiments presented in this paper show that our improved FOLD-R++ algorithm is a significant improvement over the original design and that the s(CASP) system can make predictions in an efficient manner as well.
    Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data. (arXiv:2112.07891v4 [cs.SD] UPDATED)
    Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18, while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held-out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.
    Leveraging Local Domains for Image-to-Image Translation. (arXiv:2109.04468v3 [cs.CV] UPDATED)
    Image-to-image (i2i) networks struggle to capture local changes because they do not affect the global scene structure. For example, translating from highway scenes to offroad, i2i networks easily focus on global color features but ignore obvious traits for humans like the absence of lane markings. In this paper, we leverage human knowledge about spatial domain characteristics which we refer to as 'local domains' and demonstrate its benefit for image-to-image translation. Relying on a simple geometrical guidance, we train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to target. We experiment on three tasks ranging from unstructured environments to adverse weather. Our comprehensive evaluation setting shows we are able to generate realistic translations, with minimal priors, and training only on a few images. Furthermore, when trained on our translations images we show that all tested proxy tasks are significantly improved, without ever seeing target domain at training.
    Physics and Equality Constrained Artificial Neural Networks: Application to Forward and Inverse Problems with Multi-fidelity Data Fusion. (arXiv:2109.14860v2 [physics.comp-ph] UPDATED)
    Physics-informed neural networks (PINNs) have been proposed to learn the solution of partial differential equations (PDE). In PINNs, the residual form of the PDE of interest and its boundary conditions are lumped into a composite objective function as soft penalties. Here, we show that this specific way of formulating the objective function is the source of severe limitations in the PINN approach when applied to different kinds of PDEs. To address these limitations, we propose a versatile framework based on a constrained optimization problem formulation, where we use the augmented Lagrangian method (ALM) to constrain the solution of a PDE with its boundary conditions and any high-fidelity data that may be available. Our approach is adept at forward and inverse problems with multi-fidelity data fusion. We demonstrate the efficacy and versatility of our physics- and equality-constrained deep-learning framework by applying it to several forward and inverse problems involving multi-dimensional PDEs. Our framework achieves orders of magnitude improvements in accuracy levels in comparison with state-of-the-art physics-informed neural networks.
    CausalSim: Unbiased Trace-Driven Network Simulation. (arXiv:2201.01811v2 [cs.LG] UPDATED)
    Evaluating the real-world performance of network protocols is challenging. Randomized control trials (RCT) are expensive and inaccessible to most, while simulators fail to capture complex behaviors in real networks. We present CausalSim, a trace-driven counterfactual simulator for network protocols that addresses this challenge. Counterfactual simulation aims to predict what would happen using different protocols under the same conditions as a given trace. This is complicated due to the bias introduced by the protocols used during trace collection. CausalSim uses traces from an initial RCT under a set of protocols to learn a causal network model, effectively removing the biases present in the data. Key to CausalSim is mapping the task of counterfactual simulation to that of tensor completion with extremely sparse observations. Through an adversarial neural network training technique that exploits distributional invariances that are present in training data coming from an RCT, CausalSim enables a novel tensor completion method despite the sparsity of observations. Our extensive evaluation of CausalSim on both real and synthetic datasets and two use cases, including more than nine months of real data from the Puffer video streaming system, shows that it provides accurate counterfactual predictions, reducing prediction error by 44% and 53% on average compared to expert-designed and supervised learning baselines.
    Manas: Mining Software Repositories to Assist AutoML. (arXiv:2112.03395v2 [cs.SE] UPDATED)
    Today deep learning is widely used for building software. A software engineering problem with deep learning is that finding an appropriate convolutional neural network (CNN) model for the task can be a challenge for developers. Recent work on AutoML, more precisely neural architecture search (NAS), embodied by tools like Auto-Keras aims to solve this problem by essentially viewing it as a search problem where the starting point is a default CNN model, and mutation of this CNN model allows exploration of the space of CNN models to find a CNN model that will work best for the problem. These works have had significant success in producing high-accuracy CNN models. There are two problems, however. First, NAS can be very costly, often taking several hours to complete. Second, CNN models produced by NAS can be very complex that makes it harder to understand them and costlier to train them. We propose a novel approach for NAS, where instead of starting from a default CNN model, the initial model is selected from a repository of models extracted from GitHub. The intuition being that developers solving a similar problem may have developed a better starting point compared to the default model. We also analyze common layer patterns of CNN models in the wild to understand changes that the developers make to improve their models. Our approach uses commonly occurring changes as mutation operators in NAS. We have extended Auto-Keras to implement our approach. Our evaluation using 8 top voted problems from Kaggle for tasks including image classification and image regression shows that given the same search time, without loss of accuracy, Manas produces models with 42.9% to 99.6% fewer number of parameters than Auto-Keras' models. Benchmarked on GPU, Manas' models train 30.3% to 641.6% faster than Auto-Keras' models.
    A New Notion of Individually Fair Clustering: $\alpha$-Equitable $k$-Center. (arXiv:2106.05423v3 [cs.LG] UPDATED)
    Clustering is a fundamental problem in unsupervised machine learning, and fair variants of it have recently received significant attention due to its societal implications. In this work we introduce a novel definition of individual fairness for clustering problems. Specifically, in our model, each point $j$ has a set of other points $\mathcal{S}_j$ that it perceives as similar to itself, and it feels that it is fairly treated if the quality of service it receives in the solution is $\alpha$-close (in a multiplicative sense, for a given $\alpha \geq 1$) to that of the points in $\mathcal{S}_j$. We begin our study by answering questions regarding the structure of the problem, namely for what values of $\alpha$ the problem is well-defined, and what the behavior of the \emph{Price of Fairness (PoF)} for it is. For the well-defined region of $\alpha$, we provide efficient and easily-implementable approximation algorithms for the $k$-center objective, which in certain cases enjoy bounded-PoF guarantees. We finally complement our analysis by an extensive suite of experiments that validates the effectiveness of our theoretical results.
    AlphaFold Accelerates Artificial Intelligence Powered Drug Discovery: Efficient Discovery of a Novel Cyclin-dependent Kinase 20 (CDK20) Small Molecule Inhibitor. (arXiv:2201.09647v2 [q-bio.BM] UPDATED)
    The AlphaFold computer program predicted protein structures for the whole human genome, which has been considered as a remarkable breakthrough both in artificial intelligence (AI) application and structural biology. Despite the varying confidence level, these predicted structures still could significantly contribute to structure-based drug design of novel targets, especially the ones with no or limited structural information. In this work, we successfully applied AlphaFold in our end-to-end AI-powered drug discovery engines constituted of a biocomputational platform PandaOmics and a generative chemistry platform Chemistry42, to identify a first-in-class hit molecule of a novel target without an experimental structure starting from target selection towards hit identification in a cost- and time-efficient manner. PandaOmics provided the targets of interest and Chemistry42 generated the molecules based on the AlphaFold predicted structure, and the selected molecules were synthesized and tested in biological assays. Through this approach, we identified a small molecule hit compound for CDK20 with a Kd value of 8.9 +/- 1.6 uM (n = 4) within 30 days from target selection and after only synthesizing 7 compounds. Based on the available data, the second round of AI-powered compound generation was conducted and through which, a more potent hit molecule, ISM042-2 048, was discovered with a Kd value of 210.0 +/- 42.4 nM (n = 2), within 30 days and after synthesizing 6 compounds from the discovery of the first hit ISM042-2-001. To the best of our knowledge, this is the first reported small molecule targeting CDK20 and more importantly, this work is the first demonstration of AlphaFold application in the hit identification process in early drug discovery.
    MIONet: Learning multiple-input operators via tensor product. (arXiv:2202.06137v1 [cs.LG])
    As an emerging paradigm in scientific machine learning, neural operators aim to learn operators, via neural networks, that map between infinite-dimensional function spaces. Several neural operators have been recently developed. However, all the existing neural operators are only designed to learn operators defined on a single Banach space, i.e., the input of the operator is a single function. Here, for the first time, we study the operator regression via neural networks for multiple-input operators defined on the product of Banach spaces. We first prove a universal approximation theorem of continuous multiple-input operators. We also provide detailed theoretical analysis including the approximation error, which provides a guidance of the design of the network architecture. Based on our theory and a low-rank approximation, we propose a novel neural operator, MIONet, to learn multiple-input operators. MIONet consists of several branch nets for encoding the input functions and a trunk net for encoding the domain of the output function. We demonstrate that MIONet can learn solution operators involving systems governed by ordinary and partial differential equations. In our computational examples, we also show that we can endow MIONet with prior knowledge of the underlying system, such as linearity and periodicity, to further improve the accuracy.
    Desiderata for Explainable AI in statistical production systems of the European Central Bank. (arXiv:2107.08045v2 [cs.CY] UPDATED)
    Explainable AI constitutes a fundamental step towards establishing fairness and addressing bias in algorithmic decision-making. Despite the large body of work on the topic, the benefit of solutions is mostly evaluated from a conceptual or theoretical point of view and the usefulness for real-world use cases remains uncertain. In this work, we aim to state clear user-centric desiderata for explainable AI reflecting common explainability needs experienced in statistical production systems of the European Central Bank. We link the desiderata to archetypical user roles and give examples of techniques and methods which can be used to address the user's needs. To this end, we provide two concrete use cases from the domain of statistical data production in central banks: the detection of outliers in the Centralised Securities Database and the data-driven identification of data quality checks for the Supervisory Banking data system.
    Brain dynamics via Cumulative Auto-Regressive Self-Attention. (arXiv:2111.01271v2 [cs.LG] UPDATED)
    Multivariate dynamical processes can often be intuitively described by a weighted connectivity graph between components representing each individual time-series. Even a simple representation of this graph as a Pearson correlation matrix may be informative and predictive as demonstrated in the brain imaging literature. However, there is a consensus expectation that powerful graph neural networks (GNNs) should perform better in similar settings. In this work, we present a model that is considerably shallow than deep GNNs, yet outperforms them in predictive accuracy in a brain imaging application. Our model learns the autoregressive structure of individual time series and estimates directed connectivity graphs between the learned representations via a self-attention mechanism in an end-to-end fashion. The supervised training of the model as a classifier between patients and controls results in a model that generates directed connectivity graphs and highlights the components of the time-series that are predictive for each subject. We demonstrate our results on a functional neuroimaging dataset classifying schizophrenia patients and controls.
    A Private and Computationally-Efficient Estimator for Unbounded Gaussians. (arXiv:2111.04609v2 [stat.ML] UPDATED)
    We give the first polynomial-time, polynomial-sample, differentially private estimator for the mean and covariance of an arbitrary Gaussian distribution $\mathcal{N}(\mu,\Sigma)$ in $\mathbb{R}^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $\mu$ and $\Sigma$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $\mathcal{N}(0,\Sigma)$ and returns a matrix $A$ such that $A \Sigma A^T$ has constant condition number.
    A Survey on Data-driven Software Vulnerability Assessment and Prioritization. (arXiv:2107.08364v3 [cs.SE] UPDATED)
    Software Vulnerabilities (SVs) are increasing in complexity and scale, posing great security risks to many software systems. Given the limited resources in practice, SV assessment and prioritization help practitioners devise optimal SV mitigation plans based on various SV characteristics. The surge in SV data sources and data-driven techniques such as Machine Learning and Deep Learning have taken SV assessment and prioritization to the next level. Our survey provides a taxonomy of the past research efforts and highlights the best practices for data-driven SV assessment and prioritization. We also discuss the current limitations and propose potential solutions to address such issues.
    Towards Practical Credit Assignment for Deep Reinforcement Learning. (arXiv:2106.04499v2 [cs.LG] UPDATED)
    Credit assignment is a fundamental problem in reinforcement learning, the problem of measuring an action's influence on future rewards. Explicit credit assignment methods have the potential to boost the performance of RL algorithms on many tasks, but thus far remain impractical for general use. Recently, a family of methods called Hindsight Credit Assignment (HCA) was proposed, which explicitly assign credit to actions in hindsight based on the probability of the action having led to an observed outcome. This approach has appealing properties, but remains a largely theoretical idea applicable to a limited set of tabular RL tasks. Moreover, it is unclear how to extend HCA to deep RL environments. In this work, we explore the use of HCA-style credit in a deep RL context. We first describe the limitations of existing HCA algorithms in deep RL that lead to their poor performance or complete lack of training, then propose several theoretically-justified modifications to overcome them. We explore the quantitative and qualitative effects of the resulting algorithm on the Arcade Learning Environment (ALE) benchmark, and observe that it improves performance over Advantage Actor-Critic (A2C) on many games where non-trivial credit assignment is necessary to achieve high scores and where hindsight probabilities can be accurately estimated.
    Robustness against Adversarial Attacks in Neural Networks using Incremental Dissipativity. (arXiv:2111.12906v2 [cs.LG] UPDATED)
    Adversarial examples can easily degrade the classification performance in neural networks. Empirical methods for promoting robustness to such examples have been proposed, but often lack both analytical insights and formal guarantees. Recently, some robustness certificates have appeared in the literature based on system theoretic notions. This work proposes an incremental dissipativity-based robustness certificate for neural networks in the form of a linear matrix inequality for each layer. We also propose an equivalent spectral norm bound for this certificate which is scalable to neural networks with multiple layers. We demonstrate the improved performance against adversarial attacks on a feed-forward neural network trained on MNIST and an Alexnet trained using CIFAR-10.
    Distributed Sparse Normal Means Estimation with Sublinear Communication. (arXiv:2102.03060v2 [stat.ML] UPDATED)
    We consider the problem of sparse normal means estimation in a distributed setting with communication constraints. We assume there are $M$ machines, each holding $d$-dimensional observations of a $K$-sparse vector $\mu$ corrupted by additive Gaussian noise. The $M$ machines are connected in a star topology to a fusion center, whose goal is to estimate the vector $\mu$ with a low communication budget. Previous works have shown that to achieve the centralized minimax rate for the $\ell_2$ risk, the total communication must be high - at least linear in the dimension $d$. This phenomenon occurs, however, at very weak signals. We show that at signal-to-noise ratios (SNRs) that are sufficiently high - but not enough for recovery by any individual machine - the support of $\mu$ can be correctly recovered with significantly less communication. Specifically, we present two algorithms for distributed estimation of a sparse mean vector corrupted by either Gaussian or sub-Gaussian noise. We then prove that above certain SNR thresholds, with high probability, these algorithms recover the correct support with total communication that is sublinear in the dimension $d$. Furthermore, the communication decreases exponentially as a function of signal strength. If in addition $KM\ll \tfrac{d}{\log d}$, then with an additional round of sublinear communication, our algorithms achieve the centralized rate for the $\ell_2$ risk. Finally, we present simulations that illustrate the performance of our algorithms in different parameter regimes.
    Machine Learning-Based COVID-19 Patients Triage Algorithm using Patient-Generated Health Data from Nationwide Multicenter Database. (arXiv:2109.09001v5 [cs.LG] UPDATED)
    A prompt severity assessment model of patients confirmed for having infectious diseases could enable efficient diagnosis while alleviating burden on the medical system. This study aims to develop a SARS-CoV-2 severity assessment model and establish a medical system that allows patients to check the severity of their cases and informs them to visit the appropriate clinic center based on past treatment data of other patients with similar severity levels. This paper provides the development processes of a severity assessment model using machine learning techniques and its application on SARS-CoV-2 patients. The proposed model is trained on a nationwide dataset provided by a Korean government agency and only requires patients' basic personal data, allowing them to judge the severity of their own cases. After modeling, the boosting-based decision tree model was selected as the classifier while mortality rate was interpreted as the probability score. The dataset was collected from all Korean citizens who were confirmed with COVID-19 between February, 2020 and July, 2021. The experiments achieved high model performance with an approximate precision of $0{\cdot}923$ and AUROC score of $0{\cdot}950$ [$95$% Tolerance Interval $0{\cdot}940$-$0{\cdot}958$, $95$% Confidence Interval $0{\cdot}949$-$0{\cdot}950$]. Moreover, our experiments identified the most important variables affecting the severity in the model via sensitivity analysis. The prompt severity assessment model for managing infectious people has been attained through using a nationwide dataset. It has demonstrated its superior performance by surpassing that of conventional risk assessments. With the model's high performance and easily accessible features, the triage algorithm is expected to be particularly useful when patients monitor their health status by themselves through smartphone applications.
    CoarSAS2hvec: Heterogeneous Information Network Embedding with Balanced Network Sampling. (arXiv:2110.05820v2 [cs.LG] UPDATED)
    Heterogeneous information network (HIN) embedding aims to find the representations of nodes that preserve the proximity between entities of different nature. A family of approaches that are wildly adopted applies random walk to generate a sequence of heterogeneous context, from which the embedding is learned. However, due to the multipartite graph structure of HIN, hub nodes tend to be over-represented in the sampled sequence, giving rise to imbalanced samples of the network. Here we propose a new embedding method CoarSAS2hvec. The self-avoid short sequence sampling with the HIN coarsening procedure (CoarSAS) is utilized to better collect the rich information in HIN. An optimized loss function is used to improve the performance of the HIN structure embedding. CoarSAS2hvec outperforms nine other methods in two different tasks on four real-world data sets. The ablation study confirms that the samples collected by CoarSAS contain richer information of the network compared with those by other methods, which is characterized by a higher information entropy. Hence, the traditional loss function applied to samples by CoarSAS can also yield improved results. Our work addresses a limitation of the random-walk-based HIN embedding that has not been emphasized before, which can shed light on a range of problems in HIN analyses.
    Active noise control techniques for nonlinear systems. (arXiv:2110.09672v2 [cs.LG] UPDATED)
    Most of the literature focuses on the development of the linear active noise control (ANC) techniques. However, ANC systems might have to deal with some nonlinear components and the performance of linear ANC techniques may degrade in this scenario. To overcome this limitation, nonlinear ANC (NLANC) algorithms were developed. In Part II, we review the development of NLANC algorithms during the last decade. The contributions of heuristic ANC algorithms are outlined. Moreover, we emphasize recent advances of NLANC algorithms, such as spline ANC algorithms, kernel adaptive filters, and nonlinear distributed ANC algorithms. Then, we present recent applications of ANC technique including linear and nonlinear perspectives. Future research challenges regarding ANC techniques are also discussed.
    Benchmarking the Spectrum of Agent Capabilities. (arXiv:2109.06780v2 [cs.AI] UPDATED)
    Evaluating the general abilities of intelligent agents requires complex simulation environments. Existing benchmarks typically evaluate only one narrow task per environment, requiring researchers to perform expensive training runs on many different environments. We introduce Crafter, an open world survival game with visual inputs that evaluates a wide range of general abilities within a single environment. Agents either learn from the provided reward signal or through intrinsic objectives and are evaluated by semantically meaningful achievements that can be unlocked during each episode, such as discovering resources and crafting tools. Consistently unlocking all achievements requires strong generalization, deep exploration, and long-term reasoning. We experimentally verify that Crafter is of appropriate difficulty to drive future research and provide baselines scores of reward agents and unsupervised agents. Furthermore, we observe sophisticated behaviors emerging from maximizing the reward signal, such as building tunnel systems, bridges, houses, and plantations. We hope that Crafter will accelerate research progress by quickly evaluating a wide spectrum of abilities.
    A survey on deep learning approaches for breast cancer diagnosis. (arXiv:2109.08853v2 [eess.IV] UPDATED)
    Deep learning has introduced several learning-based methods to recognize breast tumours and presents high applicability in breast cancer diagnostics. It has presented itself as a practical installment in Computer-Aided Diagnostic (CAD) systems to further assist radiologists in diagnostics for different modalities. A deep learning network trained on images provided by hospitals or public databases can perform classification, detection, and segmentation of lesion types. Significant progress has been made in recognizing tumours on 2D images but recognizing 3D images remains a frontier so far. The interconnection of deep learning networks between different fields of study help propels discoveries for more efficient, accurate, and robust networks. In this review paper, the following topics will be explored: (i) theory and application of deep learning, (ii) progress of 2D, 2.5D, and 3D CNN approaches in breast tumour recognition from a performance metric perspective, and (iii) challenges faced in CNN approaches.
    RETE: Retrieval-Enhanced Temporal Event Forecasting on Unified Query Product Evolutionary Graph. (arXiv:2202.06129v1 [cs.IR])
    With the increasing demands on e-commerce platforms, numerous user action history is emerging. Those enriched action records are vital to understand users' interests and intents. Recently, prior works for user behavior prediction mainly focus on the interactions with product-side information. However, the interactions with search queries, which usually act as a bridge between users and products, are still under investigated. In this paper, we explore a new problem named temporal event forecasting, a generalized user behavior prediction task in a unified query product evolutionary graph, to embrace both query and product recommendation in a temporal manner. To fulfill this setting, there involves two challenges: (1) the action data for most users is scarce; (2) user preferences are dynamically evolving and shifting over time. To tackle those issues, we propose a novel Retrieval-Enhanced Temporal Event (RETE) forecasting framework. Unlike existing methods that enhance user representations via roughly absorbing information from connected entities in the whole graph, RETE efficiently and dynamically retrieves relevant entities centrally on each user as high-quality subgraphs, preventing the noise propagation from the densely evolutionary graph structures that incorporate abundant search queries. And meanwhile, RETE autoregressively accumulates retrieval-enhanced user representations from each time step, to capture evolutionary patterns for joint query and product prediction. Empirically, extensive experiments on both the public benchmark and four real-world industrial datasets demonstrate the effectiveness of the proposed RETE method.
    Mastering Atari with Discrete World Models. (arXiv:2010.02193v4 [cs.LG] UPDATED)
    Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasible for some tasks, modeling Atari games accurately enough to derive successful behaviors has remained an open challenge for many years. We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model. The world model uses discrete representations and is trained separately from the policy. DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model. With the same computational budget and wall-clock time, Dreamer V2 reaches 200M frames and surpasses the final performance of the top single-GPU agents IQN and Rainbow. DreamerV2 is also applicable to tasks with continuous actions, where it learns an accurate world model of a complex humanoid robot and solves stand-up and walking from only pixel inputs.
    Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics. (arXiv:2009.00774v4 [cs.LG] UPDATED)
    Poisoning attacks on Reinforcement Learning (RL) systems could take advantage of RL algorithm's vulnerabilities and cause failure of the learning. However, prior works on poisoning RL usually either unrealistically assume the attacker knows the underlying Markov Decision Process (MDP), or directly apply the poisoning methods in supervised learning to RL. In this work, we build a generic poisoning framework for online RL via a comprehensive investigation of heterogeneous poisoning models in RL. Without any prior knowledge of the MDP, we propose a strategic poisoning algorithm called Vulnerability-Aware Adversarial Critic Poison (VA2C-P), which works for most policy-based deep RL agents, closing the gap that no poisoning method exists for policy-based RL agents. VA2C-P uses a novel metric, stability radius in RL, that measures the vulnerability of RL algorithms. Experiments on multiple deep RL agents and multiple environments show that our poisoning algorithm successfully prevents agents from learning a good policy or teaches the agents to converge to a target policy, with a limited attacking budget.
    Physics-informed neural networks for the shallow-water equations on the sphere. (arXiv:2104.00615v3 [physics.comp-ph] UPDATED)
    We propose the use of physics-informed neural networks for solving the shallow-water equations on the sphere in the meteorological context. Physics-informed neural networks are trained to satisfy the differential equations along with the prescribed initial and boundary data, and thus can be seen as an alternative approach to solving differential equations compared to traditional numerical approaches such as finite difference, finite volume or spectral methods. We discuss the training difficulties of physics-informed neural networks for the shallow-water equations on the sphere and propose a simple multi-model approach to tackle test cases of comparatively long time intervals. Here we train a sequence of neural networks instead of a single neural network for the entire integration interval. We also avoid the use of a boundary value loss by encoding the boundary conditions in a custom neural network layer. We illustrate the abilities of the method by solving the most prominent test cases proposed by Williamson et al. [J. Comput. Phys. 102 (1992), 211-224].
    Visual Representation Learning Does Not Generalize Strongly Within the Same Domain. (arXiv:2107.08221v4 [cs.LG] UPDATED)
    An important component for generalization in machine learning is to uncover underlying latent factors of variation as well as the mechanism through which each factor acts in the world. In this paper, we test whether 17 unsupervised, weakly supervised, and fully supervised representation learning approaches correctly infer the generative factors of variation in simple datasets (dSprites, Shapes3D, MPI3D) from controlled environments, and on our contributed CelebGlow dataset. In contrast to prior robustness work that introduces novel factors of variation during test time, such as blur or other (un)structured noise, we here recompose, interpolate, or extrapolate only existing factors of variation from the training data set (e.g., small and medium-sized objects during training and large objects during testing). Models that learn the correct mechanism should be able to generalize to this benchmark. In total, we train and test 2000+ models and observe that all of them struggle to learn the underlying mechanism regardless of supervision signal and architectural bias. Moreover, the generalization capabilities of all tested models drop significantly as we move from artificial datasets towards more realistic real-world datasets. Despite their inability to identify the correct mechanism, the models are quite modular as their ability to infer other in-distribution factors remains fairly stable, providing only a single factor is out-of-distribution. These results point to an important yet understudied problem of learning mechanistic models of observations that can facilitate generalization.
    LEADS: Learning Dynamical Systems that Generalize Across Environments. (arXiv:2106.04546v2 [cs.LG] UPDATED)
    When modeling dynamical systems from real-world data samples, the distribution of data often changes according to the environment in which they are captured, and the dynamics of the system itself vary from one environment to another. Generalizing across environments thus challenges the conventional frameworks. The classical settings suggest either considering data as i.i.d. and learning a single model to cover all situations or learning environment-specific models. Both are sub-optimal: the former disregards the discrepancies between environments leading to biased solutions, while the latter does not exploit their potential commonalities and is prone to scarcity problems. We propose LEADS, a novel framework that leverages the commonalities and discrepancies among known environments to improve model generalization. This is achieved with a tailored training formulation aiming at capturing common dynamics within a shared model while additional terms capture environment-specific dynamics. We ground our approach in theory, exhibiting a decrease in sample complexity with our approach and corroborate these results empirically, instantiating it for linear dynamics. Moreover, we concretize this framework for neural networks and evaluate it experimentally on representative families of nonlinear dynamics. We show that this new setting can exploit knowledge extracted from environment-dependent data and improves generalization for both known and novel environments. Code is available at https://github.com/yuan-yin/LEADS.
    MCDAL: Maximum Classifier Discrepancy for Active Learning. (arXiv:2107.11049v2 [cs.CV] UPDATED)
    Recent state-of-the-art active learning methods have mostly leveraged Generative Adversarial Networks (GAN) for sample acquisition; however, GAN is usually known to suffer from instability and sensitivity to hyper-parameters. In contrast to these methods, we propose in this paper a novel active learning framework that we call Maximum Classifier Discrepancy for Active Learning (MCDAL) which takes the prediction discrepancies between multiple classifiers. In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies among them. Intuitively, the discrepancies in the auxiliary classification layers' predictions indicate the uncertainty in the prediction. In this regard, we propose a novel method to leverage the classifier discrepancies for the acquisition function for active learning. We also provide an interpretation of our idea in relation to existing GAN based active learning methods and domain adaptation frameworks. Moreover, we empirically demonstrate the utility of our approach where the performance of our approach exceeds the state-of-the-art methods on several image classification and semantic segmentation datasets in active learning setups.
    Unveiling the structure of wide flat minima in neural networks. (arXiv:2107.01163v5 [cond-mat.dis-nn] UPDATED)
    The success of deep learning has revealed the application potential of neural networks across the sciences and opened up fundamental theoretical problems. In particular, the fact that learning algorithms based on simple variants of gradient methods are able to find near-optimal minima of highly nonconvex loss functions is an unexpected feature of neural networks. Moreover, such algorithms are able to fit the data even in the presence of noise, and yet they have excellent predictive capabilities. Several empirical results have shown a reproducible correlation between the so-called flatness of the minima achieved by the algorithms and the generalization performance. At the same time, statistical physics results have shown that in nonconvex networks a multitude of narrow minima may coexist with a much smaller number of wide flat minima, which generalize well. Here we show that wide flat minima arise as complex extensive structures, from the coalescence of minima around "high-margin" (i.e., locally robust) configurations. Despite being exponentially rare compared to zero-margin ones, high-margin minima tend to concentrate in particular regions. These minima are in turn surrounded by other solutions of smaller and smaller margin, leading to dense regions of solutions over long distances. Our analysis also provides an alternative analytical method for estimating when flat minima appear and when algorithms begin to find solutions, as the number of model parameters varies.
    Breaking the Deadly Triad with a Target Network. (arXiv:2101.08862v8 [cs.LG] UPDATED)
    The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously. In this paper, we investigate the target network as a tool for breaking the deadly triad, providing theoretical support for the conventional wisdom that a target network stabilizes training. We first propose and analyze a novel target network update rule which augments the commonly used Polyak-averaging style update with two projections. We then apply the target network and ridge regularization in several divergent algorithms and show their convergence to regularized TD fixed points. Those algorithms are off-policy with linear function approximation and bootstrapping, spanning both policy evaluation and control, as well as both discounted and average-reward settings. In particular, we provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
    On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction. (arXiv:2106.00993v2 [cs.LG] UPDATED)
    In this paper, we study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction under function approximation setting, where the objective function is formulated as a max-max-min optimization problem. We characterize the bias of the learning objective and present two strategies with finite-time convergence guarantees. In our first strategy, we present algorithm P-SREDA with convergence rate $O(\epsilon^{-3})$, whose dependency on $\epsilon$ is optimal. In our second strategy, we propose a new off-policy actor-critic style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
    On Federated Learning with Energy Harvesting Clients. (arXiv:2202.06105v1 [eess.SP])
    Catering to the proliferation of Internet of Things devices and distributed machine learning at the edge, we propose an energy harvesting federated learning (EHFL) framework in this paper. The introduction of EH implies that a client's availability to participate in any FL round cannot be guaranteed, which complicates the theoretical analysis. We derive novel convergence bounds that capture the impact of time-varying device availabilities due to the random EH characteristics of the participating clients, for both parallel and local stochastic gradient descent (SGD) with non-convex loss functions. The results suggest that having a uniform client scheduling that maximizes the minimum number of clients throughout the FL process is desirable, which is further corroborated by the numerical experiments using a real-world FL task and a state-of-the-art EH scheduler.
    The Impact of Using Regression Models to Build Defect Classifiers. (arXiv:2202.06157v1 [cs.SE])
    It is common practice to discretize continuous defect counts into defective and non-defective classes and use them as a target variable when building defect classifiers (discretized classifiers). However, this discretization of continuous defect counts leads to information loss that might affect the performance and interpretation of defect classifiers. Another possible approach to build defect classifiers is through the use of regression models then discretizing the predicted defect counts into defective and non-defective classes (regression-based classifiers). In this paper, we compare the performance and interpretation of defect classifiers that are built using both approaches (i.e., discretized classifiers and regression-based classifiers) across six commonly used machine learning classifiers (i.e., linear/logistic regression, random forest, KNN, SVM, CART, and neural networks) and 17 datasets. We find that: i) Random forest based classifiers outperform other classifiers (best AUC) for both classifier building approaches; ii) In contrast to common practice, building a defect classifier using discretized defect counts (i.e., discretized classifiers) does not always lead to better performance. Hence we suggest that future defect classification studies should consider building regression-based classifiers (in particular when the defective ratio of the modeled dataset is low). Moreover, we suggest that both approaches for building defect classifiers should be explored, so the best-performing classifier can be used when determining the most influential features.
    Physics-Guided Problem Decomposition for Scaling Deep Learning of High-dimensional Eigen-Solvers: The Case of Schr\"{o}dinger's Equation. (arXiv:2202.05994v1 [cs.LG])
    Given their ability to effectively learn non-linear mappings and perform fast inference, deep neural networks (NNs) have been proposed as a viable alternative to traditional simulation-driven approaches for solving high-dimensional eigenvalue equations (HDEs), which are the foundation for many scientific applications. Unfortunately, for the learned models in these scientific applications to achieve generalization, a large, diverse, and preferably annotated dataset is typically needed and is computationally expensive to obtain. Furthermore, the learned models tend to be memory- and compute-intensive primarily due to the size of the output layer. While generalization, especially extrapolation, with scarce data has been attempted by imposing physical constraints in the form of physics loss, the problem of model scalability has remained. In this paper, we alleviate the compute bottleneck in the output layer by using physics knowledge to decompose the complex regression task of predicting the high-dimensional eigenvectors into multiple simpler sub-tasks, each of which are learned by a simple "expert" network. We call the resulting architecture of specialized experts Physics-Guided Mixture-of-Experts (PG-MoE). We demonstrate the efficacy of such physics-guided problem decomposition for the case of the Schr\"{o}dinger's Equation in Quantum Mechanics. Our proposed PG-MoE model predicts the ground-state solution, i.e., the eigenvector that corresponds to the smallest possible eigenvalue. The model is 150x smaller than the network trained to learn the complex task while being competitive in generalization. To improve the generalization of the PG-MoE, we also employ a physics-guided loss function based on variational energy, which by quantum mechanics principles is minimized iff the output is the ground-state solution.
    Fairness-aware Configuration of Machine Learning Libraries. (arXiv:2202.06196v1 [cs.SE])
    This paper investigates the parameter space of machine learning (ML) algorithms in aggravating or mitigating fairness bugs. Data-driven software is increasingly applied in social-critical applications where ensuring fairness is of paramount importance. The existing approaches focus on addressing fairness bugs by either modifying the input dataset or modifying the learning algorithms. On the other hand, the selection of hyperparameters, which provide finer controls of ML algorithms, may enable a less intrusive approach to influence the fairness. Can hyperparameters amplify or suppress discrimination present in the input dataset? How can we help programmers in detecting, understanding, and exploiting the role of hyperparameters to improve the fairness? We design three search-based software testing algorithms to uncover the precision-fairness frontier of the hyperparameter space. We complement these algorithms with statistical debugging to explain the role of these parameters in improving fairness. We implement the proposed approaches in the tool Parfait-ML (PARameter FAIrness Testing for ML Libraries) and show its effectiveness and utility over five mature ML algorithms as used in six social-critical applications. In these applications, our approach successfully identified hyperparameters that significantly improve (vis-a-vis the state-of-the-art techniques) the fairness without sacrificing precision. Surprisingly, for some algorithms (e.g., random forest), our approach showed that certain configuration of hyperparameters (e.g., restricting the search space of attributes) can amplify biases across applications. Upon further investigation, we found intuitive explanations of these phenomena, and the results corroborate similar observations from the literature.
    Improving Fraud detection via Hierarchical Attention-based Graph Neural Network. (arXiv:2202.06096v1 [cs.LG])
    Graph neural networks (GNN) have emerged as a powerful tool for fraud detection tasks, where fraudulent nodes are identified by aggregating neighbor information via different relations. To get around such detection, crafty fraudsters resort to camouflage via connecting to legitimate users (i.e., relation camouflage) or providing seemingly legitimate feedbacks (i.e., feature camouflage). A wide-spread solution reinforces the GNN aggregation process with neighbor selectors according to original node features. This method may carry limitations when identifying fraudsters not only with the relation camouflage, but with the feature camouflage making them hard to distinguish from their legitimate neighbors. In this paper, we propose a Hierarchical Attention-based Graph Neural Network (HA-GNN) for fraud detection, which incorporates weighted adjacency matrices across different relations against camouflage. This is motivated in the Relational Density Theory and is exploited for forming a hierarchical attention-based graph neural network. Specifically, we design a relation attention module to reflect the tie strength between two nodes, while a neighborhood attention module to capture the long-range structural affinity associated with the graph. We generate node embeddings by aggregating information from local/long-range structures and original node features. Experiments on three real-world datasets demonstrate the effectiveness of our model over the state-of-the-arts.
    Understanding Natural Gradient in Sobolev Spaces. (arXiv:2202.06232v1 [cs.LG])
    While natural gradients have been widely studied from both theoretical and empirical perspectives, we argue that a fundamental theoretical issue regarding the existence of gradients in infinite dimensional function spaces remains underexplored. We therefore study the natural gradient induced by Sobolevmetrics and develop several rigorous results. Our results also establish new connections between natural gradients and RKHS theory, and specifically to the Neural Tangent Kernel (NTK). We develop computational techniques for the efficient approximation of the proposed Sobolev Natural Gradient. Preliminary experimental results reveal the potential of this new natural gradient variant.
    Private Adaptive Optimization with Side Information. (arXiv:2202.05963v1 [cs.LG])
    Adaptive optimization methods have become the default solvers for many machine learning tasks. Unfortunately, the benefits of adaptivity may degrade when training with differential privacy, as the noise added to ensure privacy reduces the effectiveness of the adaptive preconditioner. To this end, we propose AdaDPS, a general framework that uses non-sensitive side information to precondition the gradients, allowing the effective use of adaptive methods in private settings. We formally show AdaDPS reduces the amount of noise needed to achieve similar privacy guarantees, thereby improving optimization performance. Empirically, we leverage simple and readily available side information to explore the performance of AdaDPS in practice, comparing to strong baselines in both centralized and federated settings. Our results show that AdaDPS improves accuracy by 7.7% (absolute) on average -- yielding state-of-the-art privacy-utility trade-offs on large-scale text and image benchmarks.
    Impact of Discretization Noise of the Dependent variable on Machine Learning Classifiers in Software Engineering. (arXiv:2202.06146v1 [cs.SE])
    Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., median). However, such discretization may introduce noise (i.e., discretization noise) due to ambiguous class loyalty of data points that are close to the artificial threshold. Previous studies do not provide a clear directive on the impact of discretization noise on the classifiers and how to handle such noise. In this paper, we propose a framework to help researchers and practitioners systematically estimate the impact of discretization noise on classifiers in terms of its impact on various performance measures and the interpretation of classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) Though the interpretation of the classifiers are impacted by the discretization noise on the whole, the top 3 most important features are not affected by the discretization noise. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of their built classifiers and estimate the exact amount of discretization noise to be discarded from the dataset to avoid the negative impact of such noise.
    Automatic Issue Classifier: A Transfer Learning Framework for Classifying Issue Reports. (arXiv:2202.06149v1 [cs.SE])
    Issue tracking systems are used in the software industry for the facilitation of maintenance activities that keep the software robust and up to date with ever-changing industry requirements. Usually, users report issues that can be categorized into different labels such as bug reports, enhancement requests, and questions related to the software. Most of the issue tracking systems make the labelling of these issue reports optional for the issue submitter, which leads to a large number of unlabeled issue reports. In this paper, we present a state-of-the-art method to classify the issue reports into their respective categories i.e. bug, enhancement, and question. This is a challenging task because of the common use of informal language in the issue reports. Existing studies use traditional natural language processing approaches adopting key-word based features, which fail to incorporate the contextual relationship between words and therefore result in a high rate of false positives and false negatives. Moreover, previous works utilize a uni-label approach to classify the issue reports however, in reality, an issue-submitter can tag one issue report with more than one label at a time. This paper presents our approach to classify the issue reports in a multi-label setting. We use an off-the-shelf neural network called RoBERTa and fine-tune it to classify the issue reports. We validate our approach on issue reports belonging to numerous industrial projects from GitHub. We were able to achieve promising F-1 scores of 81%, 74%, and 80% for bug reports, enhancements, and questions, respectively. We also develop an industry tool called Automatic Issue Classifier (AIC), which automatically assigns labels to newly reported issues on GitHub repositories with high accuracy.
    Stochastic Strategic Patient Buyers: Revenue maximization using posted prices. (arXiv:2202.06143v1 [cs.GT])
    We consider a seller faced with buyers which have the ability to delay their decision, which we call patience. Each buyer's type is composed of value and patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyer. Our main results are the following. $\bullet$ We formalize this setting and characterize the resulting Stackelberg equilibrium, where the seller first commits to her strategy and then the buyers best respond. $\bullet$ We show a separation between the best fixed price, the best pure strategy, which is a fixed sequence of prices, and the best mixed strategy, which is a distribution over price sequences. $\bullet$ We characterize both the optimal pure strategy of the seller and the buyer's best response strategy to any seller's mixed strategy. $\bullet$ We show how to compute efficiently the optimal pure strategy and give an algorithm for the optimal mixed strategy (which is exponential in the maximum patience). We then consider a learning setting, where the seller does not have access to the distribution over buyer's types. Our main results are the following. $\bullet$ We derive a sample complexity bound for the learning of an approximate optimal pure strategy, by computing the fat-shattering dimension of this setting. $\bullet$ We give a general sample complexity bound for the approximate optimal mixed strategy. $\bullet$ We consider an online setting and derive a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy.
    Online Bayesian Recommendation with No Regret. (arXiv:2202.06135v1 [cs.GT])
    We introduce and study the online Bayesian recommendation problem for a platform, who can observe a utility-relevant state of a product, repeatedly interacting with a population of myopic users through an online recommendation mechanism. This paradigm is common in a wide range of scenarios in the current Internet economy. For each user with her own private preference and belief, the platform commits to a recommendation strategy to utilize his information advantage on the product state to persuade the self-interested user to follow the recommendation. The platform does not know user's preferences and beliefs, and has to use an adaptive recommendation strategy to persuade with gradually learning user's preferences and beliefs in the process. We aim to design online learning policies with no Stackelberg regret for the platform, i.e., against the optimum policy in hindsight under the assumption that users will correspondingly adapt their behaviors to the benchmark policy. Our first result is an online policy that achieves double logarithm regret dependence on the number of rounds. We then present a hardness result showing that no adaptive online policy can achieve regret with better dependency on the number of rounds. Finally, by formulating the platform's problem as optimizing a linear program with membership oracle access, we present our second online policy that achieves regret with polynomial dependence on the number of states but logarithm dependence on the number of rounds.
    Multi-task Deep Learning for Cerebrovascular Disease Classification and MRI-to-PET Translation. (arXiv:2202.06142v1 [eess.IV])
    Accurate quantification of cerebral blood flow (CBF) is essential for the diagnosis and assessment of cerebrovascular diseases such as Moyamoya, carotid stenosis, aneurysms, and stroke. Positron emission tomography (PET) is currently regarded as the gold standard for the measurement of CBF in the human brain. PET imaging, however, is not widely available because of its prohibitive costs, use of ionizing radiation, and logistical challenges, which require a co-localized cyclotron to deliver the 2 min half-life Oxygen-15 radioisotope. Magnetic resonance imaging (MRI), in contrast, is more readily available and does not involve ionizing radiation. In this study, we propose a multi-task learning framework for brain MRI-to-PET translation and disease diagnosis. The proposed framework comprises two prime networks: (1) an attention-based 3D encoder-decoder convolutional neural network (CNN) that synthesizes high-quality PET CBF maps from multi-contrast MRI images, and (2) a multi-scale 3D CNN that identifies the brain disease corresponding to the input MRI images. Our multi-task framework yields promising results on the task of MRI-to-PET translation, achieving an average structural similarity index (SSIM) of 0.94 and peak signal-to-noise ratio (PSNR) of 38dB on a cohort of 120 subjects. In addition, we show that integrating multiple MRI modalities can improve the clinical diagnosis of brain diseases.
    Neural NID Rules. (arXiv:2202.06036v1 [cs.LG])
    Abstract object properties and their relations are deeply rooted in human common sense, allowing people to predict the dynamics of the world even in situations that are novel but governed by familiar laws of physics. Standard machine learning models in model-based reinforcement learning are inadequate to generalize in this way. Inspired by the classic framework of noisy indeterministic deictic (NID) rules, we introduce here Neural NID, a method that learns abstract object properties and relations between objects with a suitably regularized graph neural network. We validate the greater generalization capability of Neural NID on simple benchmarks specifically designed to assess the transition dynamics learned by the model.
    Neural Spectral Marked Point Processes. (arXiv:2106.10773v3 [cs.LG] UPDATED)
    Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. To date, most existing models assume stationary kernels (including the classical Hawkes processes) and simple parametric models. Modern applications with complex event data require more general point process models that can incorporate contextual information of the events, called marks, besides the temporal and location information. Moreover, such applications often require non-stationary models to capture more complex spatio-temporal dependence. To tackle these challenges, a key question is to devise a versatile influence kernel in the point process model. In this paper, we introduce a novel and general neural network-based non-stationary influence kernel with high expressiveness for handling complex discrete events data while providing theoretical performance guarantees. We demonstrate the superior performance of our proposed method compared with the state-of-the-art on synthetic and real data.
    Optimal Rates of (Locally) Differentially Private Heavy-tailed Multi-Armed Bandits. (arXiv:2106.02575v3 [cs.LG] UPDATED)
    In this paper we study the problem of stochastic multi-armed bandits (MAB) in the (local) differential privacy (DP/LDP) model. Unlike the previous results which need to assume bounded/sub-Gaussian reward distributions, here we focus on the case where each arm's reward distribution only has $(1+v)$-th moment with some $v\in (0, 1]$. In the first part, we study the problem in the central $\epsilon$-DP model. We first provide a near-optimal result by developing a private and robust Upper Confidence Bound (UCB) algorithm. Then, we improve the result via a private and robust version of the Successive Elimination (SE) algorithm. Finally, we show that the instance-dependent regret bound of our improved algorithm is optimal by showing its lower bound. In the second part, we study the problem in the $\epsilon$-LDP model. We propose an algorithm that can be seen as locally private and robust version of the SE algorithm, and we show it achieves (near) optimal rates for both instance-dependent and instance-independent regrets. Our results reveal differences between the problem of private MAB with bounded/sub-Gaussian rewards and heavy-tailed rewards. To achieve these (near) optimal rates, we develop several new hard instances and private robust estimators as byproducts, which might be used to other related problems. Finally, experiments also support our theoretical findings and show the effectiveness of our algorithms.
    Boosting Barely Robust Learners: A New Perspective on Adversarial Robustness. (arXiv:2202.05920v1 [cs.LG])
    We present an oracle-efficient algorithm for boosting the adversarial robustness of barely robust learners. Barely robust learning algorithms learn predictors that are adversarially robust only on a small fraction $\beta \ll 1$ of the data distribution. Our proposed notion of barely robust learning requires robustness with respect to a "larger" perturbation set; which we show is necessary for strongly robust learning, and that weaker relaxations are not sufficient for strongly robust learning. Our results reveal a qualitative and quantitative equivalence between two seemingly unrelated problems: strongly robust learning and barely robust learning.
    Deep Performer: Score-to-Audio Music Performance Synthesis. (arXiv:2202.06034v1 [cs.SD])
    Music performance synthesis aims to synthesize a musical score into a natural performance. In this paper, we borrow recent advances in text-to-speech synthesis and present the Deep Performer -- a novel system for score-to-audio music performance synthesis. Unlike speech, music often contains polyphony and long notes. Hence, we propose two new techniques for handling polyphonic inputs and providing a fine-grained conditioning in a transformer encoder-decoder model. To train our proposed system, we present a new violin dataset consisting of paired recordings and scores along with estimated alignments between them. We show that our proposed model can synthesize music with clear polyphony and harmonic structures. In a listening test, we achieve competitive quality against the baseline model, a conditional generative audio model, in terms of pitch accuracy, timbre and noise level. Moreover, our proposed model significantly outperforms the baseline on an existing piano dataset in overall quality.
    Formalization of a Stochastic Approximation Theorem. (arXiv:2202.05959v1 [cs.LO])
    Stochastic approximation algorithms are iterative procedures which are used to approximate a target value in an environment where the target is unknown and direct observations are corrupted by noise. These algorithms are useful, for instance, for root-finding and function minimization when the target function or model is not directly known. Originally introduced in a 1951 paper by Robbins and Monro, the field of Stochastic approximation has grown enormously and has come to influence application domains from adaptive signal processing to artificial intelligence. As an example, the Stochastic Gradient Descent algorithm which is ubiquitous in various subdomains of Machine Learning is based on stochastic approximation theory. In this paper, we give a formal proof (in the Coq proof assistant) of a general convergence theorem due to Aryeh Dvoretzky, which implies the convergence of important classical methods such as the Robbins-Monro and the Kiefer-Wolfowitz algorithms. In the process, we build a comprehensive Coq library of measure-theoretic probability theory and stochastic processes.
    Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. (arXiv:2202.06009v1 [cs.LG])
    1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD. Its benefits, however, remain an open question on Adam-based model training (e.g. BERT and GPT). In this paper, we propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel designs: (1) adaptive variance state freezing, which eliminates the requirement of running expensive full-precision communication at early stage of training; (2) 1-bit sync, which allows skipping communication rounds with bit-free synchronization over Adam's optimizer states, momentum and variance. In theory, we provide convergence analysis for 0/1 Adam on smooth non-convex objectives, and show the complexity bound is better than original Adam under certain conditions. On various benchmarks such as BERT-Base/Large pretraining and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam is able to reduce up to 90% of data volume, 54% of communication rounds, and achieve up to 2X higher throughput compared to the state-of-the-art 1-bit Adam while enjoying the same statistical convergence speed and end-to-end model accuracy on GLUE dataset and ImageNet validation set.
    Predicting Pollution Level Using Random Forest: A Case Study of Marilao River in Bulacan Province, Philippines. (arXiv:2202.06066v1 [cs.LG])
    This study aims to predict the pollution level that threatens the Marilao River, located in the province of Bulacan, Philippines. The inhabitants of this area are now being exposed to pollution. Contamination of this waterway comes from both formal and informal industries, such as a used lead-acid battery, open dumpsites metal refining, and other toxic metals. Using various water quality parameters like Dissolved Oxygen (DO), Potential of Hydrogen (pH), Biochemical Oxygen Demand (BOD) and Total Suspended Solids (TSS) were the basis for predicting the pollution level. This study used the Data Mining technique based on the sample data collected from January of 2013 to November of 2017. These were used as a training data and test results to predict the river condition with its corresponding pollution level classification indicated with the used of colors such as Green for Normal, Yellow for Average, Orange for Polluted and Red for Highly Polluted. The model got an accuracy of 91.75% with a Kappa value of 0.8115, interpreted as Strong in terms of the level of agreement.
    Fast Model-based Policy Search for Universal Policy Networks. (arXiv:2202.05843v1 [cs.LG])
    Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics based reinforcement learning. Although recent approaches such as universal policy networks partially address this issue by enabling the storage of multiple policies trained in simulation on a wide range of dynamic/latent factors, efficiently identifying the most appropriate policy for a given environment remains a challenge. In this work, we propose a Gaussian Process-based prior learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment. We integrate this prior with a Bayesian Optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network. We empirically evaluate our approach in a range of continuous and discrete control environments, and show that it outperforms other competing baselines.
    Identification of Flux Rope Orientation via Neural Networks. (arXiv:2202.05901v1 [physics.space-ph])
    Geomagnetic disturbance forecasting is based on the identification of solar wind structures and accurate determination of their magnetic field orientation. For nowcasting activities, this is currently a tedious and manual process. Focusing on the main driver of geomagnetic disturbances, the twisted internal magnetic field of interplanetary coronal mass ejections (ICMEs), we explore a convolutional neural network's (CNN) ability to predict the embedded magnetic flux rope's orientation once it has been identified from in situ solar wind observations. Our work uses CNNs trained with magnetic field vectors from analytical flux rope data. The simulated flux ropes span many possible spacecraft trajectories and flux rope orientations. We train CNNs first with full duration flux ropes and then again with partial duration flux ropes. The former provides us with a baseline of how well CNNs can predict flux rope orientation while the latter provides insights into real-time forecasting by exploring how accuracy is affected by percentage of flux rope observed. The process of casting the physics problem as a machine learning problem is discussed as well as the impacts of different factors on prediction accuracy such as flux rope fluctuations and different neural network topologies. Finally, results from evaluating the trained network against observed ICMEs from Wind during 1995-2015 are presented.
    RSINet: Inpainting Remotely Sensed Images Using Triple GAN Framework. (arXiv:2202.05988v1 [cs.CV])
    We tackle the problem of image inpainting in the remote sensing domain. Remote sensing images possess high resolution and geographical variations, that render the conventional inpainting methods less effective. This further entails the requirement of models with high complexity to sufficiently capture the spectral, spatial and textural nuances within an image, emerging from its high spatial variability. To this end, we propose a novel inpainting method that individually focuses on each aspect of an image such as edges, colour and texture using a task specific GAN. Moreover, each individual GAN also incorporates the attention mechanism that explicitly extracts the spectral and spatial features. To ensure consistent gradient flow, the model uses residual learning paradigm, thus simultaneously working with high and low level features. We evaluate our model, alongwith previous state of the art models, on the two well known remote sensing datasets, Open Cities AI and Earth on Canvas, and achieve competitive performance.
    EREBA: Black-box Energy Testing of Adaptive Neural Networks. (arXiv:2202.06084v1 [cs.LG])
    Recently, various Deep Neural Network (DNN) models have been proposed for environments like embedded systems with stringent energy constraints. The fundamental problem of determining the robustness of a DNN with respect to its energy consumption (energy robustness) is relatively unexplored compared to accuracy-based robustness. This work investigates the energy robustness of Adaptive Neural Networks (AdNNs), a type of energy-saving DNNs proposed for many energy-sensitive domains and have recently gained traction. We propose EREBA, the first black-box testing method for determining the energy robustness of an AdNN. EREBA explores and infers the relationship between inputs and the energy consumption of AdNNs to generate energy surging samples. Extensive implementation and evaluation using three state-of-the-art AdNNs demonstrate that test inputs generated by EREBA could degrade the performance of the system substantially. The test inputs generated by EREBA can increase the energy consumption of AdNNs by 2,000% compared to the original inputs. Our results also show that test inputs generated via EREBA are valuable in detecting energy surging inputs.
    Learning complex dependency structure of gene regulatory networks from high dimensional micro-array data with Gaussian Bayesian networks. (arXiv:2106.15365v2 [q-bio.MN] UPDATED)
    Gene expression datasets consist of thousand of genes with relatively small samplesizes (i.e. are large-$p$-small-$n$). Moreover, dependencies of various orders co-exist in the datasets. In the Undirected probabilistic Graphical Model (UGM) framework the Glasso algorithm has been proposed to deal with high dimensional micro-array datasets forcing sparsity. Also, modifications of the default Glasso algorithm are developed to overcome the problem of complex interaction structure. In this work we advocate the use of a simple score-based Hill Climbing algorithm (HC) that learns Gaussian Bayesian Networks (BNs) leaning on Directed Acyclic Graphs (DAGs). We compare HC with Glasso and its modifications in the UGM framework on their capability to reconstruct GRNs from micro-array data belonging to the Escherichia Coli genome. We benefit from the analytical properties of the Joint Probability Density (JPD) function on which both directed and undirected PGMs build to convert DAGs to UGMs. We conclude that dependencies in complex data are learned best by the HC algorithm, presenting them most accurately and efficiently, simultaneously modelling strong local and weaker but significant global connections coexisting in the gene expression dataset. The HC algorithm adapts intrinsically to the complex dependency structure of the dataset, without forcing a specific structure in advance. On the contrary, Glasso and modifications model unnecessary dependencies at the expense of the probabilistic information in the network and of a structural bias in the JPD function that can only be relieved including many parameters.
    Asteroid Flyby Cycler Trajectory Design Using Deep Neural Networks. (arXiv:2111.11858v2 [astro-ph.IM] UPDATED)
    Asteroid exploration has been attracting more attention in recent years. Nevertheless, we have just visited tens of asteroids while we have discovered more than one million bodies. As our current observation and knowledge should be biased, it is essential to explore multiple asteroids directly to better understand the remains of planetary building materials. One of the mission design solutions is utilizing asteroid flyby cycler trajectories with multiple Earth gravity assists. An asteroid flyby cycler trajectory design problem is a subclass of global trajectory optimization problems with multiple flybys, involving a trajectory optimization problem for a given flyby sequence and a combinatorial optimization problem to decide the sequence of the flybys. As the number of flyby bodies grows, the computation time of this optimization problem expands maliciously. This paper presents a new method to design asteroid flyby cycler trajectories utilizing a surrogate model constructed by deep neural networks approximating trajectory optimization results. Since one of the bottlenecks of machine learning approaches is the computation time to generate massive trajectory databases, we propose an efficient database generation strategy by introducing pseudo-asteroids satisfying the Karush-Kuhn-Tucker conditions. The numerical result applied to JAXA's DESTINY+ mission shows that the proposed method is practically applicable to space mission design and can significantly reduce the computational time for searching asteroid flyby sequences.
    Improving Image-recognition Edge Caches with a Generative Adversarial Network. (arXiv:2202.05929v1 [cs.NI])
    Image recognition is an essential task in several mobile applications. For instance, a smartphone can process a landmark photo to gather more information about its location. If the device does not have enough computational resources available, it offloads the processing task to a cloud infrastructure. Although this approach solves resource shortages, it introduces a communication delay. Image-recognition caches on the Internet's edge can mitigate this problem. These caches run on servers close to mobile devices and stores information about previously recognized images. If the server receives a request with a photo stored in its cache, it replies to the device, avoiding cloud offloading. The main challenge for this cache is to verify if the received image matches a stored one. Furthermore, for outdoor photos, it is difficult to compare them if one was taken in the daytime and the other at nighttime. In that case, the cache might wrongly infer that they refer to different places, offloading the processing to the cloud. This work shows that a well-known generative adversarial network, called ToDayGAN, can solve this problem by generating daytime images using nighttime ones. We can thus use this translation to populate a cache with synthetic photos that can help image matching. We show that our solution reduces cloud offloading and, therefore, the application's latency.
    Multi-level Latent Space Structuring for Generative Control. (arXiv:2202.05910v1 [cs.CV])
    Truncation is widely used in generative models for improving the quality of the generated samples, at the expense of reducing their diversity. We propose to leverage the StyleGAN generative architecture to devise a new truncation technique, based on a decomposition of the latent space into clusters, enabling customized truncation to be performed at multiple semantic levels. We do so by learning to re-generate W-space, the extended intermediate latent space of StyleGAN, using a learnable mixture of Gaussians, while simultaneously training a classifier to identify, for each latent vector, the cluster that it belongs to. The resulting truncation scheme is more faithful to the original untruncated samples and allows a better trade-off between quality and diversity. We compare our method to other truncation approaches for StyleGAN, both qualitatively and quantitatively.
    What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks?. (arXiv:2202.05998v1 [cs.LG])
    Self-supervised learning establishes a new paradigm of learning representations with much fewer or even no label annotations. Recently there has been remarkable progress on large-scale contrastive learning models which require substantial computing resources, yet such models are not practically optimal for small-scale tasks. To fill the gap, we aim to study contrastive learning on the wearable-based activity recognition task. Specifically, we conduct an in-depth study of contrastive learning from both algorithmic-level and task-level perspectives. For algorithmic-level analysis, we decompose contrastive models into several key components and conduct rigorous experimental evaluations to better understand the efficacy and rationale behind contrastive learning. More importantly, for task-level analysis, we show that the wearable-based signals bring unique challenges and opportunities to existing contrastive models, which cannot be readily solved by existing algorithms. Our thorough empirical studies suggest important practices and shed light on future research challenges. In the meantime, this paper presents an open-source PyTorch library \texttt{CL-HAR}, which can serve as a practical tool for researchers. The library is highly modularized and easy to use, which opens up avenues for exploring novel contrastive models quickly in the future.
    Learning by Doing: Controlling a Dynamical System using Causality, Control, and Reinforcement Learning. (arXiv:2202.06052v1 [cs.LG])
    Questions in causality, control, and reinforcement learning go beyond the classical machine learning task of prediction under i.i.d. observations. Instead, these fields consider the problem of learning how to actively perturb a system to achieve a certain effect on a response variable. Arguably, they have complementary views on the problem: In control, one usually aims to first identify the system by excitation strategies to then apply model-based design techniques to control the system. In (non-model-based) reinforcement learning, one directly optimizes a reward. In causality, one focus is on identifiability of causal structure. We believe that combining the different views might create synergies and this competition is meant as a first step toward such synergies. The participants had access to observational and (offline) interventional data generated by dynamical systems. Track CHEM considers an open-loop problem in which a single impulse at the beginning of the dynamics can be set, while Track ROBO considers a closed-loop problem in which control variables can be set at each time step. The goal in both tracks is to infer controls that drive the system to a desired state. Code is open-sourced ( https://github.com/LearningByDoingCompetition/learningbydoing-comp ) to reproduce the winning solutions of the competition and to facilitate trying out new methods on the competition tasks.
    Uncertainty Aware System Identification with Universal Policies. (arXiv:2202.05844v1 [cs.LG])
    Sim2real transfer is primarily concerned with transferring policies trained in simulation to potentially noisy real world environments. A common problem associated with sim2real transfer is estimating the real-world environmental parameters to ground the simulated environment to. Although existing methods such as Domain Randomisation (DR) can produce robust policies by sampling from a distribution of parameters during training, there is no established method for identifying the parameters of the corresponding distribution for a given real-world setting. In this work, we propose Uncertainty-aware policy search (UncAPS), where we use Universal Policy Network (UPN) to store simulation-trained task-specific policies across the full range of environmental parameters and then subsequently employ robust Bayesian optimisation to craft robust policies for the given environment by combining relevant UPN policies in a DR like fashion. Such policy-driven grounding is expected to be more efficient as it estimates only task-relevant sets of parameters. Further, we also account for the estimation uncertainties in the search process to produce policies that are robust against both aleatoric and epistemic uncertainties. We empirically evaluate our approach in a range of noisy, continuous control environments, and show its improved performance compared to competing baselines.
    Reverse Back Propagation to Make Full Use of Derivative. (arXiv:2202.06316v1 [cs.LG])
    The development of the back-propagation algorithm represents a landmark in neural networks. We provide an approach that conducts the back-propagation again to reverse the traditional back-propagation process to optimize the input loss at the input end of a neural network for better effects without extra costs during the inference time. Then we further analyzed its principles and advantages and disadvantages, reformulated the weight initialization strategy for our method. And experiments on MNIST, CIFAR10, and CIFAR100 convinced our approaches could adapt to a larger range of learning rate and learn better than vanilla back-propagation.
    Text and Image Guided 3D Avatar Generation and Manipulation. (arXiv:2202.06079v1 [cs.CV])
    The manipulation of latent space has recently become an interesting topic in the field of generative models. Recent research shows that latent directions can be used to manipulate images towards certain attributes. However, controlling the generation process of 3D generative models remains a challenge. In this work, we propose a novel 3D manipulation method that can manipulate both the shape and texture of the model using text or image-based prompts such as 'a young face' or 'a surprised face'. We leverage the power of Contrastive Language-Image Pre-training (CLIP) model and a pre-trained 3D GAN model designed to generate face avatars, and create a fully differentiable rendering pipeline to manipulate meshes. More specifically, our method takes an input latent code and modifies it such that the target attribute specified by a text or image prompt is present or enhanced, while leaving other attributes largely unaffected. Our method requires only 5 minutes per manipulation, and we demonstrate the effectiveness of our approach with extensive results and comparisons.
    The Curse Revisited: When are Distances Informative for the Ground Truth in Noisy High-Dimensional Data?. (arXiv:2109.10569v4 [cs.LG] UPDATED)
    Distances between data points are widely used in machine learning applications. Yet, when corrupted by noise, these distances -- and thus the models based upon them -- may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may then accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we exactly characterize such effects in noisy high-dimensional data using an asymptotic probabilistic expression. Previously, it has been argued that neighborhood queries become meaningless and unstable when distance concentration occurs, which means that there is a poor relative discrimination between the furthest and closest neighbors in the data. However, we conclude that this is not necessarily the case when we decompose the data in a ground truth -- which we aim to recover -- and noise component. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be truthful even when distance concentration occurs. We also include thorough empirical verification of our results, as well as interesting experiments in which our derived `phase shift' where neighbors become random or not turns out to be identical to the phase shift where common dimensionality reduction methods perform poorly or well for recovering low-dimensional reconstructions of high-dimensional data with dense noise.
    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. (arXiv:2201.03544v2 [cs.LG] UPDATED)
    Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
    Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD for Communication Efficient Nonconvex Distributed Learning. (arXiv:2202.06083v1 [cs.LG])
    In recent centralized nonconvex distributed learning and federated learning, local methods are one of the promising approaches to reduce communication time. However, existing work has mainly focused on studying first-order optimality guarantees. On the other side, second-order optimality guaranteed algorithms have been extensively studied in the non-distributed optimization literature. In this paper, we study a new local algorithm called Bias-Variance Reduced Local Perturbed SGD (BVR-L-PSGD), that combines the existing bias-variance reduced gradient estimator with parameter perturbation to find second-order optimal points in centralized nonconvex distributed optimization. BVR-L-PSGD enjoys second-order optimality with nearly the same communication complexity as the best known one of BVR-L-SGD to find first-order optimality. Particularly, the communication complexity is better than non-local methods when the local datasets heterogeneity is smaller than the smoothness of the local loss. In an extreme case, the communication complexity approaches to $\widetilde \Theta(1)$ when the local datasets heterogeneity goes to zero.
    Action and Perception as Divergence Minimization. (arXiv:2009.01791v3 [cs.AI] UPDATED)
    To learn directed behaviors in complex environments, intelligent agents need to optimize objective functions. Various objectives are known for designing artificial agents, including task rewards and intrinsic motivation. However, it is unclear how the known objectives relate to each other, which objectives remain yet to be discovered, and which objectives better describe the behavior of humans. We introduce the Action Perception Divergence (APD), an approach for categorizing the space of possible objective functions for embodied agents. We show a spectrum that reaches from narrow to general objectives. While the narrow objectives correspond to domain-specific rewards as typical in reinforcement learning, the general objectives maximize information with the environment through latent variable models of input sequences. Intuitively, these agents use perception to align their beliefs with the world and use actions to align the world with their beliefs. They infer representations that are informative of past inputs, explore future inputs that are informative of their representations, and select actions or skills that maximally influence future inputs. This explains a wide range of unsupervised objectives from a single principle, including representation learning, information gain, empowerment, and skill discovery. Our findings suggest leveraging powerful world models for unsupervised exploration as a path toward highly adaptive agents that seek out large niches in their environments, rendering task rewards optional.
    Common Language for Goal-Oriented Semantic Communications: A Curriculum Learning Framework. (arXiv:2111.08051v2 [cs.NI] UPDATED)
    Semantic communications will play a critical role in enabling goal-oriented services over next-generation wireless systems. However, most prior art in this domain is restricted to specific applications (e.g., text or image), and it does not enable goal-oriented communications in which the effectiveness of the transmitted information must be considered along with the semantics so as to execute a certain task. In this paper, a comprehensive semantic communications framework is proposed for enabling goal-oriented task execution. To capture the semantics between a speaker and a listener, a common language is defined using the concept of beliefs to enable the speaker to describe the environment observations to the listener. Then, an optimization problem is posed to choose the minimum set of beliefs that perfectly describes the observation while minimizing the task execution time and transmission cost. A novel top-down framework that combines curriculum learning (CL) and reinforcement learning (RL) is proposed to solve this problem. Simulation results show that the proposed CL method outperforms traditional RL in terms of convergence time, task execution time, and transmission cost during training.
    Towards Real-time and Light-weight Line Segment Detection. (arXiv:2106.00186v2 [cs.CV] UPDATED)
    Previous deep learning-based line segment detection (LSD) suffers from the immense model size and high computational cost for line prediction. This constrains them from real-time inference on computationally restricted environments. In this paper, we propose a real-time and light-weight line segment detector for resource-constrained environments named Mobile LSD (M-LSD). We design an extremely efficient LSD architecture by minimizing the backbone network and removing the typical multi-module process for line prediction found in previous methods. To maintain competitive performance with a light-weight network, we present novel training schemes: Segments of Line segment (SoL) augmentation, matching and geometric loss. SoL augmentation splits a line segment into multiple subparts, which are used to provide auxiliary line data during the training process. Moreover, the matching and geometric loss allow a model to capture additional geometric cues. Compared with TP-LSD-Lite, previously the best real-time LSD method, our model (M-LSD-tiny) achieves competitive performance with 2.5% of model size and an increase of 130.5% in inference speed on GPU. Furthermore, our model runs at 56.8 FPS and 48.6 FPS on the latest Android and iPhone mobile devices, respectively. To the best of our knowledge, this is the first real-time deep LSD available on mobile devices. Our code is available.
    Fairness Through Counterfactual Utilities. (arXiv:2108.05315v2 [cs.LG] UPDATED)
    Group fairness definitions such as Demographic Parity and Equal Opportunity make assumptions about the underlying decision-problem that restrict them to classification problems. Prior work has translated these definitions to other machine learning environments, such as unsupervised learning and reinforcement learning, by implementing their closest mathematical equivalent. As a result, there are numerous bespoke interpretations of these definitions. Instead, we provide a generalized set of group fairness definitions that unambiguously extend to all machine learning environments while still retaining their original fairness notions. We derive two fairness principles that enable such a generalized framework. First, our framework measures outcomes in terms of utilities, rather than predictions, and does so for both the decision-algorithm and the individual. Second, our framework considers counterfactual outcomes, rather than just observed outcomes, thus preventing loopholes where fairness criteria are satisfied through self-fulfilling prophecies. We provide concrete examples of how our counterfactual utility fairness framework resolves known fairness issues in classification, clustering, and reinforcement learning problems. We also show that many of the bespoke interpretations of Demographic Parity and Equal Opportunity fit nicely as special cases of our framework.
    Online Hyperparameter Meta-Learning with Hypergradient Distillation. (arXiv:2110.02508v2 [cs.LG] UPDATED)
    Many gradient-based meta-learning methods assume a set of parameters that do not participate in inner-optimization, which can be considered as hyperparameters. Although such hyperparameters can be optimized using the existing gradient-based hyperparameter optimization (HO) methods, they suffer from the following issues. Unrolled differentiation methods do not scale well to high-dimensional hyperparameters or horizon length, Implicit Function Theorem (IFT) based methods are restrictive for online optimization, and short horizon approximations suffer from short horizon bias. In this work, we propose a novel HO method that can overcome these limitations, by approximating the second-order term with knowledge distillation. Specifically, we parameterize a single Jacobian-vector product (JVP) for each HO step and minimize the distance from the true second-order term. Our method allows online optimization and also is scalable to the hyperparameter dimension and the horizon length. We demonstrate the effectiveness of our method on two different meta-learning methods and three benchmark datasets.
    Sm{\aa}prat: DialoGPT for Natural Language Generation of Swedish Dialogue by Transfer Learning. (arXiv:2110.06273v2 [cs.CL] UPDATED)
    Building open-domain conversational systems (or chatbots) that produce convincing responses is a recognized challenge. Recent state-of-the-art (SoTA) transformer-based models for the generation of natural language dialogue have demonstrated impressive performance in simulating human-like, single-turn conversations in English. This work investigates, by an empirical study, the potential for transfer learning of such models to Swedish language. DialoGPT, an English language pre-trained model, is adapted by training on three different Swedish language conversational datasets obtained from publicly available sources. Perplexity score (an automated intrinsic language model metric) and surveys by human evaluation were used to assess the performances of the fine-tuned models, with results that indicate that the capacity for transfer learning can be exploited with considerable success. Human evaluators asked to score the simulated dialogue judged over 57% of the chatbot responses to be human-like for the model trained on the largest (Swedish) dataset. We provide the demos and model checkpoints of our English and Swedish chatbots on the HuggingFace platform for public use.
    Accelerating Stochastic Simulation with Spatiotemporal Neural Processes. (arXiv:2106.02770v4 [cs.LG] UPDATED)
    Stochastic simulations such as large-scale, spatiotemporal, age-structured epidemic models are computationally expensive at fine-grained resolution. We propose Spatiotemporal Neural Processes (STNP), a neural latent variable model to mimic the spatiotemporal dynamics of stochastic simulators. To further speed up training, we use a Bayesian active learning strategy to proactively query the simulator, gather more data, and continuously improve the model. Our model can automatically infer the latent processes which describe the intrinsic uncertainty of the simulator. This also gives rise to a new acquisition function based on latent information gain. Theoretical analysis demonstrates that our approach reduces sample complexity compared with random sampling in high dimension. Empirically, we demonstrate that our framework can faithfully imitate the behavior of a complex infectious disease simulator with a small number of examples, enabling rapid simulation and scenario exploration.
    Areas on the space of smooth probability density functions on $S^2$. (arXiv:2110.07773v2 [math.SG] UPDATED)
    We present symbolic and numerical methods for computing Poisson brackets on the spaces of measures with positive densities of the plane, the 2-torus, and the 2-sphere. We apply our methods to compute symplectic areas of finite regions for the case of the 2-sphere, including an explicit example for Gaussian measures with positive densities.
    ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain. (arXiv:2106.04727v3 [cs.DS] UPDATED)
    This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused. Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8--110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75--54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
    Invariant Risk Minimisation for Cross-Organism Inference: Substituting Mouse Data for Human Data in Human Risk Factor Discovery. (arXiv:2111.07348v2 [cs.LG] UPDATED)
    Human medical data can be challenging to obtain due to data privacy concerns, difficulties conducting certain types of experiments, or prohibitive associated costs. In many settings, data from animal models or in-vitro cell lines are available to help augment our understanding of human data. However, this data is known for having low etiological validity in comparison to human data. In this work, we augment small human medical datasets with in-vitro data and animal models. We use Invariant Risk Minimisation (IRM) to elucidate invariant features by considering cross-organism data as belonging to different data-generating environments. Our models identify genes of relevance to human cancer development. We observe a degree of consistency between varying the amounts of human and mouse data used, however, further work is required to obtain conclusive insights. As a secondary contribution, we enhance existing open source datasets and provide two uniformly processed, cross-organism, homologue gene-matched datasets to the community.
    SDE approximations of GANs training and its long-run behavior. (arXiv:2006.02047v5 [cs.LG] UPDATED)
    This paper analyzes the training process of GANs via stochastic differential equations (SDEs). It first establishes SDE approximations for the training of GANs under stochastic gradient algorithms, with precise error bound analysis. It then describes the long-run behavior of GANs training via the invariant measures of its SDE approximations under proper conditions. This work builds theoretical foundation for GANs training and provides analytical tools to study its evolution and stability.
    Applying Differential Privacy to Tensor Completion. (arXiv:2110.00539v4 [cs.LG] UPDATED)
    Tensor completion aims at filling the missing or unobserved entries based on partially observed tensors. However, utilization of the observed tensors often raises serious privacy concerns in many practical scenarios. To address this issue, we propose a solid and unified framework that contains several approaches for applying differential privacy to the two most widely used tensor decomposition methods: i) CANDECOMP/PARAFAC~(CP) and ii) Tucker decompositions. For each approach, we establish a rigorous privacy guarantee and meanwhile evaluate the privacy-accuracy trade-off. Experiments on synthetic and real-world datasets demonstrate that our proposal achieves high accuracy for tensor completion while ensuring strong privacy protections.
    On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game. (arXiv:2110.09771v2 [cs.LG] UPDATED)
    To achieve sample efficiency in reinforcement learning (RL), it necessitates efficiently exploring the underlying environment. Under the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by such a challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes the policy via a planning algorithm with offline data collected in the exploration phase. Moreover, we tackle this problem under the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Moreover, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games and prove that our methods can achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for generating a $\varepsilon$-suboptimal policy or $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
    Approximate Inference via Clustering. (arXiv:2111.14219v2 [cs.LG] UPDATED)
    In recent years, large-scale Bayesian learning draws a great deal of attention. However, in big-data era, the amount of data we face is growing much faster than our ability to deal with it. Fortunately, it is observed that large-scale datasets usually own rich internal structure and is somewhat redundant. In this paper, we attempt to simplify the Bayesian posterior via exploiting this structure. Specifically, we restrict our interest to the so-called well-clustered datasets and construct an \emph{approximate posterior} according to the clustering information. Fortunately, the clustering structure can be efficiently obtained via a particular clustering algorithm. When constructing the approximate posterior, the data points in the same cluster are all replaced by the centroid of the cluster. As a result, the posterior can be significantly simplified. Theoretically, we show that under certain conditions the approximate posterior we construct is close (measured by KL divergence) to the exact posterior. Furthermore, thorough experiments are conducted to validate the fact that the constructed posterior is a good approximation to the true posterior and much easier to sample from.
    Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments. (arXiv:2202.06387v1 [cs.CL])
    Neural scaling laws define a predictable relationship between a model's parameter count and its performance after training in the form of a power law. However, most research to date has not explicitly investigated whether scaling laws can be used to accelerate model development. In this work, we perform such an empirical investigation across a wide range of language understanding tasks, starting from models with as few as 10K parameters, and evaluate downstream performance across 9 language understanding tasks. We find that scaling laws emerge at finetuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models. Moreover, for tasks where scaling laws exist, they can be used to predict the performance of larger models, which enables effective model selection. However, revealing scaling laws requires careful hyperparameter tuning and multiple runs for the purpose of uncertainty estimation, which incurs additional overhead, partially offsetting the computational benefits.
    Beyond NaN: Resiliency of Optimization Layers in The Face of Infeasibility. (arXiv:2202.06242v1 [cs.LG])
    Prior work has successfully incorporated optimization layers as the last layer in neural networks for various problems, thereby allowing joint learning and planning in one neural network forward pass. In this work, we identify a weakness in such a set-up where inputs to the optimization layer lead to undefined output of the neural network. Such undefined decision outputs can lead to possible catastrophic outcomes in critical real time applications. We show that an adversary can cause such failures by forcing rank deficiency on the matrix fed to the optimization layer which results in the optimization failing to produce a solution. We provide a defense for the failure cases by controlling the condition number of the input matrix. We study the problem in the settings of synthetic data, Jigsaw Sudoku, and in speed planning for autonomous driving, building on top of prior frameworks in end-to-end learning and optimization. We show that our proposed defense effectively prevents the framework from failing with undefined output. Finally, we surface a number of edge cases which lead to serious bugs in popular equation and optimization solvers which can be abused as well.
    Learning Asymmetric Embedding for Attributed Networks via Convolutional Neural Network. (arXiv:2202.06307v1 [cs.LG])
    Recently network embedding has gained increasing attention due to its advantages in facilitating network computation tasks such as link prediction, node classification and node clustering. The objective of network embedding is to represent network nodes in a low-dimensional vector space while retaining as much information as possible from the original network including structural, relational, and semantic information. However, asymmetric nature of directed networks poses many challenges as how to best preserve edge directions in the embedding process. Here, we propose a novel deep asymmetric attributed network embedding model based on convolutional graph neural network, called AAGCN. The main idea is to maximally preserve the asymmetric proximity and asymmetric similarity of directed attributed networks. AAGCN introduces two neighbourhood feature aggregation schemes to separately aggregate the features of a node with the features of its in- and out- neighbours. Then, it learns two embedding vectors for each node, one source embedding vector and one target embedding vector. The final representations are the results of concatenating source and target embedding vectors. We test the performance of AAGCN on three real-world networks for network reconstruction, link prediction, node classification and visualization tasks. The experimental results show the superiority of AAGCN against state-of-the-art embedding methods.
    Surgical Scheduling via Optimization and Machine Learning with Long-Tailed Data. (arXiv:2202.06383v1 [cs.LG])
    Using data from cardiovascular surgery patients with long and highly variable post-surgical lengths of stay (LOS), we develop a model to reduce recovery unit congestion. We estimate LOS using a variety of machine learning models, schedule procedures with a variety of online optimization models, and estimate performance with simulation. The machine learning models achieved only modest LOS prediction accuracy, despite access to a very rich set of patient characteristics. Compared to the current paper-based system used in the hospital, most optimization models failed to reduce congestion without increasing wait times for surgery. A conservative stochastic optimization with sufficient sampling to capture the long tail of the LOS distribution outperformed the current manual process. These results highlight the perils of using oversimplified distributional models of patient length of stay for scheduling procedures and the importance of using stochastic optimization well-suited to dealing with long-tailed behavior.
    Socially-Compatible Behavior Design of Autonomous Vehicles with Verification on Real Human Data. (arXiv:2010.14712v7 [cs.RO] UPDATED)
    As more and more autonomous vehicles (AVs) are being deployed on public roads, designing socially compatible behaviors for them is becoming increasingly important. In order to generate safe and efficient actions, AVs need to not only predict the future behaviors of other traffic participants, but also be aware of the uncertainties associated with such behavior prediction. In this paper, we propose an uncertain-aware integrated prediction and planning (UAPP) framework. It allows the AVs to infer the characteristics of other road users online and generate behaviors optimizing not only their own rewards, but also their courtesy to others, and their confidence regarding the prediction uncertainties. We first propose the definitions for courtesy and confidence. Based on that, their influences on the behaviors of AVs in interactive driving scenarios are explored. Moreover, we evaluate the proposed algorithm on naturalistic human driving data by comparing the generated behavior against ground truth. Results show that the online inference can significantly improve the human-likeness of the generated behaviors. Furthermore, we find that human drivers show great courtesy to others, even for those without right-of-way. We also find that such driving preferences vary significantly in different cultures.
    Border basis computation with gradient-weighted normalization. (arXiv:2101.00401v3 [cs.SC] UPDATED)
    Normalization of polynomials plays a vital role in the approximate basis computation of vanishing ideals. Coefficient normalization, which normalizes a polynomial with its coefficient norm, is the most common method in computer algebra. This study proposes the gradient-weighted normalization method for the approximate border basis computation of vanishing ideals, inspired by recent developments in machine learning. The data-dependent nature of gradient-weighted normalization leads to better stability against perturbation and consistency in the scaling of input points, which cannot be attained by coefficient normalization. Only a subtle change is needed to introduce gradient normalization in the existing algorithms with coefficient normalization. The analysis of algorithms still works with a small modification, and the time complexity of algorithms remains unchanged. We also prove that, with coefficient normalization, which does not provide the scaling consistency property, scaling of points (e.g., as a preprocessing) can cause an approximate basis computation to fail. This study is the first to theoretically highlight the crucial effect of scaling in approximate basis computation and presents the utility of data-dependent normalization.
    Unsupervised Disentanglement with Tensor Product Representations on the Torus. (arXiv:2202.06201v1 [cs.LG])
    The current methods for learning representations with auto-encoders almost exclusively employ vectors as the latent representations. In this work, we propose to employ a tensor product structure for this purpose. This way, the obtained representations are naturally disentangled. In contrast to the conventional variations methods, which are targeted toward normally distributed features, the latent space in our representation is distributed uniformly over a set of unit circles. We argue that the torus structure of the latent space captures the generative factors effectively. We employ recent tools for measuring unsupervised disentanglement, and in an extensive set of experiments demonstrate the advantage of our method in terms of disentanglement, completeness, and informativeness. The code for our proposed method is available at https://github.com/rotmanmi/Unsupervised-Disentanglement-Torus.
    A Data Augmentation Method for Fully Automatic Brain Tumor Segmentation. (arXiv:2202.06344v1 [eess.IV])
    Automatic segmentation of glioma and its subregions is of great significance for diagnosis, treatment and monitoring of disease. In this paper, an augmentation method, called TensorMixup, was proposed and applied to the three dimensional U-Net architecture for brain tumor segmentation. The main ideas included that first, two image patches with size of 128 in three dimensions were selected according to glioma information of ground truth labels from the magnetic resonance imaging data of any two patients with the same modality. Next, a tensor in which all elements were independently sampled from Beta distribution was used to mix the image patches. Then the tensor was mapped to a matrix which was used to mix the one-hot encoded labels of the above image patches. Therefore, a new image and its one-hot encoded label were synthesized. Finally, the new data was used to train the model which could be used to segment glioma. The experimental results show that the mean accuracy of Dice scores are 91.32%, 85.67%, and 82.20% respectively on the whole tumor, tumor core, and enhancing tumor segmentation, which proves that the proposed TensorMixup is feasible and effective for brain tumor segmentation.
    An AI-based Domain-Decomposition Non-Intrusive Reduced-Order Model for Extended Domains applied to Multiphase Flow in Pipes. (arXiv:2202.06170v1 [physics.flu-dyn])
    The modelling of multiphase flow in a pipe presents a significant challenge for high-resolution computational fluid dynamics (CFD) models due to the high aspect ratio (length over diameter) of the domain. In subsea applications, the pipe length can be several hundreds of kilometres versus a pipe diameter of just a few inches. In this paper, we present a new AI-based non-intrusive reduced-order model within a domain decomposition framework (AI-DDNIROM) which is capable of making predictions for domains significantly larger than the domain used in training. This is achieved by using domain decomposition; dimensionality reduction; training a neural network to make predictions for a single subdomain; and by using an iteration-by-subdomain technique to converge the solution over the whole domain. To find the low-dimensional space, we explore several types of autoencoder networks, known for their ability to compress information accurately and compactly. The performance of the autoencoders is assessed on two advection-dominated problems: flow past a cylinder and slug flow in a pipe. To make predictions in time, we exploit an adversarial network which aims to learn the distribution of the training data, in addition to learning the mapping between particular inputs and outputs. This type of network has shown the potential to produce realistic outputs. The whole framework is applied to multiphase slug flow in a horizontal pipe for which an AI-DDNIROM is trained on high-fidelity CFD simulations of a pipe of length 10 m with an aspect ratio of 13:1, and tested by simulating the flow for a pipe of length 98 m with an aspect ratio of almost 130:1. Statistics of the flows obtained from the CFD simulations are compared to those of the AI-DDNIROM predictions to demonstrate the success of our approach.
    TATTOOED: A Robust Deep Neural Network Watermarking Scheme based on Spread-Spectrum Channel Coding. (arXiv:2202.06091v1 [cs.CR])
    The proliferation of deep learning applications in several areas has led to the rapid adoption of such solutions from an ever-growing number of institutions and companies. The deep neural network (DNN) models developed by these entities are often trained on proprietary data. They require powerful computational resources, with the resulting DNN models being incorporated in the company's work pipeline or provided as a service. Being trained on proprietary information, these models provide a competitive edge for the owner company. At the same time, these models can be attractive to competitors (or malicious entities), which can employ state-of-the-art security attacks to steal and use these models for their benefit. As these attacks are hard to prevent, it becomes imperative to have mechanisms that enable an affected entity to verify the ownership of a DNN with high confidence. This paper presents TATTOOED, a robust and efficient DNN watermarking technique based on spread-spectrum channel coding. TATTOOED has a negligible effect on the performance of the DNN model and requires as little as one iteration to watermark a DNN model. We extensively evaluate TATTOOED against several state-of-the-art mechanisms used to remove watermarks from DNNs. Our results show that TATTOOED is robust to such removal techniques even in extreme scenarios. For example, if the removal techniques such as fine-tuning and parameter pruning change as much as 99\% of the model parameters, the TATTOOED watermark is still present in full in the DNN model, and ensures ownership verification.
    Compute Trends Across Three Eras of Machine Learning. (arXiv:2202.05924v1 [cs.LG])
    Compute, data, and algorithmic advances are the three fundamental factors that guide the progress of modern Machine Learning (ML). In this paper we study trends in the most readily quantified factor - compute. We show that before 2010 training compute grew in line with Moore's law, doubling roughly every 20 months. Since the advent of Deep Learning in the early 2010s, the scaling of training compute has accelerated, doubling approximately every 6 months. In late 2015, a new trend emerged as firms developed large-scale ML models with 10 to 100-fold larger requirements in training compute. Based on these observations we split the history of compute in ML into three eras: the Pre Deep Learning Era, the Deep Learning Era and the Large-Scale Era. Overall, our work highlights the fast-growing compute requirements for training advanced ML systems.
    Uncalibrated Models Can Improve Human-AI Collaboration. (arXiv:2202.05983v1 [cs.AI])
    In many practical applications of AI, an AI model is used as a decision aid for human users. The AI provides advice that a human (sometimes) incorporates into their decision-making process. The AI advice is often presented with some measure of "confidence" that the human can use to calibrate how much they depend on or trust the advice. In this paper, we demonstrate that presenting AI models as more confident than they actually are, even when the original AI is well-calibrated, can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first learn a model for how humans incorporate AI advice using data from thousands of human interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks -- dealing with images, text and tabular data -- involving hundreds of human participants. We further support our findings with simulation analysis. Our findings suggest the importance of and a framework for jointly optimizing the human-AI system as opposed to the standard paradigm of optimizing the AI model alone.
    Indication as Prior Knowledge for Multimodal Disease Classification in Chest Radiographs with Transformers. (arXiv:2202.06076v1 [cs.CV])
    When a clinician refers a patient for an imaging exam, they include the reason (e.g. relevant patient history, suspected disease) in the scan request; this appears as the indication field in the radiology report. The interpretation and reporting of the image are substantially influenced by this request text, steering the radiologist to focus on particular aspects of the image. We use the indication field to drive better image classification, by taking a transformer network which is unimodally pre-trained on text (BERT) and fine-tuning it for multimodal classification of a dual image-text input. We evaluate the method on the MIMIC-CXR dataset, and present ablation studies to investigate the effect of the indication field on the classification performance. The experimental results show our approach achieves 87.8 average micro AUROC, outperforming the state-of-the-art methods for unimodal (84.4) and multimodal (86.0) classification. Our code is available at https://github.com/jacenkow/mmbt.
    Coupling Online-Offline Learning for Multi-distributional Data Streams. (arXiv:2202.05996v1 [cs.LG])
    The distributions of real-life data streams are usually nonstationary, where one exciting setting is that a stream can be decomposed into several offline intervals with a fixed time horizon but different distributions and an out-of-distribution online interval. We call such data multi-distributional data streams, on which learning an on-the-fly expert for unseen samples with a desirable generalization is demanding yet highly challenging owing to the multi-distributional streaming nature, particularly when initially limited data is available for the online interval. To address these challenges, this work introduces a novel optimization method named coupling online-offline learning (CO$_2$) with theoretical guarantees about the knowledge transfer, the regret, and the generalization error. CO$_2$ extracts knowledge by training an offline expert for each offline interval and update an online expert by an off-the-shelf online optimization method in the online interval. CO$_2$ outputs a hypothesis for each sample by adaptively coupling both the offline experts and the underlying online expert through an expert-tracking strategy to adapt to the dynamic environment. To study the generalization performance of the output hypothesis, we propose a general theory to analyze its excess risk bound related to the loss function properties, the hypothesis class, the data distribution, and the regret.
    Towards a Robust and Trustworthy Machine Learning System Development: An Engineering Perspective. (arXiv:2101.03042v2 [cs.LG] UPDATED)
    While Machine Learning (ML) technologies are widely adopted in many mission critical fields to support intelligent decision-making, concerns remain about system resilience against ML-specific security attacks and privacy breaches as well as the trust that users have in these systems. In this article, we present our recent systematic and comprehensive survey on the state-of-the-art ML robustness and trustworthiness from a security engineering perspective, focusing on the problems in system threat analysis, design and evaluation faced in developing practical machine learning applications, in terms of robustness and user trust. Accordingly, we organize the presentation of this survey intended to facilitate the convey of the body of knowledge from this angle. We then describe a metamodel we created that represents the body of knowledge in a standard and visualized way. We further illustrate how to leverage the metamodel to guide a systematic threat analysis and security design process which extends and scales up the classic process. Finally, we propose the future research directions motivated by our findings. Our work differs itself from the existing surveys by (i) exploring the fundamental principles and best practices to support robust and trustworthy ML system development, and (ii) studying the interplay of robustness and user trust in the context of ML systems. We expect this survey provides a big picture for machine learning security practitioners.
    Improving Mini-batch Optimal Transport via Partial Transportation. (arXiv:2108.09645v3 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite their practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal on the mini-batch level but are partially wrong in the comparison with the optimal transportation plan between the original measures. Motivated by the misspecified mappings issue, we propose a novel mini-batch method by using partial optimal transport (POT) between mini-batch empirical measures, which we refer to as mini-batch partial optimal transport (m-POT). Leveraging the insight from the partial transportation, we explain the source of misspecified mappings from the m-OT and motivate why limiting the amount of transported masses among mini-batches via POT can alleviate the incorrect mappings. Finally, we carry out extensive experiments on various applications such as deep domain adaptation, partial domain adaptation, deep generative model, color transfer, and gradient flow to demonstrate the favorable performance of m-POT compared to current mini-batch methods.
    Optimal Spend Rate Estimation and Pacing for Ad Campaigns with Budgets. (arXiv:2202.05881v1 [cs.GT])
    Online ad platforms offer budget management tools for advertisers that aim to maximize the number of conversions given a budget constraint. As the volume of impressions, conversion rates and prices vary over time, these budget management systems learn a spend plan (to find the optimal distribution of budget over time) and run a pacing algorithm which follows the spend plan. This paper considers two models for impressions and competition that varies with time: a) an episodic model which exhibits stationarity in each episode, but each episode can be arbitrarily different from the next, and b) a model where the distributions of prices and values change slowly over time. We present the first learning theoretic guarantees on both the accuracy of spend plans and the resulting end-to-end budget management system. We present four main results: 1) for the episodic setting we give sample complexity bounds for the spend rate prediction problem: given $n$ samples from each episode, with high probability we have $|\widehat{\rho}_e - \rho_e| \leq \tilde{O}(\frac{1}{n^{1/3}})$ where $\rho_e$ is the optimal spend rate for the episode, $\widehat{\rho}_e$ is the estimate from our algorithm, 2) we extend the algorithm of Balseiro and Gur (2017) to operate on varying, approximate spend rates and show that the resulting combined system of optimal spend rate estimation and online pacing algorithm for episodic settings has regret that vanishes in number of historic samples $n$ and the number of rounds $T$, 3) for non-episodic but slowly-changing distributions we show that the same approach approximates the optimal bidding strategy up to a factor dependent on the rate-of-change of the distributions and 4) we provide experiments showing that our algorithm outperforms both static spend plans and non-pacing across a wide variety of settings.  ( 2 min )
    Blind leads Blind: A Zero-Knowledge Attack on Federated Learning. (arXiv:2202.05877v1 [cs.CR])
    Attacks on Federated Learning (FL) can severely reduce the quality of the generated models and limit the usefulness of this emerging learning paradigm that enables on-premise decentralized learning. There have been various untargeted attacks on FL, but they are not widely applicable as they i) assume that the attacker knows every update of benign clients, which is indeed sent in encrypted form to the central server, or ii) assume that the attacker has a large dataset and sufficient resources to locally train updates imitating benign parties. In this paper, we design a zero-knowledge untargeted attack (ZKA), which synthesizes malicious data to craft adversarial models without eavesdropping on the transmission of benign clients at all or requiring a large quantity of task-specific training data. To inject malicious input into the FL system by synthetic data, ZKA has two variants. ZKA-R generates adversarial ambiguous data by reversing engineering from the global models. To enable stealthiness, ZKA-G trains the local model on synthetic data from the generator that aims to synthesize images different from a randomly chosen class. Furthermore, we add a novel distance-based regularization term for both attacks to further enhance stealthiness. Experimental results on Fashion-MNIST and CIFAR-10 show that the ZKA achieves similar or even higher attack success rate than the state-of-the-art untargeted attacks against various defense mechanisms, namely more than 50% for Cifar-10 for all considered defense mechanisms. As expected, ZKA-G is better at circumventing defenses, even showing a defense pass rate of close to 90% when ZKA-R only achieves 70%. Higher data heterogeneity favours ZKA-R since detection becomes harder.  ( 2 min )
    Abstraction for Deep Reinforcement Learning. (arXiv:2202.05839v1 [cs.LG])
    We characterise the problem of abstraction in the context of deep reinforcement learning. Various well established approaches to analogical reasoning and associative memory might be brought to bear on this issue, but they present difficulties because of the need for end-to-end differentiability. We review developments in AI and machine learning that could facilitate their adoption.  ( 2 min )
    Applications of Machine Learning to Lattice Quantum Field Theory. (arXiv:2202.05838v1 [hep-lat])
    There is great potential to apply machine learning in the area of numerical lattice quantum field theory, but full exploitation of that potential will require new strategies. In this white paper for the Snowmass community planning process, we discuss the unique requirements of machine learning for lattice quantum field theory research and outline what is needed to enable exploration and deployment of this approach in the future.  ( 2 min )
  • Open

    Privacy-Aware Communication Over a Wiretap Channel with Generative Networks. (arXiv:2110.04094v2 [cs.IT] UPDATED)
    We study privacy-aware communication over a wiretap channel using end-to-end learning. Alice wants to transmit a source signal to Bob over a binary symmetric channel, while passive eavesdropper Eve tries to infer some sensitive attribute of Alice's source based on its overheard signal. Since we usually do not have access to true distributions, we propose a data-driven approach using variational autoencoder (VAE)-based joint source channel coding (JSCC). We show through simulations with the colored MNIST dataset that our approach provides high reconstruction quality at the receiver while confusing the eavesdropper about the latent sensitive attribute, which consists of the color and thickness of the digits. Finally, we consider a parallel-channel scenario, and show that our approach arranges the information transmission such that the channels with higher noise levels at the eavesdropper carry the sensitive information, while the non-sensitive information is transmitted over more vulnerable channels.  ( 2 min )
    Robust and Information-theoretically Safe Bias Classifier against Adversarial Attacks. (arXiv:2111.04404v2 [cs.LG] UPDATED)
    In this paper, the bias classifier is introduced, that is, the bias part of a DNN with Relu as the activation function is used as a classifier. The work is motivated by the fact that the bias part is a piecewise constant function with zero gradient and hence cannot be directly attacked by gradient-based methods to generate adversaries, such as FGSM. The existence of the bias classifier is proved and an effective training method for the bias classifier is given. It is proved that by adding a proper random first-degree part to the bias classifier, an information-theoretically safe classifier against the original-model gradient attack is obtained in the sense that the attack will generate a totally random attacking direction. This seems to be the first time that the concept of information-theoretically safe classifier is proposed. Several attack methods for the bias classifier are proposed and numerical experiments are used to show that the bias classifier is more robust than DNNs with similar size against these attacks in most cases.  ( 2 min )
    Blurs Behave Like Ensembles: Spatial Smoothings to Improve Accuracy, Uncertainty, and Robustness. (arXiv:2105.12639v2 [cs.LG] UPDATED)
    Neural network ensembles, such as Bayesian neural networks (BNNs), have shown success in the areas of uncertainty estimation and robustness. However, a crucial challenge prohibits their use in practice. BNNs require a large number of predictions to produce reliable results, leading to a significant increase in computational cost. To alleviate this issue, we propose spatial smoothing, a method that spatially ensembles neighboring feature map points of convolutional neural networks. By simply adding a few blur layers to the models, we empirically show that spatial smoothing improves accuracy, uncertainty estimation, and robustness of BNNs across a whole range of ensemble sizes. In particular, BNNs incorporating spatial smoothing achieve high predictive performance merely with a handful of ensembles. Moreover, this method also can be applied to canonical deterministic neural networks to improve the performances. A number of evidences suggest that the improvements can be attributed to the stabilized feature maps and the smoothing of the loss landscape. In addition, we provide a fundamental explanation for prior works - namely, global average pooling, pre-activation, and ReLU6 - by addressing them as special cases of spatial smoothing. These not only enhance accuracy, but also improve uncertainty estimation and robustness by making the loss landscape smoother in the same manner as spatial smoothing. The code is available at https://github.com/xxxnell/spatial-smoothing.  ( 2 min )
    The Curse Revisited: When are Distances Informative for the Ground Truth in Noisy High-Dimensional Data?. (arXiv:2109.10569v4 [cs.LG] UPDATED)
    Distances between data points are widely used in machine learning applications. Yet, when corrupted by noise, these distances -- and thus the models based upon them -- may lose their usefulness in high dimensions. Indeed, the small marginal effects of the noise may then accumulate quickly, shifting empirical closest and furthest neighbors away from the ground truth. In this paper, we exactly characterize such effects in noisy high-dimensional data using an asymptotic probabilistic expression. Previously, it has been argued that neighborhood queries become meaningless and unstable when distance concentration occurs, which means that there is a poor relative discrimination between the furthest and closest neighbors in the data. However, we conclude that this is not necessarily the case when we decompose the data in a ground truth -- which we aim to recover -- and noise component. More specifically, we derive that under particular conditions, empirical neighborhood relations affected by noise are still likely to be truthful even when distance concentration occurs. We also include thorough empirical verification of our results, as well as interesting experiments in which our derived `phase shift' where neighbors become random or not turns out to be identical to the phase shift where common dimensionality reduction methods perform poorly or well for recovering low-dimensional reconstructions of high-dimensional data with dense noise.  ( 2 min )
    Generalization Bounds and Representation Learning for Estimation of Potential Outcomes and Causal Effects. (arXiv:2001.07426v3 [cs.LG] UPDATED)
    Practitioners in diverse fields such as healthcare, economics and education are eager to apply machine learning to improve decision making. The cost and impracticality of performing experiments and a recent monumental increase in electronic record keeping has brought attention to the problem of evaluating decisions based on non-experimental observational data. This is the setting of this work. In particular, we study estimation of individual-level causal effects, such as a single patient's response to alternative medication, from recorded contexts, decisions and outcomes. We give generalization bounds on the error in estimated effects based on distance measures between groups receiving different treatments, allowing for sample re-weighting. We provide conditions under which our bound is tight and show how it relates to results for unsupervised domain adaptation. Led by our theoretical results, we devise representation learning algorithms that minimize our bound, by regularizing the representation's induced treatment group distance, and encourage sharing of information between treatment groups. We extend these algorithms to simultaneously learn a weighted representation to further reduce treatment group distances. Finally, an experimental evaluation on real and synthetic data shows the value of our proposed representation architecture and regularization scheme.  ( 2 min )
    Learning by Doing: Controlling a Dynamical System using Causality, Control, and Reinforcement Learning. (arXiv:2202.06052v1 [cs.LG])
    Questions in causality, control, and reinforcement learning go beyond the classical machine learning task of prediction under i.i.d. observations. Instead, these fields consider the problem of learning how to actively perturb a system to achieve a certain effect on a response variable. Arguably, they have complementary views on the problem: In control, one usually aims to first identify the system by excitation strategies to then apply model-based design techniques to control the system. In (non-model-based) reinforcement learning, one directly optimizes a reward. In causality, one focus is on identifiability of causal structure. We believe that combining the different views might create synergies and this competition is meant as a first step toward such synergies. The participants had access to observational and (offline) interventional data generated by dynamical systems. Track CHEM considers an open-loop problem in which a single impulse at the beginning of the dynamics can be set, while Track ROBO considers a closed-loop problem in which control variables can be set at each time step. The goal in both tracks is to infer controls that drive the system to a desired state. Code is open-sourced ( https://github.com/LearningByDoingCompetition/learningbydoing-comp ) to reproduce the winning solutions of the competition and to facilitate trying out new methods on the competition tasks.
    Benign Overfitting in Two-layer Convolutional Neural Networks. (arXiv:2202.06526v1 [cs.LG])
    Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as "benign overfitting". Recently, there emerges a line of works studying "benign overfitting" from the theoretical perspective. However, they are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve constant level test loss. These together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.
    Adaptive Regret for Control of Time-Varying Dynamics. (arXiv:2007.04393v3 [cs.LG] UPDATED)
    We consider the problem of online control of systems with time-varying linear dynamics. This is a general formulation that is motivated by the use of local linearization in control of nonlinear dynamical systems. To state meaningful guarantees over changing environments, we introduce the metric of {\it adaptive regret} to the field of control. This metric, originally studied in online learning, measures performance in terms of regret against the best policy in hindsight on {\it any interval in time}, and thus captures the adaptation of the controller to changing dynamics. Our main contribution is a novel efficient meta-algorithm: it converts a controller with sublinear regret bounds into one with sublinear {\it adaptive regret} bounds in the setting of time-varying linear dynamical systems. The main technical innovation is the first adaptive regret bound for the more general framework of online convex optimization with memory. Furthermore, we give a lower bound showing that our attained adaptive regret bound is nearly tight for this general framework.
    CausalSim: Unbiased Trace-Driven Network Simulation. (arXiv:2201.01811v2 [cs.LG] UPDATED)
    Evaluating the real-world performance of network protocols is challenging. Randomized control trials (RCT) are expensive and inaccessible to most, while simulators fail to capture complex behaviors in real networks. We present CausalSim, a trace-driven counterfactual simulator for network protocols that addresses this challenge. Counterfactual simulation aims to predict what would happen using different protocols under the same conditions as a given trace. This is complicated due to the bias introduced by the protocols used during trace collection. CausalSim uses traces from an initial RCT under a set of protocols to learn a causal network model, effectively removing the biases present in the data. Key to CausalSim is mapping the task of counterfactual simulation to that of tensor completion with extremely sparse observations. Through an adversarial neural network training technique that exploits distributional invariances that are present in training data coming from an RCT, CausalSim enables a novel tensor completion method despite the sparsity of observations. Our extensive evaluation of CausalSim on both real and synthetic datasets and two use cases, including more than nine months of real data from the Puffer video streaming system, shows that it provides accurate counterfactual predictions, reducing prediction error by 44% and 53% on average compared to expert-designed and supervised learning baselines.
    Off-Policy Evaluation for Large Action Spaces via Embeddings. (arXiv:2202.06317v1 [cs.LG])
    Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.
    Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks. (arXiv:1905.12917v3 [cs.LG] UPDATED)
    While tasks could come with varying the number of instances and classes in realistic settings, the existing meta-learning approaches for few-shot classification assume that the number of instances per task and class is fixed. Due to such restriction, they learn to equally utilize the meta-knowledge across all the tasks, even when the number of instances per task and class largely varies. Moreover, they do not consider distributional difference in unseen tasks, on which the meta-knowledge may have less usefulness depending on the task relatedness. To overcome these limitations, we propose a novel meta-learning model that adaptively balances the effect of the meta-learning and task-specific learning within each task. Through the learning of the balancing variables, we can decide whether to obtain a solution by relying on the meta-knowledge or task-specific learning. We formulate this objective into a Bayesian inference framework and tackle it using variational inference. We validate our Bayesian Task-Adaptive Meta-Learning (Bayesian TAML) on multiple realistic task- and class-imbalanced datasets, on which it significantly outperforms existing meta-learning approaches. Further ablation study confirms the effectiveness of each balancing component and the Bayesian learning framework.
    Stochastic Multi-level Composition Optimization Algorithms with Level-Independent Convergence Rates. (arXiv:2008.10526v5 [math.OC] UPDATED)
    In this paper, we study smooth stochastic multi-level composition optimization problems, where the objective function is a nested composition of $T$ functions. We assume access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle. For solving this class of problems, we propose two algorithms using moving-average stochastic estimates, and analyze their convergence to an $\epsilon$-stationary point of the problem. We show that the first algorithm, which is a generalization of \cite{GhaRuswan20} to the $T$ level case, can achieve a sample complexity of $\mathcal{O}(1/\epsilon^6)$ by using mini-batches of samples in each iteration. By modifying this algorithm using linearized stochastic estimates of the function values, we improve the sample complexity to $\mathcal{O}(1/\epsilon^4)$. {\color{black}This modification not only removes the requirement of having a mini-batch of samples in each iteration, but also makes the algorithm parameter-free and easy to implement}. To the best of our knowledge, this is the first time that such an online algorithm designed for the (un)constrained multi-level setting, obtains the same sample complexity of the smooth single-level setting, under standard assumptions (unbiasedness and boundedness of the second moments) on the stochastic first-order oracle.
    Extracting the Subhalo Mass Function from Strong Lens Images with Image Segmentation. (arXiv:2009.06639v3 [astro-ph.CO] UPDATED)
    Detecting substructure within strongly lensed images is a promising route to shed light on the nature of dark matter. However, it is a challenging task, which traditionally requires detailed lens modeling and source reconstruction, taking weeks to analyze each system. We use machine-learning to circumvent the need for lens and source modeling and develop a neural network to both locate subhalos in an image as well as determine their mass using the technique of image segmentation. The network is trained on images with a single subhalo located near the Einstein ring across a wide range of apparent source magnitudes. The network is then able to resolve subhalos with masses $m\gtrsim 10^{8.5} M_{\odot}$. Training in this way allows the network to learn the gravitational lensing of light, and remarkably, it is then able to detect entire populations of substructure, even for locations further away from the Einstein ring than those used in training. Over a wide range of the apparent source magnitude, the false-positive rate is around three false subhalos per 100 images, coming mostly from the lightest detectable subhalo for that signal-to-noise ratio. With good accuracy and a low false-positive rate, counting the number of pixels assigned to each subhalo class over multiple images allows for a measurement of the subhalo mass function (SMF). When measured over three mass bins from $10^9M_{\odot}$--$10^{10} M_{\odot}$ the SMF slope is recovered with an error of 36% for 50 images, and this improves to 10% for 1000 images with Hubble Space Telescope-like noise.
    Topologically Regularized Data Embeddings. (arXiv:2110.09193v3 [cs.LG] UPDATED)
    Unsupervised feature learning often finds low-dimensional embeddings that capture the structure of complex data. For tasks for which prior expert topological knowledge is available, incorporating this into the learned representation may lead to higher quality embeddings. For example, this may help one to embed the data into a given number of clusters, or to accommodate for noise that prevents one from deriving the distribution of the data over the model directly, which can then be learned more effectively. However, a general tool for integrating different prior topological knowledge into embeddings is lacking. Although differentiable topology layers have been recently developed that can (re)shape embeddings into prespecified topological models, they have two important limitations for representation learning, which we address in this paper. First, the currently suggested topological losses fail to represent simple models such as clusters and flares in a natural manner. Second, these losses neglect all original structural (such as neighborhood) information in the data that is useful for learning. We overcome these limitations by introducing a new set of topological losses, and proposing their usage as a way for \emph{topologically regularizing} data embeddings to naturally represent a prespecified model. We include thorough experiments on synthetic and real data that highlight the usefulness and versatility of this approach, with applications ranging from modeling high-dimensional single-cell data, to graph embedding.
    Tensor Moments of Gaussian Mixture Models: Theory and Applications. (arXiv:2202.06930v1 [stat.ML])
    Gaussian mixture models (GMM) are fundamental tools in statistical and data sciences. We study the moments of multivariate Gaussians and GMMs. The $d$-th moment of an $n$-dimensional random variable is a symmetric $d$-way tensor of size $n^d$, so working with moments naively is assumed to be prohibitively expensive for $d>2$ and larger values of $n$. In this work, we develop theory and numerical methods for implicit computations with moment tensors of GMMs, reducing the computational and storage costs to $\mathcal{O}(n^2)$ and $\mathcal{O}(n^3)$, respectively, for general covariance matrices, and to $\mathcal{O}(n)$ and $\mathcal{O}(n)$, respectively, for diagonal ones. We derive concise analytic expressions for the moments in terms of symmetrized tensor products, relying on the correspondence between symmetric tensors and homogeneous polynomials, and combinatorial identities involving Bell polynomials. The primary application of this theory is to estimating GMM parameters from a set of observations, when formulated as a moment-matching optimization problem. If there is a known and common covariance matrix, we also show it is possible to debias the data observations, in which case the problem of estimating the unknown means reduces to symmetric CP tensor decomposition. Numerical results validate and illustrate the numerical efficiency of our approaches. This work potentially opens the door to the competitiveness of the method of moments as compared to expectation maximization methods for parameter estimation of GMMs.
    Neural Spectral Marked Point Processes. (arXiv:2106.10773v3 [cs.LG] UPDATED)
    Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. To date, most existing models assume stationary kernels (including the classical Hawkes processes) and simple parametric models. Modern applications with complex event data require more general point process models that can incorporate contextual information of the events, called marks, besides the temporal and location information. Moreover, such applications often require non-stationary models to capture more complex spatio-temporal dependence. To tackle these challenges, a key question is to devise a versatile influence kernel in the point process model. In this paper, we introduce a novel and general neural network-based non-stationary influence kernel with high expressiveness for handling complex discrete events data while providing theoretical performance guarantees. We demonstrate the superior performance of our proposed method compared with the state-of-the-art on synthetic and real data.
    An online learning approach to dynamic pricing and capacity sizing in service systems. (arXiv:2009.02911v2 [math.PR] UPDATED)
    We study a dynamic pricing and capacity sizing problem in a GI/GI/1 queue, where the service provider's objective is to obtain the optimal service fee $p$ and service capacity $\mu$ so as to maximize cumulative expected profit (the service revenue minus the staffing cost and delay penalty). Due to the complex nature of the queueing dynamics, such a problem has no analytic solution so that previous research often resorts to heavy-traffic analysis in that both the arrival rate and service rate are sent to infinity. In this work we propose an online learning framework designed for solving this problem which does not require the system's scale to increase. Our algorithm organizes the time horizon into successive operational cycles and prescribes an efficient procedure to obtain improved pricing and staffing policies in each cycle using data collected in previous cycles. Data here include the number of customer arrivals, waiting times, and the server's busy times. The ingenuity of this approach lies in its online nature, which allows the service provider do better by interacting with the environment. Effectiveness of our online learning algorithm is substantiated by (i) theoretical results including the algorithm convergence and regret analysis (with a logarithmic regret bound), and (ii) engineering confirmation via simulation experiments of a variety of representative GI/GI/1 queues.
    Mastering Atari with Discrete World Models. (arXiv:2010.02193v4 [cs.LG] UPDATED)
    Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasible for some tasks, modeling Atari games accurately enough to derive successful behaviors has remained an open challenge for many years. We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model. The world model uses discrete representations and is trained separately from the policy. DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model. With the same computational budget and wall-clock time, Dreamer V2 reaches 200M frames and surpasses the final performance of the top single-GPU agents IQN and Rainbow. DreamerV2 is also applicable to tasks with continuous actions, where it learns an accurate world model of a complex humanoid robot and solves stand-up and walking from only pixel inputs.
    Improving Mini-batch Optimal Transport via Partial Transportation. (arXiv:2108.09645v3 [stat.ML] UPDATED)
    Mini-batch optimal transport (m-OT) has been widely used recently to deal with the memory issue of OT in large-scale applications. Despite their practicality, m-OT suffers from misspecified mappings, namely, mappings that are optimal on the mini-batch level but are partially wrong in the comparison with the optimal transportation plan between the original measures. Motivated by the misspecified mappings issue, we propose a novel mini-batch method by using partial optimal transport (POT) between mini-batch empirical measures, which we refer to as mini-batch partial optimal transport (m-POT). Leveraging the insight from the partial transportation, we explain the source of misspecified mappings from the m-OT and motivate why limiting the amount of transported masses among mini-batches via POT can alleviate the incorrect mappings. Finally, we carry out extensive experiments on various applications such as deep domain adaptation, partial domain adaptation, deep generative model, color transfer, and gradient flow to demonstrate the favorable performance of m-POT compared to current mini-batch methods.
    Sharp Bounds for Federated Averaging (Local SGD) and Continuous Perspective. (arXiv:2111.03741v2 [cs.LG] UPDATED)
    Federated Averaging (FedAvg), also known as Local SGD, is one of the most popular algorithms in Federated Learning (FL). Despite its simplicity and popularity, the convergence rate of FedAvg has thus far been undetermined. Even under the simplest assumptions (convex, smooth, homogeneous, and bounded covariance), the best-known upper and lower bounds do not match, and it is not clear whether the existing analysis captures the capacity of the algorithm. In this work, we first resolve this question by providing a lower bound for FedAvg that matches the existing upper bound, which shows the existing FedAvg upper bound analysis is not improvable. Additionally, we establish a lower bound in a heterogeneous setting that nearly matches the existing upper bound. While our lower bounds show the limitations of FedAvg, under an additional assumption of third-order smoothness, we prove more optimistic state-of-the-art convergence results in both convex and non-convex settings. Our analysis stems from a notion we call iterate bias, which is defined by the deviation of the expectation of the SGD trajectory from the noiseless gradient descent trajectory with the same initialization. We prove novel sharp bounds on this quantity, and show intuitively how to analyze this quantity from a Stochastic Differential Equation (SDE) perspective.
    Trace norm regularization for multi-task learning with scarce data. (arXiv:2202.06742v1 [stat.ML])
    Multi-task learning leverages structural similarities between multiple tasks to learn despite very few samples. Motivated by the recent success of neural networks applied to data-scarce tasks, we consider a linear low-dimensional shared representation model. Despite an extensive literature, existing theoretical results either guarantee weak estimation rates or require a large number of samples per task. This work provides the first estimation error bound for the trace norm regularized estimator when the number of samples per task is small. The advantages of trace norm regularization for learning data-scarce tasks extend to meta-learning and are confirmed empirically on synthetic datasets.
    Near-optimal Local Convergence of Alternating Gradient Descent-Ascent for Minimax Optimization. (arXiv:2102.09468v3 [cs.LG] UPDATED)
    Smooth minimax games often proceed by simultaneous or alternating gradient updates. Although algorithms with alternating updates are commonly used in practice, the majority of existing theoretical analyses focus on simultaneous algorithms for convenience of analysis. In this paper, we study alternating gradient descent-ascent (Alt-GDA) in minimax games and show that Alt-GDA is superior to its simultaneous counterpart~(Sim-GDA) in many settings. We prove that Alt-GDA achieves a near-optimal local convergence rate for strongly convex-strongly concave (SCSC) problems while Sim-GDA converges at a much slower rate. To our knowledge, this is the \emph{first} result of any setting showing that Alt-GDA converges faster than Sim-GDA by more than a constant. We further adapt the theory of integral quadratic constraints (IQC) and show that Alt-GDA attains the same rate \emph{globally} for a subclass of SCSC minimax problems. Empirically, we demonstrate that alternating updates speed up GAN training significantly and the use of optimism only helps for simultaneous algorithms.
    On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game. (arXiv:2110.09771v2 [cs.LG] UPDATED)
    To achieve sample efficiency in reinforcement learning (RL), it necessitates efficiently exploring the underlying environment. Under the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by such a challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes the policy via a planning algorithm with offline data collected in the exploration phase. Moreover, we tackle this problem under the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Moreover, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games and prove that our methods can achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for generating a $\varepsilon$-suboptimal policy or $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
    A Private and Computationally-Efficient Estimator for Unbounded Gaussians. (arXiv:2111.04609v2 [stat.ML] UPDATED)
    We give the first polynomial-time, polynomial-sample, differentially private estimator for the mean and covariance of an arbitrary Gaussian distribution $\mathcal{N}(\mu,\Sigma)$ in $\mathbb{R}^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $\mu$ and $\Sigma$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $\mathcal{N}(0,\Sigma)$ and returns a matrix $A$ such that $A \Sigma A^T$ has constant condition number.
    Stochastic linear optimization never overfits with quadratically-bounded losses on general data. (arXiv:2202.06915v1 [cs.LG])
    This work shows that a diverse collection of linear optimization methods, when run on general data, fail to overfit, despite lacking any explicit constraints or regularization: with high probability, their trajectories stay near the curve of optimal constrained solutions over the population distribution. This analysis is powered by an elementary but flexible proof scheme which can handle many settings, summarized as follows. Firstly, the data can be general: unlike other implicit bias works, it need not satisfy large margin or other structural conditions, and moreover can arrive sequentially IID, sequentially following a Markov chain, as a batch, and lastly it can have heavy tails. Secondly, while the main analysis is for mirror descent, rates are also provided for the Temporal-Difference fixed-point method from reinforcement learning; all prior high probability analyses in these settings required bounded iterates, bounded updates, bounded noise, or some equivalent. Thirdly, the losses are general, and for instance the logistic and squared losses can be handled simultaneously, unlike other implicit bias works. In all of these settings, not only is low population error guaranteed with high probability, but moreover low sample complexity is guaranteed so long as there exists any low-complexity near-optimal solution, even if the global problem structure and in particular global optima have high complexity.
    Robust Estimation of Discrete Distributions under Local Differential Privacy. (arXiv:2202.06825v1 [math.ST])
    Although robust learning and local differential privacy are both widely studied fields of research, combining the two settings is an almost unexplored topic. We consider the problem of estimating a discrete distribution in total variation from $n$ contaminated data batches under a local differential privacy constraint. A fraction $1-\epsilon$ of the batches contain $k$ i.i.d. samples drawn from a discrete distribution $p$ over $d$ elements. To protect the users' privacy, each of the samples is privatized using an $\alpha$-locally differentially private mechanism. The remaining $\epsilon n $ batches are an adversarial contamination. The minimax rate of estimation under contamination alone, with no privacy, is known to be $\epsilon/\sqrt{k}+\sqrt{d/kn}$, up to a $\sqrt{\log(1/\epsilon)}$ factor. Under the privacy constraint alone, the minimax rate of estimation is $\sqrt{d^2/\alpha^2 kn}$. We show that combining the two constraints leads to a minimax estimation rate of $\epsilon\sqrt{d/\alpha^2 k}+\sqrt{d^2/\alpha^2 kn}$ up to a $\sqrt{\log(1/\epsilon)}$ factor, larger than the sum of the two separate rates. We provide a polynomial-time algorithm achieving this bound, as well as a matching information theoretic lower bound.  ( 2 min )
    Counterfactual inference for sequential experimental design. (arXiv:2202.06891v1 [stat.ML])
    We consider the problem of counterfactual inference in sequentially designed experiments wherein a collection of $\mathbf{N}$ units each undergo a sequence of interventions for $\mathbf{T}$ time periods, based on policies that sequentially adapt over time. Our goal is counterfactual inference, i.e., estimate what would have happened if alternate policies were used, a problem that is inherently challenging due to the heterogeneity in the outcomes across units and time. To tackle this task, we introduce a suitable latent factor model where the potential outcomes are determined by exogenous unit and time level latent factors. Under suitable conditions, we show that it is possible to estimate the missing (potential) outcomes using a simple variant of nearest neighbors. First, assuming a bilinear latent factor model and allowing for an arbitrary adaptive sampling policy, we establish a distribution-free non-asymptotic guarantee for estimating the missing outcome of any unit at any time; under suitable regularity condition, this guarantee implies that our estimator is consistent. Second, for a generic non-parametric latent factor model, we establish that the estimate for the missing outcome of any unit at time $\mathbf{T}$ satisfies a central limit theorem as $\mathbf{T} \to \infty$, under suitable regularity conditions. Finally, en route to establishing this central limit theorem, we establish a non-asymptotic mean-squared-error bound for the estimate of the missing outcome of any unit at time $\mathbf{T}$. Our work extends the recently growing literature on inference with adaptively collected data by allowing for policies that pool across units, and also compliments the matrix completion literature when the entries are revealed sequentially in an arbitrarily dependent manner based on prior observed data.
    Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics. (arXiv:2009.00774v4 [cs.LG] UPDATED)
    Poisoning attacks on Reinforcement Learning (RL) systems could take advantage of RL algorithm's vulnerabilities and cause failure of the learning. However, prior works on poisoning RL usually either unrealistically assume the attacker knows the underlying Markov Decision Process (MDP), or directly apply the poisoning methods in supervised learning to RL. In this work, we build a generic poisoning framework for online RL via a comprehensive investigation of heterogeneous poisoning models in RL. Without any prior knowledge of the MDP, we propose a strategic poisoning algorithm called Vulnerability-Aware Adversarial Critic Poison (VA2C-P), which works for most policy-based deep RL agents, closing the gap that no poisoning method exists for policy-based RL agents. VA2C-P uses a novel metric, stability radius in RL, that measures the vulnerability of RL algorithms. Experiments on multiple deep RL agents and multiple environments show that our poisoning algorithm successfully prevents agents from learning a good policy or teaches the agents to converge to a target policy, with a limited attacking budget.
    Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost. (arXiv:2202.06385v1 [cs.LG])
    We study the problem of reinforcement learning (RL) with low (policy) switching cost - a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S,A$ denotes the number of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. We also prove an information-theoretical lower bound which says that a switching cost of $\Omega(HSA)$ is required for any no-regret algorithm. As a byproduct, our new algorithmic techniques allow us to derive a \emph{reward-free} exploration algorithm with an optimal switching cost of $O(HSA)$.
    Action and Perception as Divergence Minimization. (arXiv:2009.01791v3 [cs.AI] UPDATED)
    To learn directed behaviors in complex environments, intelligent agents need to optimize objective functions. Various objectives are known for designing artificial agents, including task rewards and intrinsic motivation. However, it is unclear how the known objectives relate to each other, which objectives remain yet to be discovered, and which objectives better describe the behavior of humans. We introduce the Action Perception Divergence (APD), an approach for categorizing the space of possible objective functions for embodied agents. We show a spectrum that reaches from narrow to general objectives. While the narrow objectives correspond to domain-specific rewards as typical in reinforcement learning, the general objectives maximize information with the environment through latent variable models of input sequences. Intuitively, these agents use perception to align their beliefs with the world and use actions to align the world with their beliefs. They infer representations that are informative of past inputs, explore future inputs that are informative of their representations, and select actions or skills that maximally influence future inputs. This explains a wide range of unsupervised objectives from a single principle, including representation learning, information gain, empowerment, and skill discovery. Our findings suggest leveraging powerful world models for unsupervised exploration as a path toward highly adaptive agents that seek out large niches in their environments, rendering task rewards optional.
    fairmodels: A Flexible Tool For Bias Detection, Visualization, And Mitigation. (arXiv:2104.00507v2 [stat.ML] UPDATED)
    Machine learning decision systems are getting omnipresent in our lives. From dating apps to rating loan seekers, algorithms affect both our well-being and future. Typically, however, these systems are not infallible. Moreover, complex predictive models are really eager to learn social biases present in historical data that can lead to increasing discrimination. If we want to create models responsibly then we need tools for in-depth validation of models also from the perspective of potential discrimination. This article introduces an R package fairmodels that helps to validate fairness and eliminate bias in classification models in an easy and flexible fashion. The fairmodels package offers a model-agnostic approach to bias detection, visualization and mitigation. The implemented set of functions and fairness metrics enables model fairness validation from different perspectives. The package includes a series of methods for bias mitigation that aim to diminish the discrimination in the model. The package is designed not only to examine a single model, but also to facilitate comparisons between multiple models.
    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. (arXiv:2201.03544v2 [cs.LG] UPDATED)
    Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
    Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data. (arXiv:2202.05928v1 [cs.LG])
    Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We assume the data comes from well-separated class-conditional log-concave distributions and allow for a constant fraction of the training labels to be corrupted by an adversary. We show that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve test error close to the Bayes-optimal error. In contrast to previous work on benign overfitting that require linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.
    An Evaluation of Change Point Detection Algorithms. (arXiv:2003.06222v3 [stat.ML] UPDATED)
    Change point detection is an important part of time series analysis, as the presence of a change point indicates an abrupt and significant change in the data generating process. While many algorithms for change point detection have been proposed, comparatively little attention has been paid to evaluating their performance on real-world time series. Algorithms are typically evaluated on simulated data and a small number of commonly-used series with unreliable ground truth. Clearly this does not provide sufficient insight into the comparative performance of these algorithms. Therefore, instead of developing yet another change point detection method, we consider it vastly more important to properly evaluate existing algorithms on real-world data. To achieve this, we present a data set specifically designed for the evaluation of change point detection algorithms that consists of 37 time series from various application domains. Each series was annotated by five human annotators to provide ground truth on the presence and location of change points. We analyze the consistency of the human annotators, and describe evaluation metrics that can be used to measure algorithm performance in the presence of multiple ground truth annotations. Next, we present a benchmark study where 14 algorithms are evaluated on each of the time series in the data set. Our aim is that this data set will serve as a proving ground in the development of novel change point detection algorithms.  ( 2 min )
    KNIFE: Kernelized-Neural Differential Entropy Estimation. (arXiv:2202.06618v1 [cs.LG])
    Mutual Information (MI) has been widely used as a loss regularizer for training neural networks. This has been particularly effective when learn disentangled or compressed representations of high dimensional data. However, differential entropy (DE), another fundamental measure of information, has not found widespread use in neural network training. Although DE offers a potentially wider range of applications than MI, off-the-shelf DE estimators are either non differentiable, computationally intractable or fail to adapt to changes in the underlying distribution. These drawbacks prevent them from being used as regularizers in neural networks training. To address shortcomings in previously proposed estimators for DE, here we introduce KNIFE, a fully parameterized, differentiable kernel-based estimator of DE. The flexibility of our approach also allows us to construct KNIFE-based estimators for conditional (on either discrete or continuous variables) DE, as well as MI. We empirically validate our method on high-dimensional synthetic data and further apply it to guide the training of neural networks for real-world tasks. Our experiments on a large variety of tasks, including visual domain adaptation, textual fair classification, and textual fine-tuning demonstrate the effectiveness of KNIFE-based estimation. Code can be found at https://github.com/g-pichler/knife.  ( 2 min )
    Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation. (arXiv:2202.06881v1 [cs.LG])
    We propose Active Surrogate Estimators (ASEs), a new method for label-efficient model evaluation. Evaluating model performance is a challenging and important problem when labels are expensive. ASEs address this active testing problem using a surrogate-based estimation approach, whereas previous methods have focused on Monte Carlo estimates. ASEs actively learn the underlying surrogate, and we propose a novel acquisition strategy, XWING, that tailors this learning to the final estimation task. We find that ASEs offer greater label-efficiency than the current state-of-the-art when applied to challenging model evaluation problems for deep neural networks. We further theoretically analyze ASEs' errors.  ( 2 min )
    LEADS: Learning Dynamical Systems that Generalize Across Environments. (arXiv:2106.04546v2 [cs.LG] UPDATED)
    When modeling dynamical systems from real-world data samples, the distribution of data often changes according to the environment in which they are captured, and the dynamics of the system itself vary from one environment to another. Generalizing across environments thus challenges the conventional frameworks. The classical settings suggest either considering data as i.i.d. and learning a single model to cover all situations or learning environment-specific models. Both are sub-optimal: the former disregards the discrepancies between environments leading to biased solutions, while the latter does not exploit their potential commonalities and is prone to scarcity problems. We propose LEADS, a novel framework that leverages the commonalities and discrepancies among known environments to improve model generalization. This is achieved with a tailored training formulation aiming at capturing common dynamics within a shared model while additional terms capture environment-specific dynamics. We ground our approach in theory, exhibiting a decrease in sample complexity with our approach and corroborate these results empirically, instantiating it for linear dynamics. Moreover, we concretize this framework for neural networks and evaluate it experimentally on representative families of nonlinear dynamics. We show that this new setting can exploit knowledge extracted from environment-dependent data and improves generalization for both known and novel environments. Code is available at https://github.com/yuan-yin/LEADS.  ( 2 min )
    On the cost of Bayesian posterior mean strategy for log-concave models. (arXiv:2010.06420v2 [math.PR] UPDATED)
    In this paper, we investigate the problem of computing Bayesian estimators using Langevin Monte-Carlo type approximation. The novelty of this paper is to consider together the statistical and numerical counterparts (in a general log-concave setting). More precisely, we address the following question: given $n$ observations in $\mathbb{R}^q$ distributed under an unknown probability $\mathbb{P}_{\theta^\star}$ with $\theta^\star \in \mathbb{R}^d$ , what is the optimal numerical strategy and its cost for the approximation of $\theta^\star$ with the Bayesian posterior mean? To answer this question, we establish some quantitative statistical bounds related to the underlying Poincar\'e constant of the model and establish new results about the numerical approximation of Gibbs measures by Cesaro averages of Euler schemes of (over-damped) Langevin diffusions. These last results include in particular some quantitative controls in the weakly convex case based on new bounds on the solution of the related Poisson equation of the diffusion.  ( 2 min )
    Approximate Inference via Clustering. (arXiv:2111.14219v2 [cs.LG] UPDATED)
    In recent years, large-scale Bayesian learning draws a great deal of attention. However, in big-data era, the amount of data we face is growing much faster than our ability to deal with it. Fortunately, it is observed that large-scale datasets usually own rich internal structure and is somewhat redundant. In this paper, we attempt to simplify the Bayesian posterior via exploiting this structure. Specifically, we restrict our interest to the so-called well-clustered datasets and construct an \emph{approximate posterior} according to the clustering information. Fortunately, the clustering structure can be efficiently obtained via a particular clustering algorithm. When constructing the approximate posterior, the data points in the same cluster are all replaced by the centroid of the cluster. As a result, the posterior can be significantly simplified. Theoretically, we show that under certain conditions the approximate posterior we construct is close (measured by KL divergence) to the exact posterior. Furthermore, thorough experiments are conducted to validate the fact that the constructed posterior is a good approximation to the true posterior and much easier to sample from.  ( 2 min )
    Meta Dropout: Learning to Perturb Features for Generalization. (arXiv:1905.12914v3 [cs.LG] UPDATED)
    A machine learning model that generalizes well should obtain low errors on unseen test examples. Thus, if we know how to optimally perturb training examples to account for test examples, we may achieve better generalization performance. However, obtaining such perturbation is not possible in standard machine learning frameworks as the distribution of the test data is unknown. To tackle this challenge, we propose a novel regularization method, meta-dropout, which learns to perturb the latent features of training examples for generalization in a meta-learning framework. Specifically, we meta-learn a noise generator which outputs a multiplicative noise distribution for latent features, to obtain low errors on the test instances in an input-dependent manner. Then, the learned noise generator can perturb the training examples of unseen tasks at the meta-test time for improved generalization. We validate our method on few-shot classification datasets, whose results show that it significantly improves the generalization performance of the base model, and largely outperforms existing regularization methods such as information bottleneck, manifold mixup, and information dropout.  ( 2 min )
    Distributed Sparse Normal Means Estimation with Sublinear Communication. (arXiv:2102.03060v2 [stat.ML] UPDATED)
    We consider the problem of sparse normal means estimation in a distributed setting with communication constraints. We assume there are $M$ machines, each holding $d$-dimensional observations of a $K$-sparse vector $\mu$ corrupted by additive Gaussian noise. The $M$ machines are connected in a star topology to a fusion center, whose goal is to estimate the vector $\mu$ with a low communication budget. Previous works have shown that to achieve the centralized minimax rate for the $\ell_2$ risk, the total communication must be high - at least linear in the dimension $d$. This phenomenon occurs, however, at very weak signals. We show that at signal-to-noise ratios (SNRs) that are sufficiently high - but not enough for recovery by any individual machine - the support of $\mu$ can be correctly recovered with significantly less communication. Specifically, we present two algorithms for distributed estimation of a sparse mean vector corrupted by either Gaussian or sub-Gaussian noise. We then prove that above certain SNR thresholds, with high probability, these algorithms recover the correct support with total communication that is sublinear in the dimension $d$. Furthermore, the communication decreases exponentially as a function of signal strength. If in addition $KM\ll \tfrac{d}{\log d}$, then with an additional round of sublinear communication, our algorithms achieve the centralized rate for the $\ell_2$ risk. Finally, we present simulations that illustrate the performance of our algorithms in different parameter regimes.  ( 2 min )
    Boosting Barely Robust Learners: A New Perspective on Adversarial Robustness. (arXiv:2202.05920v1 [cs.LG])
    We present an oracle-efficient algorithm for boosting the adversarial robustness of barely robust learners. Barely robust learning algorithms learn predictors that are adversarially robust only on a small fraction $\beta \ll 1$ of the data distribution. Our proposed notion of barely robust learning requires robustness with respect to a "larger" perturbation set; which we show is necessary for strongly robust learning, and that weaker relaxations are not sufficient for strongly robust learning. Our results reveal a qualitative and quantitative equivalence between two seemingly unrelated problems: strongly robust learning and barely robust learning.  ( 2 min )
    Exact Statistical Inference for Time Series Similarity using Dynamic Time Warping by Selective Inference. (arXiv:2202.06593v1 [stat.ML])
    In this paper, we study statistical inference on the similarity/distance between two time-series under uncertain environment by considering a statistical hypothesis test on the distance obtained from Dynamic Time Warping (DTW) algorithm. The sampling distribution of the DTW distance is too complicated to derive because it is obtained based on the solution of a complicated algorithm. To circumvent this difficulty, we propose to employ a conditional sampling distribution for the inference, which enables us to derive an exact (non-asymptotic) inference method on the DTW distance. Besides, we also develop a novel computational method to compute the conditional sampling distribution. To our knowledge, this is the first method that can provide valid $p$-value to quantify the statistical significance of the DTW distance, which is helpful for high-stake decision making. We evaluate the performance of the proposed inference method on both synthetic and real-world datasets.  ( 2 min )
    SDE approximations of GANs training and its long-run behavior. (arXiv:2006.02047v5 [cs.LG] UPDATED)
    This paper analyzes the training process of GANs via stochastic differential equations (SDEs). It first establishes SDE approximations for the training of GANs under stochastic gradient algorithms, with precise error bound analysis. It then describes the long-run behavior of GANs training via the invariant measures of its SDE approximations under proper conditions. This work builds theoretical foundation for GANs training and provides analytical tools to study its evolution and stability.  ( 2 min )
    Continuous-time stochastic gradient descent for optimizing over the stationary distribution of stochastic differential equations. (arXiv:2202.06637v1 [cs.LG])
    We develop a new continuous-time stochastic gradient descent method for optimizing over the stationary distribution of stochastic differential equation (SDE) models. The algorithm continuously updates the SDE model's parameters using an estimate for the gradient of the stationary distribution. The gradient estimate is simultaneously updated, asymptotically converging to the direction of steepest descent. We rigorously prove convergence of our online algorithm for linear SDE models and present numerical results for nonlinear examples. The proof requires analysis of the fluctuations of the parameter evolution around the direction of steepest descent. Bounds on the fluctuations are challenging to obtain due to the online nature of the algorithm (e.g., the stationary distribution will continuously change as the parameters change). We prove bounds for the solutions of a new class of Poisson partial differential equations, which are then used to analyze the parameter fluctuations in the algorithm.  ( 2 min )
    Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality. (arXiv:2202.06450v1 [cs.LG])
    Deployment efficiency is an important criterion for many real-world applications of reinforcement learning (RL). Despite the community's increasing interest, there lacks a formal theoretical formulation for the problem. In this paper, we propose such a formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal \emph{deployment complexity}, whereas in each deployment the policy can sample a large batch of data. Using finite-horizon linear MDPs as a concrete structural model, we reveal the fundamental limit in achieving deployment efficiency by establishing information-theoretic lower bounds, and provide algorithms that achieve the optimal deployment efficiency. Moreover, our formulation for DE-RL is flexible and can serve as a building block for other practically relevant settings; we give "Safe DE-RL" and "Sample-Efficient DE-RL" as two examples, which may be worth future investigation.  ( 2 min )
    Accelerating Stochastic Simulation with Spatiotemporal Neural Processes. (arXiv:2106.02770v4 [cs.LG] UPDATED)
    Stochastic simulations such as large-scale, spatiotemporal, age-structured epidemic models are computationally expensive at fine-grained resolution. We propose Spatiotemporal Neural Processes (STNP), a neural latent variable model to mimic the spatiotemporal dynamics of stochastic simulators. To further speed up training, we use a Bayesian active learning strategy to proactively query the simulator, gather more data, and continuously improve the model. Our model can automatically infer the latent processes which describe the intrinsic uncertainty of the simulator. This also gives rise to a new acquisition function based on latent information gain. Theoretical analysis demonstrates that our approach reduces sample complexity compared with random sampling in high dimension. Empirically, we demonstrate that our framework can faithfully imitate the behavior of a complex infectious disease simulator with a small number of examples, enabling rapid simulation and scenario exploration.  ( 2 min )
    High-Dimensional Linear Regression via Implicit Regularization. (arXiv:1903.09367v2 [math.ST] UPDATED)
    Many statistical estimators for high-dimensional linear regression are M-estimators, formed through minimizing a data-dependent square loss function plus a regularizer. This work considers a new class of estimators implicitly defined through a discretized gradient dynamic system under overparameterization. We show that under suitable restricted isometry conditions, overparameterization leads to implicit regularization: if we directly apply gradient descent to the residual sum of squares with sufficiently small initial values, then under some proper early stopping rule, the iterates converge to a nearly sparse rate-optimal solution that improves over explicitly regularized approaches. In particular, the resulting estimator does not suffer from extra bias due to explicit penalties, and can achieve the parametric root-n rate when the signal-to-noise ratio is sufficiently high. We also perform simulations to compare our methods with high dimensional linear regression with explicit regularization. Our results illustrate the advantages of using implicit regularization via gradient descent after overparameterization in sparse vector estimation.  ( 2 min )
    Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits. (arXiv:2202.06151v1 [cs.LG])
    We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is poly$(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\widetilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.  ( 2 min )
    Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences. (arXiv:2202.06694v1 [cs.LG])
    We study the problem of $K$-armed dueling bandit for both stochastic and adversarial environments, where the goal of the learner is to aggregate information through relative preferences of pair of decisions points queried in an online sequential manner. We first propose a novel reduction from any (general) dueling bandits to multi-armed bandits and despite the simplicity, it allows us to improve many existing results in dueling bandits. In particular, \emph{we give the first best-of-both world result for the dueling bandits regret minimization problem} -- a unified framework that is guaranteed to perform optimally for both stochastic and adversarial preferences simultaneously. Moreover, our algorithm is also the first to achieve an optimal $O(\sum_{i = 1}^K \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark, which scales optimally both in terms of the arm-size $K$ and the instance-specific suboptimality gaps $\{\Delta_i\}_{i = 1}^K$. This resolves the long-standing problem of designing an instancewise gap-dependent order optimal regret algorithm for dueling bandits (with matching lower bounds up to small constant factors). We further justify the robustness of our proposed algorithm by proving its optimal regret rate under adversarially corrupted preferences -- this outperforms the existing state-of-the-art corrupted dueling results by a large margin. In summary, we believe our reduction idea will find a broader scope in solving a diverse class of dueling bandits setting, which are otherwise studied separately from multi-armed bandits with often more complex solutions and worse guarantees. The efficacy of our proposed algorithms is empirically corroborated against the existing dueling bandit methods.  ( 2 min )
    Private Adaptive Optimization with Side Information. (arXiv:2202.05963v1 [cs.LG])
    Adaptive optimization methods have become the default solvers for many machine learning tasks. Unfortunately, the benefits of adaptivity may degrade when training with differential privacy, as the noise added to ensure privacy reduces the effectiveness of the adaptive preconditioner. To this end, we propose AdaDPS, a general framework that uses non-sensitive side information to precondition the gradients, allowing the effective use of adaptive methods in private settings. We formally show AdaDPS reduces the amount of noise needed to achieve similar privacy guarantees, thereby improving optimization performance. Empirically, we leverage simple and readily available side information to explore the performance of AdaDPS in practice, comparing to strong baselines in both centralized and federated settings. Our results show that AdaDPS improves accuracy by 7.7% (absolute) on average -- yielding state-of-the-art privacy-utility trade-offs on large-scale text and image benchmarks.  ( 2 min )
    On Pitfalls of Identifiability in Unsupervised Learning. A Note on: "Desiderata for Representation Learning: A Causal Perspective". (arXiv:2202.06844v1 [stat.ML])
    Model identifiability is a desirable property in the context of unsupervised representation learning. In absence thereof, different models may be observationally indistinguishable while yielding representations that are nontrivially related to one another, thus making the recovery of a ground truth generative model fundamentally impossible, as often shown through suitably constructed counterexamples. In this note, we discuss one such construction, illustrating a potential failure case of an identifiability result presented in "Desiderata for Representation Learning: A Causal Perspective" by Wang & Jordan (2021). The construction is based on the theory of nonlinear independent component analysis. We comment on implications of this and other counterexamples for identifiable representation learning.  ( 2 min )
    Adaptive Bandit Convex Optimization with Heterogeneous Curvature. (arXiv:2202.06150v1 [cs.LG])
    We consider the problem of adversarial bandit convex optimization, that is, online learning over a sequence of arbitrary convex loss functions with only one function evaluation for each of them. While all previous works assume known and homogeneous curvature on these loss functions, we study a heterogeneous setting where each function has its own curvature that is only revealed after the learner makes a decision. We develop an efficient algorithm that is able to adapt to the curvature on the fly. Specifically, our algorithm not only recovers or \emph{even improves} existing results for several homogeneous settings, but also leads to surprising results for some heterogeneous settings -- for example, while Hazan and Levy (2014) showed that $\widetilde{O}(d^{3/2}\sqrt{T})$ regret is achievable for a sequence of $T$ smooth and strongly convex $d$-dimensional functions, our algorithm reveals that the same is achievable even if $T^{3/4}$ of them are not strongly convex, and sometimes even if a constant fraction of them are not strongly convex. Our approach is inspired by the framework of Bartlett et al. (2007) who studied a similar heterogeneous setting but with stronger gradient feedback. Extending their framework to the bandit feedback setting requires novel ideas such as lifting the feasible domain and using a logarithmically homogeneous self-concordant barrier regularizer.  ( 2 min )
    Simultaneous Transport Evolution for Minimax Equilibria on Measures. (arXiv:2202.06460v1 [cs.LG])
    Min-max optimization problems arise in several key machine learning setups, including adversarial learning and generative modeling. In their general form, in absence of convexity/concavity assumptions, finding pure equilibria of the underlying two-player zero-sum game is computationally hard [Daskalakis et al., 2021]. In this work we focus instead in finding mixed equilibria, and consider the associated lifted problem in the space of probability measures. By adding entropic regularization, our main result establishes global convergence towards the global equilibrium by using simultaneous gradient ascent-descent with respect to the Wasserstein metric -- a dynamics that admits efficient particle discretization in high-dimensions, as opposed to entropic mirror descent. We complement this positive result with a related entropy-regularized loss which is not bilinear but still convex-concave in the Wasserstein geometry, and for which simultaneous dynamics do not converge yet timescale separation does. Taken together, these results showcase the benign geometry of bilinear games in the space of measures, enabling particle dynamics with global qualitative convergence guarantees.  ( 2 min )
    The Sample Complexity of One-Hidden-Layer Neural Networks. (arXiv:2202.06233v1 [cs.LG])
    We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees (independent of the network width), while a stronger Frobenius norm control is sufficient, extending and improving on previous work. Motivated by the proof constructions, we identify and analyze two important settings where a mere spectral norm control turns out to be sufficient: First, when the network's activation functions are sufficiently smooth (with the result extending to deeper networks); and second, for certain types of convolutional networks. In the latter setting, we study how the sample complexity is additionally affected by parameters such as the amount of overlap between patches and the overall number of patches.  ( 2 min )
    Optimal sizing of a holdout set for safe predictive model updating. (arXiv:2202.06374v1 [stat.ML])
    Risk models are becoming ubiquitous in healthcare and may guide intervention by providing practitioners with insights from patient data. Should a model be updated after a guided intervention, it may lead to its own failure at making predictions. The use of a `holdout set' -- a subset of the population that does not receive interventions guided by the model -- has been proposed to prevent this. Since patients in the holdout set do not benefit from risk predictions, the chosen size must trade off maximising model performance whilst minimising the number of held out patients. By defining a general loss function, we prove the existence and uniqueness of an optimal holdout set size, and introduce parametric and semi-parametric algorithms for its estimation. We demonstrate their use on a recent risk score for pre-eclampsia. Based on these results, we argue that a holdout set is a safe, viable and easily implemented solution to the model update problem.  ( 2 min )
    On the complexity of All $\varepsilon$-Best Arms Identification. (arXiv:2202.06280v1 [stat.ML])
    We consider the problem introduced by \cite{Mason2020} of identifying all the $\varepsilon$-optimal arms in a finite stochastic multi-armed bandit with Gaussian rewards. In the fixed confidence setting, we give a lower bound on the number of samples required by any algorithm that returns the set of $\varepsilon$-good arms with a failure probability less than some risk level $\delta$. This bound writes as $T_{\varepsilon}^*(\mu)\log(1/\delta)$, where $T_{\varepsilon}^*(\mu)$ is a characteristic time that depends on the vector of mean rewards $\mu$ and the accuracy parameter $\varepsilon$. We also provide an efficient numerical method to solve the convex max-min program that defines the characteristic time. Our method is based on a complete characterization of the alternative bandit instances that the optimal sampling strategy needs to rule out, thus making our bound tighter than the one provided by \cite{Mason2020}. Using this method, we propose a Track-and-Stop algorithm that identifies the set of $\varepsilon$-good arms w.h.p and enjoys asymptotic optimality (when $\delta$ goes to zero) in terms of the expected sample complexity. Finally, using numerical simulations, we demonstrate our algorithm's advantage over state-of-the-art methods, even for moderate values of the risk parameter.  ( 2 min )
    Statistical Limits for Testing Correlation of Hypergraphs. (arXiv:2202.05888v1 [math.ST])
    In this paper, we consider the hypothesis testing of correlation between two $m$-uniform hypergraphs on $n$ unlabelled nodes. Under the null hypothesis, the hypergraphs are independent, while under the alternative hypothesis, the hyperdges have the same marginal distributions as in the null hypothesis but are correlated after some unknown node permutation. We focus on two scenarios: the hypergraphs are generated from the Gaussian-Wigner model and the dense Erd\"{o}s-R\'{e}nyi model. We derive the sharp information-theoretic testing threshold. Above the threshold, there exists a powerful test to distinguish the alternative hypothesis from the null hypothesis. Below the threshold, the alternative hypothesis and the null hypothesis are not distinguishable. The threshold involves $m$ and decreases as $m$ gets larger. This indicates testing correlation of hypergraphs ($m\geq3$) becomes easier than testing correlation of graphs ($m=2$)  ( 2 min )
    Improved analysis for a proximal algorithm for sampling. (arXiv:2202.06386v1 [math.ST])
    We study the proximal sampler of Lee, Shen, and Tian (2021) and obtain new convergence guarantees under weaker assumptions than strong log-concavity: namely, our results hold for (1) weakly log-concave targets, and (2) targets satisfying isoperimetric assumptions which allow for non-log-concavity. We demonstrate our results by obtaining new state-of-the-art sampling guarantees for several classes of target distributions. We also strengthen the connection between the proximal sampler and the proximal method in optimization by interpreting the proximal sampler as an entropically regularized Wasserstein proximal method, and the proximal point method as the limit of the proximal sampler with vanishing noise.  ( 2 min )
    Relaxing the Feature Covariance Assumption: Time-Variant Bounds for Benign Overfitting in Linear Regression. (arXiv:2202.06054v1 [cs.LG])
    Benign overfitting demonstrates that overparameterized models can perform well on test data while fitting noisy training data. However, it only considers the final min-norm solution in linear regression, which ignores the algorithm information and the corresponding training procedure. In this paper, we generalize the idea of benign overfitting to the whole training trajectory instead of the min-norm solution and derive a time-variant bound based on the trajectory analysis. Starting from the time-variant bound, we further derive a time interval that suffices to guarantee a consistent generalization error for a given feature covariance. Unlike existing approaches, the newly proposed generalization bound is characterized by a time-variant effective dimension of feature covariance. By introducing the time factor, we relax the strict assumption on the feature covariance matrix required in previous benign overfitting under the regimes of overparameterized linear regression with gradient descent. This paper extends the scope of benign overfitting, and experiment results indicate that the proposed bound accords better with empirical evidence.  ( 2 min )
    State-of-the-Art Review of Design of Experiments for Physics-Informed Deep Learning. (arXiv:2202.06416v1 [stat.ML])
    This paper presents a comprehensive review of the design of experiments used in the surrogate models. In particular, this study demonstrates the necessity of the design of experiment schemes for the Physics-Informed Neural Network (PINN), which belongs to the supervised learning class. Many complex partial differential equations (PDEs) do not have any analytical solution; only numerical methods are used to solve the equations, which is computationally expensive. In recent decades, PINN has gained popularity as a replacement for numerical methods to reduce the computational budget. PINN uses physical information in the form of differential equations to enhance the performance of the neural networks. Though it works efficiently, the choice of the design of experiment scheme is important as the accuracy of the predicted responses using PINN depends on the training data. In this study, five different PDEs are used for numerical purposes, i.e., viscous Burger's equation, Shr\"{o}dinger equation, heat equation, Allen-Cahn equation, and Korteweg-de Vries equation. A comparative study is performed to establish the necessity of the selection of a DoE scheme. It is seen that the Hammersley sampling-based PINN performs better than other DoE sample strategies.  ( 2 min )
    A Group-Equivariant Autoencoder for Identifying Spontaneously Broken Symmetries in the Ising Model. (arXiv:2202.06319v1 [cond-mat.stat-mech])
    We introduce the group-equivariant autoencoder (GE-autoencoder) -- a novel deep neural network method that locates phase boundaries in the Ising universality class by determining which symmetries of the Hamiltonian are broken at each temperature. The encoder network of the GE-autoencoder models the order parameter observable associated with the phase transition. The parameters of the GE-autoencoder are constrained such that the encoder is invariant to the subgroup of symmetries that never break; this results in a dramatic reduction in the number of free parameters such that the GE-autoencoder size is independent of the system size. The loss function of the GE-autoencoder includes regularization terms that enforce equivariance to the remaining quotient group of symmetries. We test the GE-autoencoder method on the 2D classical ferromagnetic and antiferromagnetic Ising models, finding that the GE-autoencoder (1) accurately determines which symmetries are broken at each temperature, and (2) estimates the critical temperature with greater accuracy and time-efficiency than a symmetry-agnostic autoencoder, once finite-size scaling analysis is taken into account.  ( 2 min )
    Black-Box Generalization. (arXiv:2202.06880v1 [cs.LG])
    We provide the first generalization error analysis for black-box learning through derivative-free optimization. Under the assumption of a Lipschitz and smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS) algorithm, that updates a $d$-dimensional model by replacing stochastic gradient directions with stochastic differences of $K+1$ perturbed loss evaluations per dataset (example) query. For both unbounded and bounded possibly nonconvex losses, we present the first generalization bounds for the ZoSS algorithm. These bounds coincide with those for SGD, and rather surprisingly are independent of $d$, $K$ and the batch size $m$, under appropriate choices of a slightly decreased learning rate. For bounded nonconvex losses and a batch size $m=1$, we additionally show that both generalization error and learning rate are independent of $d$ and $K$, and remain essentially the same as for the SGD, even for two function evaluations. Our results extensively extend and consistently recover established results for SGD in prior work, on both generalization bounds and corresponding learning rates. If additionally $m=n$, where $n$ is the dataset size, we derive generalization guarantees for full-batch GD as well.  ( 2 min )
  • Open

    Duck-typing, scope, and investigative functions in Python
    Python is a duck typing language. It means the data types of variables can change as long as the syntax […] The post Duck-typing, scope, and investigative functions in Python appeared first on Machine Learning Mastery.  ( 13 min )
  • Open

    Calling All Data Scientists: Data Observability Needs You
    We live in a complex world that is full of data, and it’s getting even more full every day. In 2020, the world collectively created, captured, copied, and consumed nearly 64.2 zettabytes of data and by 2025 that figure is expected to more than double to 180 zettabytes. Increasingly, companies depend on this data to… Read More »Calling All Data Scientists: Data Observability Needs You The post Calling All Data Scientists: Data Observability Needs You appeared first on Data Science Central.  ( 4 min )

  • Open

    Day 1 of posting a motivational/inspiring image from an AI
    submitted by /u/ScottBakerJr [link] [comments]
    On the Origin of Species (of Machine Learning Models)
    submitted by /u/spincycle27 [link] [comments]
    A new study by researchers at MIT suggests the day may be approaching when advanced artificial intelligence systems could assist anesthesiologists in the operating room.
    submitted by /u/qptbook [link] [comments]  ( 1 min )
    What is DeepCTRL?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Uber Explores Deep Learning To Develop ‘DeepETA’: A Low-Latency Deep Neural Network Architecture For ETA Prediction
    Web mapping services like Google Maps are excellent tools to dynamically navigate portions of the Earth. They are used daily by businesses that rely on accurate mapping features (such as food delivery services to provide accurate and optimized delivery time estimates) and everyday consumers looking for (best) routes between waypoints. Therefore, predicting the expected arrival time (ETA) is vital to allow traffic participants to make better decisions, potentially avoiding congested areas and reducing the overall time spent stuck in traffic. Continue Reading Uber Blog: https://eng.uber.com/deepeta-how-uber-predicts-arrival-times/ https://i.redd.it/i2bloh8k81i81.gif submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Using Mediapipe for custom object detection with landmarks
    Hi, I want to train an object detector which won't only mark an object with a bounding box but with objects specific landmarks. In that way I'm planning to measure objects size in an image by using a reference object(of known size). For this purpose I want to use Mediapipe library. Is it possible to train a shape detector with Mediapipe? Thanks in advance, Best wishes submitted by /u/TaskinBaltaci [link] [comments]  ( 1 min )
    How to use chatbots for your Google Business chat
    Hi, there. As you probably, heard, Google wants to make the most from their Search. I heard that they hope AI can turn Search into a conversation in the future. But for now, we have Google Business chat that could be just automated (with chatbots). My company had several clients with that case, and we prepared a little guide on Google business chat features you may find interesting for you if you're planning to have a chatbot there. Anyway, here's a guide ⬇️ https://botscrew.com/blog/google-chatbot-and-google-business-chat/?utm_source=Reddit&utm_medium=artificial/&utm_campaign=&utm_term=&utm_content= submitted by /u/Avandegraund [link] [comments]  ( 1 min )
    A.I. Will Predict Alzheimer's Years In Advance by 2027
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Represent the above sentences in first order logic.
    All married employees earning Rs.2225,000 or more per year in Nepal pay taxes. All unmarried employees earning Rs 2,00,000 or more per year in Nepal pay taxes. The university professor of Nepal earns Rs.4,00,000 and has to pay 25% taxes. No other employees earns more than the professor in the university. Some of Nepalese citizens earn less than Rs.200 per day and they don’t have to pay any taxes. My try-: https://imgur.com/a/rxii4Vv submitted by /u/kalokeshmarelimai [link] [comments]  ( 1 min )
    AI Equation in Health Care Industry
    The Healthcare industry is the fastest-growing sector as artificial intelligence and machine learning in healthcare brings changes and how we perceive human health. AI-enhanced diagnostics to surgical robotics, neural networks in drug discovery, machine learning in healthcare projects helps in informed decision-making. Pharmaceutical giants and healthcare providers combat human health through various deep learning innovations as AI technology is in the process of evolving across variable touchpoints. One of humanity's greatest challenges today is to provide quality access to rural and low-income settings for simple diagnoses for patients. The understanding of day-to-day patterns of a patient, the benefits of artificial intelligence used in the healthcare eco-system is vast. The future of machine learning in healthcare for new techniques in diagnosis processes is helping AI programs take treatment protocols that help in patient monitoring and care services. With the power of AI embedded in streamlining, new medical drug discovery can significantly bring down the manufacturing costs, etc. submitted by /u/futureanalytica [link] [comments]  ( 1 min )
    Best accessible AI art generators? Particularly for video
    Hey there, I’m looking to “make” some AI art, images are great but I’m also really trying to find more ways to make AI generated video. Text to image if possible but other means are also needed. I am currently using the hell out of Nightcafe and having a blast but I feel like there must be something superior out there, I don’t know any coding or anything like that but have crawled my way through some fairly complex (imo) processes with proper instruction. submitted by /u/Chickenwomp [link] [comments]  ( 1 min )
    Tired Of Traditional Trade Methods? Download These 10 Best Trading Apps Now
    https://preview.redd.it/f7roznewiyh81.jpg?width=778&format=pjpg&auto=webp&s=ff3fd40f461bc9bf0aafd57475bf40c817257f89 Whether you are a first-time investor or an experienced one, trading apps have changed the lives of majority of investors. With plenty of options available in the market, it is difficult to decide which is the best trading app. It is 2021, and trading businesses have escalated to new heights. In the present scenario, one does not have to run several errands in and out of ‘The Phiroze Jeejeebhoy Towers’. Wider availability of convenient smartphones and access to (reduced) mobile data costs have facilitated the adoption of mobile trading apps. While selecting the best trading app for themselves, one needs to consider several factors and services offered by the applications. No wonder the battle of Upstox vs Zerodha tends to continue for being the best trading app in India. However, various other trading apps have started occupying a significant position in the race. Read Full Blog: submitted by /u/businessapac [link] [comments]
    10 Dangers Associated with Artificial Intelligence
    submitted by /u/Philo167 [link] [comments]
    UT Austin Researchers Demonstrate a Deep Learning Technique That Achieves High-Quality Image Reconstructions Based on MRI Datasets
    During a magnetic resonance imaging (MRI) scan, time seems to stand still for many individuals. Those who have experienced one understand the difficulty of remaining immovably still inside a buzzing, banging scanner for periods ranging from a few minutes to more than an hour. Amazon research award winner and UT Austin Researcher, Jonathan (Jon) Tamir, is working on machine learning ways to reduce exam durations and extract more information from this necessary — but frequently unpleasant — imaging procedure. Continue Reading Paper: https://arxiv.org/pdf/2108.01368.pdf Github: https://github.com/utcsilab/csgm-mri-langevin https://preview.redd.it/ti6khvo0zwh81.png?width=1200&format=png&auto=webp&s=8857232459b363438c6162f651df6738481bfe02 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Artificial Nightmares : Orc March to Battle || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Weekly China AI News: AI Companies Flex Muscles at Winter Olympics; 10-Month-Old 3D Digital Creation Startup Raises $50M; Training Vision Transformers with 2040 Images
    submitted by /u/trcytony [link] [comments]  ( 1 min )
    These photos were made in AI using Nvidia Canvas.
    submitted by /u/photogrammetery [link] [comments]
  • Open

    [D] New Research on the ecological impact of computation and the cloud
    https://thereader.mitpress.mit.edu/the-staggering-ecological-impacts-of-computation-and-the-cloud/ This supports many of the points made in this: https://kv-emptypages.blogspot.com/2021/11/the-carbon-footprint-of-machine-learning.html submitted by /u/kirya_V21 [link] [comments]
    Moving beyond engagement: How could Facebook's algorithms optimize for human values instead? [D]
    I used to work at Facebook, YouTube, and Twitter. One of the big questions I focused on: what was the right objective function to align our AI systems towards? When we started optimizing for watch time at YouTube, for example, our algorithms started suggesting longer videos for the sake of longer videos, and videos with racy thumbnails. Similarly, optimizing for engagement at Facebook led to low-quality clickbait, and Hooter's appearing as the top search result when you searched for restaurants in Houston. Experiments that increased favorites and replies at Twitter invariably increased toxic content as well. So while watch time, engagement, and replies would always go up -- were these really the products we wanted to build? What happened to Facebook's original mission of connecting users with their friends and family? What did "favorites" have to do with being the platform for public conversation at Twitter? A lot of work at these companies is spent measuring active users, but where were the dashboards measuring progress to these broader goals? It's easy to become blinded by standard metrics and lose sight of the original product principles that made us stand out. So could we figure out a metric that was better tuned to human values and the product mission we cared about, but also fast, rigorous, and easily measurable? After all, we still need our A/B tests, ML objective functions, and OKRs. This question is particularly important today, with all the troubles that social media platforms face, so I wrote a blog about alternative approaches to social media algorithms. submitted by /u/BB4evaTB12 [link] [comments]  ( 2 min )
    [D] Finding valid generalizations about deep learning models
    Hello everyone, I have a weird question to ask. I have a simple CNN model that takes in 6000 dimensional input and maps it to one of the 10 classes. My aim is to find a way to reduce the accuracy of my trained model as much as I can while introducing minimum amount of noise to the input. I'm obviously using adversarial AI methods. It seems like finding the gradients of class outputs with respect to input instead of the weights is giving me the best results. I update the inputs that belong to a certain class with these gradients that I calculated for every output, and I analyze at what point does my model start classifying input that belong to class A, as class B. For every class, I essentially take inputs that belong to that class, update it with gradients of every other class and see which output class takes the least amount for my model to be classifying the inputs as. Well, this works. I can see my inputs slowly start to get classified as some other class. So this means I have a vague understanding of what class is closest to what other class in input space. Now comes the problem. For every class, I want to create a single gradient that I can apply to all inputs that belong to that class to make my model classify those updated inputs as something else. This means I have to find a way to generalize all these different gradients that I've calculated. How the hell do I do that? Just taking the average of every gradient I calculated gives me a good estimate. But I end up applying that calculated single gradient to every input that belongs to one class. And if I decide to train a new model using the dataset that has noise added, the new trained model gives me even higher accuracy. Sure, I end up lowering the accuracy of my main model, but anyone can just simply train a new model with dataset that has the noise added and classify the inputs with even better accuracies. submitted by /u/egesko [link] [comments]  ( 1 min )
    [D] Where can I find a list of T5 tasks/prefixes?
    Trying to test/see what tasks T5 can do out of the box. Knowing basics for each task prefix + text format is required to properly fit/frame my current problem into some pre-trained task that T5 can reliably handle. So far I haven't been able find a convenient list of examples for each of T5's trained tasks. Tutorials give spatterings of examples, but nothing comprehensive. The paper gives examples without clear formatting info and their github is fairly unhelpful as well. T5 is very powerful and simple to use when there is a clear format to follow submitted by /u/BearThreat [link] [comments]  ( 1 min )
    [D] Should We Be Using JAX in 2022?
    JAX is an extremely cool Google project that's now a few years old - is it time to start migrating to JAX? JAX can be incredibly fast and, while it's a no-brainer for certain things, Machine Learning, and especially Deep Learning, benefit from specialized tools that JAX currently does not replace (and does not seek to replace). I wrote an article detailing why I think you should (or shouldn't) be using JAX in 2022. It also includes an overview of the big JAX transformations, as well as some benchmarks like this: ​ https://preview.redd.it/by6vqtud71i81.png?width=875&format=png&auto=webp&s=e44ef8b5cc1f78a100d403024bcc1345f575c179 What do you think about JAX in 2022? Some points of discussion that you might want to touch on: Will JAX's functional paradigm lead to issues for those without functional experience, especially in Deep Learning? What's your favorite Deep Learning API for JAX - Flax, Haiku, Elegy, something else? Are there cases or fields in which you think JAX should definitely or definitely not be used? Do you think JAX (plus higher-level tools) could replace TensorFlow eventually, or do you think JAX is intended in part as Google's answer to PyTorch dominating the research landscape, or are both of these hypotheticals/theories wrong? Where do you see JAX fitting in generally in the Machine Learning world going forward, and especially as an alternative to PyTorch in particular? Do you think higher-order optimization being easier with JAX will be important in the coming years? I'd love to discuss these points or any others you might have thought of! submitted by /u/SleekEagle [link] [comments]  ( 5 min )
    [P] Make transformer models relation aware (ratransformers, python package)
    https://github.com/JoaoLages/RATransformers I have made a package to be able to use pretrained language models on structured data. By changing self-attention to be relation aware, you are able to pass implicit relations within the input to the model. submitted by /u/JClub [link] [comments]  ( 1 min )
    [D] Podcast: Getting Data Scientists to Write Better Code 🔥 with Laszlo Sragner
    Hey r/ml! I thought people here might enjoy (or possibly have a great discussion about) the latest episode in the MLOps Podcast. In this episode, I'm speaking with Laszlo Sragner about how data scientists can write better code, how it affects real-world ML projects, and how to build an ML team. We also talk about how to break down ML problems into smaller, more manageable tasks, and a bunch of other things. You can watch it here: https://www.youtube.com/watch?v=mtwGV-x3nSM or listen to it here, or read some of the Q&A. Would love to open up a discussion – what are your best practices for improving code-craft in machine learning projects? submitted by /u/PhYsIcS-GUY227 [link] [comments]  ( 1 min )
    [P] Commercial use face recognition model
    I am looking for pretrained model that produces face embeddigs allowing face comparision. I have found such a modela trained on vggface and vggface2 datasets which are strictly not commercial use licensed. Do you know if there is any commercial allowed used datasets and model trained on them? I could find only 10 years old modela, which I will test, but I guess I would rather to have something with as high as possible accuracy while still being free. I do not do unethical project for civil survilence or something. It is about catching really bad people. submitted by /u/BoniekZbigniew [link] [comments]  ( 1 min )
    [R] Detectron2 DA-RetinaNet
    Hello, i am happy to share with you one of my latest work for domain adaptation built on top of Detectron2 object detector model (RetinaNet). Link to the repo DA-RetinaNet: https://github.com/fpv-iplab/DA-RetinaNet submitted by /u/CapitalShake3085 [link] [comments]
    [D] Legality of Commercial Use of Public Video Data?
    There's been a buzz in the policy space about how people's personal data can be used, to an extent that's pretty surprising. For example, Meta is currently being sued by the state of Texas (and settled with Illinois for $650 million dollars) for using user's pictures for training facial recognition algorithms. Anecdotally I've heard large companies have issues with models trained on ImageNet given that the license for each image varies and models could be considered derivative works. Specifically, I'm concerned about MOT style tracking datasets. While MOT16/17/20 are covered by Creative Commons and explicitly can't be used for commercial purposes, based on some quick Googling it seems like making and labelling your own recordings of public spaces to train commercial detectors might be illegal as well; i.e. you need permission of the video's participants to commercialize work containing them. This isn't a legal subreddit, but I'm wondering if anyone here has experience dealing with this. Am I misinterpreting the laws, or do most large companies hire background "actors" to create their own data? submitted by /u/RepsNRobots [link] [comments]  ( 2 min )
    [R] MuZero with Self-competition for Rate Control in VP9 Video Compression
    submitted by /u/hardmaru [link] [comments]
    [D] Seeking research about RL models optimized for changing reward functions
    Hi All, Can you point me to some research that discusses how to address this problem? I want to be able to scale/modulate the relative influence of sub-rewards that comprise the reward function of an RL model. This model would have an interface where a user can turn a knob changing up/down the influence of that given sub-reward. I'm hoping to train the model to perform consistently across arbitrarily scaled sub-rewards. Are there any research papers that explores this topic? Thanks in advance. submitted by /u/jdthrowy [link] [comments]  ( 1 min )
    [D] Details of the First Neural Network (Perceptron)
    Are there any references which discuss the exact experiment that Rosenblatt used when making the first Perceptron? For instance, on an informal level, I have heard that Rosenblatt first used the Perceptron to classify pictures of circles and squares in a supervised setting. However, I have been able to find any references about this. I consulted the following references: https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.335.3398&rep=rep1&type=pdf But I still couldn't find anything. Does anyone have any ideas? Thanks! submitted by /u/jj4646 [link] [comments]  ( 1 min )
    [News] PSA: Now get all official and unofficial code implementations of any AI/ML papers as you're browsing DuckDuckGo, Reddit, Google, Scholar, Arxiv, Twitter and more!
    Now get all official and unofficial code implementations of any AI/ML papers as you're browsing DuckDuckGo, Reddit, Google, Scholar, Arxiv, Twitter and more! (if code not yet available, request it with 1-click as well!) https://chrome.google.com/webstore/detail/aiml-papers-with-code-eve/aikkeehnlfpamidigaffhfmgbkdeheil ​ ​ https://preview.redd.it/zpkx0j1fdwh81.png?width=1265&format=png&auto=webp&s=59f732f9d26223ba20ba4958b389a50617b27f06 https://preview.redd.it/mn656t1fdwh81.png?width=1265&format=png&auto=webp&s=cb6d7d3ddce53cc6c2113e41e80f2d2acd0be964 https://preview.redd.it/0ertru1fdwh81.png?width=1167&format=png&auto=webp&s=a0f5df0905c7dc873ec702c564290cbe9bd7adb7 submitted by /u/arxiv_code_test_b [link] [comments]  ( 1 min )
    [D] Machine learning model decision at uncontrolled traffic junction based on traffic sign board in autonomous driving scenario
    As shown in image below, at uncontrolled junction due to stop sign Car A and Car C will stop and Car B will have the highest priority followed by other car. In this scenario, I wanted to generate kind of priority list based on the visual clues present in the traffic scene. But not sure how to treat this kind of problem. Should it be treated as a classification problem or something else? or I am wondering is this type of problem can be tackled by visual Question answering (VQA), which combine CV and NLP to solve such kind of logical reasoning. Is there research in this direction of generating a priority list of traffic participant over a traffic scene based on the interaction and traffic signs? I am able to find research direction but they are more into trajectory prediction or scene understanding. ​ An uncontrolled traffic junction over which priority should be generated. submitted by /u/human_treadstone [link] [comments]  ( 1 min )
    [D] Is there any training-as-a-service tool?
    What I am thinking is, I'm going to submit the dataset (possibly pre-processed), and select the model from something like a drop-down list. The service will handle everything behind the scenes (hyperparameter selection etc.), and is going to return me the trained model, along with some auxiliary information such as learning curves. Are there such services? submitted by /u/tell-me-the-truth- [link] [comments]
  • Open

    "ML training compute has been doubling every 6 months since 2010"
    submitted by /u/nickb [link] [comments]
    Introduction to Deep learning
    submitted by /u/neuronovesite_cz [link] [comments]
    Classification of iris flowers using machine learning
    This model classifies iris flowers among three species (Setosa, Versicolor or Virginica) based on the length and width measurements of the sepals and petals using Neural Designer ​ https://www.neuraldesigner.com/learning/examples/iris-flowers-classification submitted by /u/datapablo [link] [comments]
    Driving a robot with a neural network - use case study
    submitted by /u/KamilBugnoKrk [link] [comments]
  • Open

    Why GPUs Are Flourishing in the Data Center
    Digital transformation is driving all kinds of changes in enterprises, including the growing use of AI. Though AI and data centers have existed for decades, graphics processing units (GPUs) in data centers are a fairly recent development.  “GPUs have high levels of parallelism and can apply math operations to highly parallel datasets. CPUs can perform… Read More »Why GPUs Are Flourishing in the Data Center The post Why GPUs Are Flourishing in the Data Center appeared first on Data Science Central.  ( 3 min )
    DSC Weekly Newsletter 15 Feb 2022: On the Verge of War
    I write this particular newsletter on Valentine’s Day, and I was planning to talk about love (in particular the love of data). However, real-world events seem to have overtaken this particular holiday. As Russian troops mass along the Ukrainian border, it creates the rather surreal assertion that World War III will begin tomorrow. It’s almost… Read More »DSC Weekly Newsletter 15 Feb 2022: On the Verge of War The post DSC Weekly Newsletter 15 Feb 2022: On the Verge of War appeared first on Data Science Central.  ( 4 min )
  • Open

    Good News About the Carbon Footprint of Machine Learning Training
    Posted by David Patterson, Distinguished Engineer, Google Research, Brain Team Machine learning (ML) has become prominent in information technology, which has led some to raise concerns about the associated rise in the costs of computation, primarily the carbon footprint, i.e., total greenhouse gas emissions. While these assertions rightfully elevated the discussion around carbon emissions in ML, they also highlight the need for accurate data to assess true carbon footprint, which can help identify strategies to mitigate carbon emission in ML. In “The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink”, accepted for publication in IEEE Computer, we focus on operational carbon emissions — i.e., the energy cost of operating ML hardware, including data center overhead…  ( 8 min )
  • Open

    Utilizing RL model for a relaxed environment to train on a more complex one
    Hello :D, I was wondering if there is a way to use a previously trained model to help in learning in a slightly more complex environment. I have an agent(robot/uav) that had learned to navigate in an environment with no obstacles, using PPO. Now I want to increase the complexity of the environment by adding obstacles. I can start a new learning process from scratch, but could I somehow use the previously learned model to help in speeding up the learning process in the new environment? submitted by /u/AhmedNizam_ [link] [comments]  ( 1 min )
    Relevant Twitter Handles to Follow
    I've heard so much about Twitter lately and how great it is for keeping up to date with new topics and open positions/opportunities at labs etc. So I decided to create a Twitter (finally)! Please do let me know useful handles/people to follow (for opportunities mostly, but for staying up to date on RL too). Thanks in advance! submitted by /u/issyonibba [link] [comments]  ( 1 min )
    Could we explain why the agent can learn good policy just by looking at the shape of Q-function in actor-critic algorithms?
    submitted by /u/Rich_Beautiful [link] [comments]  ( 1 min )
    PPO with guided sampling
    Hello :D I'm trying to make a slight modification the my PPO implementation using demonstrations. I have an agent trying to solve a problem, and an expert (from which i can generate demonstrations). What I'm doing is, I inject these demonstrations in my agent's experiences by changing its actions to match those of the demonstrations (while using the log probabilities of my agent). It's basically as if the agent "got lucky" and chose actions, based on its policy, that match the trajectory. I do this at a certain rate, say 25% of the sampling cases. I had high hopes for this to give better performance, since the agent has more exposure to "good" experiences, but it surprisingly deteriorates the performance. Any possible reason? submitted by /u/AhmedNizam_ [link] [comments]  ( 3 min )
    Trying to understand Policy Gradient in the context of autodiff frameworks
    Hi, I'm trying to understand Policy Gradient, particularly in the context of autodiff frameworks like Pytorch. According to Pytorch docs you can implement REINFORCE like this: probs = policy_network(state) m = Categorical(probs) action = m.sample() next_state, reward = env.step(action) loss = -m.log_prob(action) * reward loss.backward() I think this translates to the equation in the first picture. So this is our loss that we want to minimize. Or more specifically as I understand it: we want to find the parameters of our policy network such that the expected loss over actions is the lowest, which should correspond to the expected reward over actions being the highest. https://preview.redd.it/l2guxzs050i81.png?width=585&format=png&auto=webp&s=5187162537f7f388f5d33119a8e530e206f2c7af Now let's consider the simplest possible case: there is no state, the episode length is one (we take only one action) and there are only two actions: 0 with reward -1 and 1 with reward 1. Let's say we want to model the policy with Bernoulli distribution using parameter p. Obviously, the highest expected reward is when p=1. In this case, the equations for expected reward and expected loss are in the second picture. https://preview.redd.it/ix4k5fi150i81.png?width=1024&format=png&auto=webp&s=5e952dd1a340dca598deefe2eae5622308bdaefa Let's plot the equations with p on the x-axis. As expected, the expected reward (green) has its maximum at p=1. However, the expected loss (red) has its minimum at p=0.8386. Why is it different? Apparently, I made a mistake somewhere along the way or my assumptions are wrong, but could you please tell me where exactly am I wrong? https://preview.redd.it/n35i8qc250i81.png?width=838&format=png&auto=webp&s=91efe5e1c5aedeacbf9dc2273aff7278c4160ee5 submitted by /u/MisterHasacz [link] [comments]  ( 2 min )
    "MuZero with Self-competition for Rate Control in VP9 Video Compression", Mandhane et al 2022 {DM}
    submitted by /u/gwern [link] [comments]  ( 1 min )
    The current state of Deep Reinforcement Learning?
    Though I'm more well-versed in deep learning, I recently did a project related to the generation of molecular structures and got into deep reinforcement learning to solve it. I rather enjoyed the process of developing reinforcement learning models and found it to be more engaging than deep learning, so I am now considering taking a step further by reading Sutton and Barto's book. However, I'm aware that there are major issues with RL, one of them being that the models are significantly harder to tune than traditional ML models. My question: is RL a worthy pursuit given the current state of the field, or am I better off sticking with deep learning (i.e GANs or VAEs for generative tasks)? If it is worth pursuing, what are some good resources/papers to look into, apart from Sutton and Barto? submitted by /u/WeeWasabi [link] [comments]  ( 1 min )
  • Open

    Peak Performance: Production Studio Sets the Stage for Virtual Opening Ceremony at European Football Championship
    At the latest UEFA Champions League Finals, one of the world’s most anticipated annual soccer events, pop stars Marshmello, Khalid and Selena Gomez shared the stage for a dazzling opening ceremony at Portugal’s third-largest football stadium — without ever stepping foot in it. The stunning video performance took place in a digital twin of the Read article > The post Peak Performance: Production Studio Sets the Stage for Virtual Opening Ceremony at European Football Championship appeared first on The Official NVIDIA Blog.  ( 4 min )
  • Open

    Rank-based loss for learning hierarchical representations. (arXiv:2110.05941v2 [cs.LG] UPDATED)
    Hierarchical taxonomies are common in many contexts, and they are a very natural structure humans use to organise information. In machine learning, the family of methods that use the 'extra' information is called hierarchical classification. However, applied to audio classification, this remains relatively unexplored. Here we focus on how to integrate the hierarchical information of a problem to learn embeddings representative of the hierarchical relationships. Previously, triplet loss has been proposed to address this problem, however it presents some issues like requiring the careful construction of the triplets, and being limited in the extent of hierarchical information it uses at each iteration. In this work we propose a rank based loss function that uses hierarchical information and translates this into a rank ordering of target distances between the examples. We show that rank based loss is suitable to learn hierarchical representations of the data. By testing on unseen fine level classes we show that this method is also capable of learning hierarchically correct representations of the new classes. Rank based loss has two promising aspects, it is generalisable to hierarchies with any number of levels, and is capable of dealing with data with incomplete hierarchical labels.  ( 2 min )
    Batch Label Inference and Replacement Attacks in Black-Boxed Vertical Federated Learning. (arXiv:2112.05409v2 [cs.LG] UPDATED)
    In a vertical federated learning (VFL) scenario where features and model are split into different parties, communications of sample-specific updates are required for correct gradient calculations but can be used to deduce important sample-level label information. An immediate defense strategy is to protect sample-level messages communicated with Homomorphic Encryption (HE), and in this way only the batch-averaged local gradients are exposed to each party (termed black-boxed VFL). In this paper, we first explore the possibility of recovering labels in the vertical federated learning setting with HE-protected communication, and show that private labels can be reconstructed with high accuracy by training a gradient inversion model. Furthermore, we show that label replacement backdoor attacks can be conducted in black-boxed VFL by directly replacing encrypted communicated messages (termed gradient-replacement attack). As it is a common presumption that batch-averaged information is safe to share, batch label inference and replacement attacks are a severe challenge to VFL. To defend against batch label inference attack, we further evaluate several defense strategies, including confusional autoencoder (CoAE), a technique we proposed based on autoencoder and entropy regularization. We demonstrate that label inference and replacement attacks can be successfully blocked by this technique without hurting as much main task accuracy as compared to existing methods.  ( 2 min )
    Sparse Communication via Mixed Distributions. (arXiv:2108.02658v2 [cs.LG] UPDATED)
    Neural networks and other machine learning models compute continuous representations, while humans communicate mostly through discrete symbols. Reconciling these two forms of communication is desirable for generating human-readable interpretations or learning discrete latent variable models, while maintaining end-to-end differentiability. Some existing approaches (such as the Gumbel-Softmax transformation) build continuous relaxations that are discrete approximations in the zero-temperature limit, while others (such as sparsemax transformations and the Hard Concrete distribution) produce discrete/continuous hybrids. In this paper, we build rigorous theoretical foundations for these hybrids, which we call "mixed random variables." Our starting point is a new "direct sum" base measure defined on the face lattice of the probability simplex. From this measure, we introduce new entropy and Kullback-Leibler divergence functions that subsume the discrete and differential cases and have interpretations in terms of code optimality. Our framework suggests two strategies for representing and sampling mixed random variables, an extrinsic ("sample-and-project") and an intrinsic one (based on face stratification). We experiment with both approaches on an emergent communication benchmark and on modeling MNIST and Fashion-MNIST data with variational auto-encoders with mixed latent variables.  ( 2 min )
    SafePicking: Learning Safe Object Extraction via Object-Level Mapping. (arXiv:2202.05832v1 [cs.RO])
    Robots need object-level scene understanding to manipulate objects while reasoning about contact, support, and occlusion among objects. Given a pile of objects, object recognition and reconstruction can identify the boundary of object instances, giving important cues as to how the objects form and support the pile. In this work, we present a system, SafePicking, that integrates object-level mapping and learning-based motion planning to generate a motion that safely extracts occluded target objects from a pile. Planning is done by learning a deep Q-network that receives observations of predicted poses and a depth-based heightmap to output a motion trajectory, trained to maximize a safety metric reward. Our results show that the observation fusion of poses and depth-sensing gives both better performance and robustness to the model. We evaluate our methods using the YCB objects in both simulation and the real world, achieving safe object extraction from piles.  ( 2 min )
    SoK: Certified Robustness for Deep Neural Networks. (arXiv:2009.04131v6 [cs.LG] UPDATED)
    Great advances in deep neural networks (DNNs) have led to state-of-the-art performance on a wide range of tasks. However, recent studies have shown that DNNs are vulnerable to adversarial attacks, which have brought great concerns when deploying these models to safety-critical applications such as autonomous driving. Different defense approaches have been proposed against adversarial attacks, including: a) empirical defenses, which usually can be adaptively attacked again without providing robustness certification; and b) certifiably robust approaches which consist of robustness verification providing the lower bound of robust accuracy against any attacks under certain conditions and corresponding robust training approaches. In this paper, we systematize the certifiably robust approaches and related practical and theoretical implications and findings. We also provide the first comprehensive benchmark on existing robustness verification and training approaches on different datasets. In particular, we 1) provide a taxonomy for the robustness verification and training approaches, as well as summarize the methodologies for representative algorithms, 2) reveal the characteristics, strengths, limitations, and fundamental connections among these approaches, 3) discuss current research progresses, theoretical barriers, main challenges, and future directions for certifiably robust approaches for DNNs, and 4) provide an open-sourced unified platform to evaluate over 20 representative certifiably robust approaches for a wide range of DNNs.  ( 2 min )
    The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention. (arXiv:2202.05798v1 [cs.LG])
    Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system which stores all training datapoints and the initial weights, and produces outputs using unnormalised dot attention over the entire training experience. While this has been technically known since the '60s, no prior work has effectively studied the operations of NNs in such a form, presumably due to prohibitive time and space complexities and impractical model sizes, all of them growing linearly with the number of training patterns which may get very large. However, this dual formulation offers a possibility of directly visualizing how an NN makes use of training patterns at test time, by examining the corresponding attention weights. We conduct experiments on small scale supervised image classification tasks in single-task, multi-task, and continual learning settings, as well as language modelling, and discuss potentials and limits of this view for better understanding and interpreting how NNs exploit training patterns. Our code is public.
    Matrix-wise $\ell_0$-constrained Sparse Nonnegative Least Squares. (arXiv:2011.11066v3 [cs.LG] UPDATED)
    Nonnegative least squares problems with multiple right-hand sides (MNNLS) arise in models that rely on additive linear combinations. In particular, they are at the core of most nonnegative matrix factorization algorithms and have many applications. The nonnegativity constraint is known to naturally favor sparsity, that is, solutions with few non-zero entries. However, it is often useful to further enhance this sparsity, as it improves the interpretability of the results and helps reducing noise, which leads to the sparse MNNLS problem. In this paper, as opposed to most previous works that enforce sparsity column- or row-wise, we first introduce a novel formulation for sparse MNNLS, with a matrix-wise sparsity constraint. Then, we present a two-step algorithm to tackle this problem. The first step divides sparse MNNLS in subproblems, one per column of the original problem. It then uses different algorithms to produce, either exactly or approximately, a Pareto front for each subproblem, that is, to produce a set of solutions representing different tradeoffs between reconstruction error and sparsity. The second step selects solutions among these Pareto fronts in order to build a sparsity-constrained matrix that minimizes the reconstruction error. We perform experiments on facial and hyperspectral images, and we show that our proposed two-step approach provides more accurate results than state-of-the-art sparse coding heuristics applied both column-wise and globally.
    Rethinking Graph Convolutional Networks in Knowledge Graph Completion. (arXiv:2202.05679v1 [cs.AI])
    Graph convolutional networks (GCNs) -- which are effective in modeling graph structures -- have been increasingly popular in knowledge graph completion (KGC). GCN-based KGC models first use GCNs to generate expressive entity representations and then use knowledge graph embedding (KGE) models to capture the interactions among entities and relations. However, many GCN-based KGC models fail to outperform state-of-the-art KGE models though introducing additional computational complexity. This phenomenon motivates us to explore the real effect of GCNs in KGC. Therefore, in this paper, we build upon representative GCN-based KGC models and introduce variants to find which factor of GCNs is critical in KGC. Surprisingly, we observe from experiments that the graph structure modeling in GCNs does not have a significant impact on the performance of KGC models, which is in contrast to the common belief. Instead, the transformations for entity representations are responsible for the performance improvements. Based on the observation, we propose a simple yet effective framework named LTE-KGE, which equips existing KGE models with linearly transformed entity embeddings. Experiments demonstrate that LTE-KGE models lead to similar performance improvements with GCN-based KGC methods, while being more computationally efficient. These results suggest that existing GCNs are unnecessary for KGC, and novel GCN-based KGC models should count on more ablation studies to validate their effectiveness. The code of all the experiments is available on GitHub at https://github.com/MIRALab-USTC/GCN4KGC.
    Improving Generalization via Uncertainty Driven Perturbations. (arXiv:2202.05737v1 [cs.LG])
    Recently Shah et al., 2020 pointed out the pitfalls of the simplicity bias - the tendency of gradient-based algorithms to learn simple models - which include the model's high sensitivity to small input perturbations, as well as sub-optimal margins. In particular, while Stochastic Gradient Descent yields max-margin boundary on linear models, such guarantee does not extend to non-linear models. To mitigate the simplicity bias, we consider uncertainty-driven perturbations (UDP) of the training data points, obtained iteratively by following the direction that maximizes the model's estimated uncertainty. Unlike loss-driven perturbations, uncertainty-guided perturbations do not cross the decision boundary, allowing for using a larger range of values for the hyperparameter that controls the magnitude of the perturbation. Moreover, as real-world datasets have non-isotropic distances between data points of different classes, the above property is particularly appealing for increasing the margin of the decision boundary, which in turn improves the model's generalization. We show that UDP is guaranteed to achieve the maximum margin decision boundary on linear models and that it notably increases it on challenging simulated datasets. Interestingly, it also achieves competitive loss-based robustness and generalization trade-off on several datasets.
    Semi-Supervised GCN for learning Molecular Structure-Activity Relationships. (arXiv:2202.05704v1 [q-bio.BM])
    Since the introduction of artificial intelligence in medicinal chemistry, the necessity has emerged to analyse how molecular property variation is modulated by either single atoms or chemical groups. In this paper, we propose to train graph-to-graph neural network using semi-supervised learning for attributing structure-property relationships. As initial case studies we apply the method to solubility and molecular acidity while checking its consistency in comparison with known experimental chemical data. As final goal, our approach could represent a valuable tool to deal with problems such as activity cliffs, lead optimization and de-novo drug design.
    End-to-end Algorithm Synthesis with Recurrent Networks: Logical Extrapolation Without Overthinking. (arXiv:2202.05826v1 [cs.LG])
    Machine learning systems perform well on pattern matching tasks, but their ability to perform algorithmic or logical reasoning is not well understood. One important reasoning capability is logical extrapolation, in which models trained only on small/simple reasoning problems can synthesize complex algorithms that scale up to large/complex problems at test time. Logical extrapolation can be achieved through recurrent systems, which can be iterated many times to solve difficult reasoning problems. We observe that this approach fails to scale to highly complex problems because behavior degenerates when many iterations are applied -- an issue we refer to as "overthinking." We propose a recall architecture that keeps an explicit copy of the problem instance in memory so that it cannot be forgotten. We also employ a progressive training routine that prevents the model from learning behaviors that are specific to iteration number and instead pushes it to learn behaviors that can be repeated indefinitely. These innovations prevent the overthinking problem, and enable recurrent systems to solve extremely hard logical extrapolation tasks, some requiring over 100K convolutional layers, without overthinking.
    Post-hoc Interpretability for Neural NLP: A Survey. (arXiv:2108.04840v3 [cs.CL] UPDATED)
    Neural networks for NLP are becoming increasingly complex and widespread, and there is a growing concern if these models are responsible to use. Explaining models helps to address the safety and ethical concerns and is essential for accountability. Interpretability serves to provide these explanations in terms that are understandable to humans. Additionally, post-hoc methods provide explanations after a model is learned and are generally model-agnostic. This survey provides a categorization of how recent post-hoc interpretability methods communicate explanations to humans, it discusses each method in-depth, and how they are validated, as the latter is often a common concern.
    Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features. (arXiv:2112.04424v2 [cs.SD] UPDATED)
    Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representation has been shown to produce useful linguistic units without using transcripts, which can be directly passed to a VC model. In this paper, we showed that high-quality audio samples can be achieved by using a length resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We showed that our method can outperform many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrated that a) using pairs of different audio segments from the same speaker, b) adding a cycle consistency loss, and c) adding a speaker classification loss can help to learn a better speaker embedding. Our model trained on LibriTTS using these techniques achieves the best performance, producing audio samples transferred well to the target speaker's voice, while preserving the linguistic content that is comparable with actual human utterances in terms of Character Error Rate.
    A Modern Self-Referential Weight Matrix That Learns to Modify Itself. (arXiv:2202.05780v1 [cs.LG])
    The weight matrix (WM) of a neural network (NN) is its program. The programs of many traditional NNs are learned through gradient descent in some error function, then remain fixed. The WM of a self-referential NN, however, can keep rapidly modifying all of itself during runtime. In principle, such NNs can meta-learn to learn, and meta-meta-learn to meta-learn to learn, and so on, in the sense of recursive self-improvement. While NN architectures potentially capable of implementing such behavior have been proposed since the '90s, there have been few if any practical studies. Here we revisit such NNs, building upon recent successes of fast weight programmers and closely related linear Transformers. We propose a scalable self-referential WM (SRWM) that uses outer products and the delta update rule to modify itself. We evaluate our SRWM in supervised few-shot learning and in multi-task reinforcement learning with procedurally generated game environments. Our experiments demonstrate both practical applicability and competitive performance of the proposed SRWM. Our code is public.
    Holistic Semi-Supervised Approaches for EEG Representation Learning. (arXiv:2109.11732v2 [cs.LG] UPDATED)
    Recently, supervised methods, which often require substantial amounts of class labels, have achieved promising results for EEG representation learning. However, labeling EEG data is a challenging task. More recently, holistic semi-supervised learning approaches, which only require few output labels, have shown promising results in the field of computer vision. These methods, however, have not yet been adapted for EEG learning. In this paper, we adapt three state-of-the-art holistic semi-supervised approaches, namely MixMatch, FixMatch, and AdaMatch, as well as five classical semi-supervised methods for EEG learning. We perform rigorous experiments with all 8 methods on two public EEG-based emotion recognition datasets, namely SEED and SEED-IV. The experiments with different amounts of limited labeled samples show that the holistic approaches achieve strong results even when only 1 labeled sample is used per class. Further experiments show that in most cases, AdaMatch is the most effective method, followed by MixMatch and FixMatch.
    NAS-Bench-Suite: NAS Evaluation is (Now) Surprisingly Easy. (arXiv:2201.13396v2 [cs.LG] UPDATED)
    The release of tabular benchmarks, such as NAS-Bench-101 and NAS-Bench-201, has significantly lowered the computational overhead for conducting scientific research in neural architecture search (NAS). Although they have been widely adopted and used to tune real-world NAS algorithms, these benchmarks are limited to small search spaces and focus solely on image classification. Recently, several new NAS benchmarks have been introduced that cover significantly larger search spaces over a wide range of tasks, including object detection, speech recognition, and natural language processing. However, substantial differences among these NAS benchmarks have so far prevented their widespread adoption, limiting researchers to using just a few benchmarks. In this work, we present an in-depth analysis of popular NAS algorithms and performance prediction methods across 25 different combinations of search spaces and datasets, finding that many conclusions drawn from a few NAS benchmarks do not generalize to other benchmarks. To help remedy this problem, we introduce NAS-Bench-Suite, a comprehensive and extensible collection of NAS benchmarks, accessible through a unified interface, created with the aim to facilitate reproducible, generalizable, and rapid NAS research. Our code is available at https://github.com/automl/naslib.
    Reducing the Feeder Effect in Public School Admissions: A Bias-aware Analysis for Targeted Interventions. (arXiv:2004.10846v3 [cs.CY] UPDATED)
    Traditionally, New York City's top 8 public schools select candidates solely based on their scores in the Specialized High School Admissions Test (SHSAT). These scores are known to be impacted by socioeconomic status of students and test preparation received in "feeder" middle schools, leading to a massive filtering effect in the education pipeline. For instance, around 50% (resp. 80%) of the students admitted to the top public high schools in New York City come from only 5% (resp. 15%) of the middle schools. The classical mechanisms for assigning students to schools do not naturally address problems like school segregation and class diversity, which have worsened over the years. The scientific community, including policy makers, have reacted by incorporating group-specific quotas, proportionality constraints, or summer school opportunities, but there is evidence that these can end up hurting minorities or even create legal challenges. We take a completely different approach to reduce this filtering effect of feeder middle schools, with the goal of increasing the opportunity of students with high economic needs. We model a two-sided market where a candidate may not be perceived at their true potential, and is therefore assigned to a lower level than the one (s)he deserves. We propose and study the effect of interventions such as additional training and scholarships towards disadvantaged students, and challenge existing mechanisms for scholarships. We validate these findings using SAT scores data from New York City high schools. We further show that our qualitative takeaways remain the same even when some of the modeling assumptions are relaxed.
    PEg TRAnsfer Workflow recognition challenge report: Does multi-modal data improve recognition?. (arXiv:2202.05821v1 [cs.LG])
    This paper presents the design and results of the "PEg TRAnsfert Workflow recognition" (PETRAW) challenge whose objective was to develop surgical workflow recognition methods based on one or several modalities, among video, kinematic, and segmentation data, in order to study their added value. The PETRAW challenge provided a data set of 150 peg transfer sequences performed on a virtual simulator. This data set was composed of videos, kinematics, semantic segmentation, and workflow annotations which described the sequences at three different granularity levels: phase, step, and activity. Five tasks were proposed to the participants: three of them were related to the recognition of all granularities with one of the available modalities, while the others addressed the recognition with a combination of modalities. Average application-dependent balanced accuracy (AD-Accuracy) was used as evaluation metric to take unbalanced classes into account and because it is more clinically relevant than a frame-by-frame score. Seven teams participated in at least one task and four of them in all tasks. Best results are obtained with the use of the video and the kinematics data with an AD-Accuracy between 93% and 90% for the four teams who participated in all tasks. The improvement between video/kinematic-based methods and the uni-modality ones was significant for all of the teams. However, the difference in testing execution time between the video/kinematic-based and the kinematic-based methods has to be taken into consideration. Is it relevant to spend 20 to 200 times more computing time for less than 3% of improvement? The PETRAW data set is publicly available at www.synapse.org/PETRAW to encourage further research in surgical workflow recognition.
    EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. (arXiv:2110.03177v4 [cs.LG] UPDATED)
    In this paper, we propose a novel neural-based exploration strategy in contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of works explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural-based bandit algorithms have been proposed, where a neural network is assigned to learn the underlying reward function and TS or UCB are adapted for exploration. In this paper, we propose "EE-Net", a neural-based bandit approach with a novel exploration strategy. In addition to utilizing a neural network (Exploitation network) to learn the reward function, EE-Net adopts another neural network (Exploration network) to adaptively learn potential gains compared to currently estimated reward for making explorations. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret, which is tighter than existing state-of-the-art neural bandit algorithms ($\mathcal{O}(\sqrt{T}\log T)$ for both UCB-based and TS-based). Through extensive experiments on four real-world datasets, we show that EE-Net outperforms existing linear and neural bandit approaches.
    Pairwise Fairness for Ordinal Regression. (arXiv:2105.03153v2 [stat.ML] UPDATED)
    We initiate the study of fairness for ordinal regression. We adapt two fairness notions previously considered in fair ranking and propose a strategy for training a predictor that is approximately fair according to either notion. Our predictor has the form of a threshold model, composed of a scoring function and a set of thresholds, and our strategy is based on a reduction to fair binary classification for learning the scoring function and local search for choosing the thresholds. We provide generalization guarantees on the error and fairness violation of our predictor, and we illustrate the effectiveness of our approach in extensive experiments.
    Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning. (arXiv:2106.04895v2 [cs.LG] UPDATED)
    Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp offline reduction algorithm -- which simply executes $\mu$ and runs offline policy optimization on the collected dataset -- that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment. This implies that -- perhaps surprisingly -- the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step.
    EchoVPR: Echo State Networks for Visual Place Recognition. (arXiv:2110.05572v3 [cs.CV] UPDATED)
    Recognising previously visited locations is an important, but unsolved, task in autonomous navigation. Current visual place recognition (VPR) benchmarks typically challenge models to recover the position of a query image (or images) from sequential datasets that include both spatial and temporal components. Recently, Echo State Network (ESN) varieties have proven particularly powerful at solving machine learning tasks that require spatio-temporal modelling. These networks are simple, yet powerful neural architectures that--exhibiting memory over multiple time-scales and non-linear high-dimensional representations--can discover temporal relations in the data while still maintaining linearity in the learning time. In this paper, we present a series of ESNs and analyse their applicability to the VPR problem. We report that the addition of ESNs to pre-processed convolutional neural networks led to a dramatic boost in performance in comparison to non-recurrent networks in five out of six standard benchmarks (GardensPoint, SPEDTest, ESSEX3IN1, Oxford RobotCar, and Nordland), demonstrating that ESNs are able to capture the temporal structure inherent in VPR problems. Moreover, we show that models that include ESNs can outperform class-leading VPR models which also exploit the sequential dynamics of the data. Finally, our results demonstrate that ESNs improve generalisation abilities, robustness, and accuracy further supporting their suitability to VPR applications.
    Inference of Multiscale Gaussian Graphical Model. (arXiv:2202.05775v1 [stat.ML])
    Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, psychometry. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of GGM is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method allowing to simultaneously infer a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused type lasso penalty. Results on real and synthetic data are presented.
    TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors. (arXiv:2012.08456v2 [cs.RO] UPDATED)
    Simulators perform an important role in prototyping, debugging, and benchmarking new advances in robotics and learning for control. Although many physics engines exist, some aspects of the real world are harder than others to simulate. One of the aspects that have so far eluded accurate simulation is touch sensing. To address this gap, we present TACTO - a fast, flexible, and open-source simulator for vision-based tactile sensors. This simulator allows to render realistic high-resolution touch readings at hundreds of frames per second, and can be easily configured to simulate different vision-based tactile sensors, including DIGIT and OmniTact. In this paper, we detail the principles that drove the implementation of TACTO and how they are reflected in its architecture. We demonstrate TACTO on a perceptual task, by learning to predict grasp stability using touch from 1 million grasps, and on a marble manipulation control task. Moreover, we provide a proof-of-concept that TACTO can be successfully used for Sim2Real applications. We believe that TACTO is a step towards the widespread adoption of touch sensing in robotic applications, and to enable machine learning practitioners interested in multi-modal learning and control. TACTO is open-source at https://github.com/facebookresearch/tacto.
    Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality. (arXiv:2202.05830v1 [cs.LG])
    Diffusion models have emerged as an expressive family of generative models rivaling GANs in sample quality and autoregressive models in likelihood scores. Standard diffusion models typically require hundreds of forward passes through the model to generate a single high-fidelity sample. We introduce Differentiable Diffusion Sampler Search (DDSS): a method that optimizes fast samplers for any pre-trained diffusion model by differentiating through sample quality scores. We also present Generalized Gaussian Diffusion Models (GGDM), a family of flexible non-Markovian samplers for diffusion models. We show that optimizing the degrees of freedom of GGDM samplers by maximizing sample quality scores via gradient descent leads to improved sample quality. Our optimization procedure backpropagates through the sampling process using the reparametrization trick and gradient rematerialization. DDSS achieves strong results on unconditional image generation across various datasets (e.g., FID scores on LSUN church 128x128 of 11.6 with only 10 inference steps, and 4.82 with 20 steps, compared to 51.1 and 14.9 with strongest DDPM/DDIM baselines). Our method is compatible with any pre-trained diffusion model without fine-tuning or re-training required.
    A Single-Timescale Method for Stochastic Bilevel Optimization. (arXiv:2102.04671v3 [math.OC] UPDATED)
    Stochastic bilevel optimization generalizes the classic stochastic optimization from the minimization of a single objective to the minimization of an objective function that depends the solution of another optimization problem. Recently, stochastic bilevel optimization is regaining popularity in emerging machine learning applications such as hyper-parameter optimization and model-agnostic meta learning. To solve this class of stochastic optimization problems, existing methods require either double-loop or two-timescale updates, which are sometimes less efficient. This paper develops a new optimization method for a class of stochastic bilevel problems that we term Single-Timescale stochAstic BiLevEl optimization (STABLE) method. STABLE runs in a single loop fashion, and uses a single-timescale update with a fixed batch size. To achieve an $\epsilon$-stationary point of the bilevel problem, STABLE requires ${\cal O}(\epsilon^{-2})$ samples in total; and to achieve an $\epsilon$-optimal solution in the strongly convex case, STABLE requires ${\cal O}(\epsilon^{-1})$ samples. To the best of our knowledge, this is the first bilevel optimization algorithm achieving the same order of sample complexity as the stochastic gradient descent method for the single-level stochastic optimization.
    Predicting Out-of-Distribution Error with the Projection Norm. (arXiv:2202.05834v1 [cs.LG])
    We propose a metric -- Projection Norm -- to predict a model's performance on out-of-distribution (OOD) data without access to ground truth labels. Projection Norm first uses model predictions to pseudo-label test samples and then trains a new model on the pseudo-labels. The more the new model's parameters differ from an in-distribution model, the greater the predicted OOD error. Empirically, our approach outperforms existing methods on both image and text classification tasks and across different network architectures. Theoretically, we connect our approach to a bound on the test error for overparameterized linear models. Furthermore, we find that Projection Norm is the only approach that achieves non-trivial detection performance on adversarial examples. Our code is available at https://github.com/yaodongyu/ProjNorm.
    Single Model Deep Learning on Imbalanced Small Datasets for Skin Lesion Classification. (arXiv:2102.01284v2 [cs.CV] UPDATED)
    Deep convolutional neural network (DCNN) models have been widely explored for skin disease diagnosis and some of them have achieved the diagnostic outcomes comparable or even superior to those of dermatologists. However, broad implementation of DCNN in skin disease detection is hindered by small size and data imbalance of the publically accessible skin lesion datasets. This paper proposes a novel single-model based strategy for classification of skin lesions on small and imbalanced datasets. First, various DCNNs are trained on different small and imbalanced datasets to verify that the models with moderate complexity outperform the larger models. Second, regularization DropOut and DropBlock are added to reduce overfitting and a Modified RandAugment augmentation strategy is proposed to deal with the defects of sample underrepresentation in the small dataset. Finally, a novel Multi-Weighted New Loss (MWNL) function and an end-to-end cumulative learning strategy (CLS) are introduced to overcome the challenge of uneven sample size and classification difficulty and to reduce the impact of abnormal samples on training. By combining Modified RandAugment, MWNL and CLS, our single DCNN model method achieved the classification accuracy comparable or superior to those of multiple ensembling models on different dermoscopic image datasets. Our study shows that this method is able to achieve a high classification performance at a low cost of computational resources and inference time, potentially suitable to implement in mobile devices for automated screening of skin lesions and many other malignancies in low resource settings.
    Fixed Support Tree-Sliced Wasserstein Barycenter. (arXiv:2109.03431v2 [cs.AI] UPDATED)
    The Wasserstein barycenter has been widely studied in various fields, including natural language processing, and computer vision. However, it requires a high computational cost to solve the Wasserstein barycenter problem because the computation of the Wasserstein distance requires a quadratic time with respect to the number of supports. By contrast, the Wasserstein distance on a tree, called the tree-Wasserstein distance, can be computed in linear time and allows for the fast comparison of a large number of distributions. In this study, we propose a barycenter under the tree-Wasserstein distance, called the fixed support tree-Wasserstein barycenter (FS-TWB) and its extension, called the fixed support tree-sliced Wasserstein barycenter (FS-TSWB). More specifically, we first show that the FS-TWB and FS-TSWB problems are convex optimization problems and can be solved by using the projected subgradient descent. Moreover, we propose a more efficient algorithm to compute the subgradient and objective function value by using the properties of tree-Wasserstein barycenter problems. Through real-world experiments, we show that, by using the proposed algorithm, the FS-TWB and FS-TSWB can be solved two orders of magnitude faster than the original Wasserstein barycenter.
    Contextual Latent-Movements Off-Policy Optimization for Robotic Manipulation Skills. (arXiv:2010.13766v3 [cs.RO] UPDATED)
    Parameterized movement primitives have been extensively used for imitation learning of robotic tasks. However, the high-dimensionality of the parameter space hinders the improvement of such primitives in the reinforcement learning (RL) setting, especially for learning with physical robots. In this paper we propose a novel view on handling the demonstrated trajectories for acquiring low-dimensional, non-linear latent dynamics, using mixtures of probabilistic principal component analyzers (MPPCA) on the movements' parameter space. Moreover, we introduce a new contextual off-policy RL algorithm, named LAtent-Movements Policy Optimization (LAMPO). LAMPO can provide gradient estimates from previous experience using self-normalized importance sampling, hence, making full use of samples collected in previous learning iterations. These advantages combined provide a complete framework for sample-efficient off-policy optimization of movement primitives for robot learning of high-dimensional manipulation skills. Our experimental results conducted both in simulation and on a real robot show that LAMPO provides sample-efficient policies against common approaches in literature.
    Multi Type Mean Field Reinforcement Learning. (arXiv:2002.02513v5 [cs.MA] UPDATED)
    Mean field theory provides an effective way of scaling multiagent reinforcement learning algorithms to environments with many agents that can be abstracted by a virtual mean agent. In this paper, we extend mean field multiagent algorithms to multiple types. The types enable the relaxation of a core assumption in mean field reinforcement learning, which is that all agents in the environment are playing almost similar strategies and have the same goal. We conduct experiments on three different testbeds for the field of many agent reinforcement learning, based on the standard MAgents framework. We consider two different kinds of mean field environments: a) Games where agents belong to predefined types that are known a priori and b) Games where the type of each agent is unknown and therefore must be learned based on observations. We introduce new algorithms for each type of game and demonstrate their superior performance over state of the art algorithms that assume that all agents belong to the same type and other baseline algorithms in the MAgent framework.
    Novel split quality measures for stratified multilabel Cross Validation with application to large and sparse gene ontology datasets. (arXiv:2109.01425v2 [cs.LG] UPDATED)
    Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show that the most widely used cross validation split quality measures do not behave adequately with multilabel data that has strong class imbalance. We present improved measures and an algorithm, optisplit, for optimising cross validations splits. We present an extensive comparison of various types of cross validation methods in which we show that optisplit produces more even cross validation splits than the existing methods and that it is fast enough to be used on big Gene Ontology (GO) datasets.
    SimGRACE: A Simple Framework for Graph Contrastive Learning without Data Augmentation. (arXiv:2202.03104v2 [cs.LG] UPDATED)
    Graph contrastive learning (GCL) has emerged as a dominant technique for graph representation learning which maximizes the mutual information between paired graph augmentations that share the same semantics. Unfortunately, it is difficult to preserve semantics well during augmentations in view of the diverse nature of graph data. Currently, data augmentations in GCL that are designed to preserve semantics broadly fall into three unsatisfactory ways. First, the augmentations can be manually picked per dataset by trial-and-errors. Second, the augmentations can be selected via cumbersome search. Third, the augmentations can be obtained by introducing expensive domain-specific knowledge as guidance. All of these limit the efficiency and more general applicability of existing GCL methods. To circumvent these crucial issues, we propose a \underline{Sim}ple framework for \underline{GRA}ph \underline{C}ontrastive l\underline{E}arning, \textbf{SimGRACE} for brevity, which does not require data augmentations. Specifically, we take original graph as input and GNN model with its perturbed version as two encoders to obtain two correlated views for contrast. SimGRACE is inspired by the observation that graph data can preserve their semantics well during encoder perturbations while not requiring manual trial-and-errors, cumbersome search or expensive domain knowledge for augmentations selection. Also, we explain why SimGRACE can succeed. Furthermore, we devise adversarial training scheme, dubbed \textbf{AT-SimGRACE}, to enhance the robustness of graph contrastive learning and theoretically explain the reasons. Albeit simple, we show that SimGRACE can yield competitive or better performance compared with state-of-the-art methods in terms of generalizability, transferability and robustness, while enjoying unprecedented degree of flexibility and efficiency.
    Latent Variable Sequential Set Transformers For Joint Multi-Agent Motion Prediction. (arXiv:2104.00563v3 [cs.RO] UPDATED)
    Robust multi-agent trajectory prediction is essential for the safe control of robotic systems. A major challenge is to efficiently learn a representation that approximates the true joint distribution of contextual, social, and temporal information to enable planning. We propose Latent Variable Sequential Set Transformers which are encoder-decoder architectures that generate scene-consistent multi-agent trajectories. We refer to these architectures as "AutoBots". The encoder is a stack of interleaved temporal and social multi-head self-attention (MHSA) modules which alternately perform equivariant processing across the temporal and social dimensions. The decoder employs learnable seed parameters in combination with temporal and social MHSA modules allowing it to perform inference over the entire future scene in a single forward pass efficiently. AutoBots can produce either the trajectory of one ego-agent or a distribution over the future trajectories for all agents in the scene. For the single-agent prediction case, our model achieves top results on the global nuScenes vehicle motion prediction leaderboard, and produces strong results on the Argoverse vehicle prediction challenge. In the multi-agent setting, we evaluate on the synthetic partition of TrajNet++ dataset to showcase the model's socially-consistent predictions. We also demonstrate our model on general sequences of sets and provide illustrative experiments modelling the sequential structure of the multiple strokes that make up symbols in the Omniglot data. A distinguishing feature of AutoBots is that all models are trainable on a single desktop GPU (1080 Ti) in under 48h.
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v4 [cs.LG] UPDATED)
    The use of pessimism, when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in the hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks. (arXiv:2107.07455v3 [cs.LG] UPDATED)
    There has been significant research done on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined developing standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation and robustness has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer significant challenges involving regression and discrete or continuous structured prediction. Thus, given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary. This will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselines. In this work, we propose the Shifts Dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, with each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, "in-the-wild" distributional shifts and pose interesting challenges with respect to uncertainty estimation. In this work we provide a description of the dataset and baseline results for all tasks.
    Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization. (arXiv:2110.10117v2 [cs.LG] UPDATED)
    Entropy regularization is an efficient technique for encouraging exploration and preventing a premature convergence of (vanilla) policy gradient methods in reinforcement learning (RL). However, the theoretical understanding of entropy regularized RL algorithms has been limited. In this paper, we revisit the classical entropy regularized policy gradient methods with the soft-max policy parametrization, whose convergence has so far only been established assuming access to exact gradient oracles. To go beyond this scenario, we propose the first set of (nearly) unbiased stochastic policy gradient estimators with trajectory-level entropy regularization, with one being an unbiased visitation measure-based estimator and the other one being a nearly unbiased yet more practical trajectory-based estimator. We prove that although the estimators themselves are unbounded in general due to the additional logarithmic policy rewards introduced by the entropy term, the variances are uniformly bounded. We then propose a two-phase stochastic policy gradient (PG) algorithm that uses a large batch size in the first phase to overcome the challenge of the stochastic approximation due to the non-coercive landscape, and uses a small batch size in the second phase by leveraging the curvature information around the optimal policy. We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ for the proposed algorithm. Our result is the first global convergence and sample complexity results for the stochastic entropy-regularized vanilla PG method.
    Rate-matching the regret lower-bound in the linear quadratic regulator with unknown dynamics. (arXiv:2202.05799v1 [cs.LG])
    The theory of reinforcement learning currently suffers from a mismatch between its empirical performance and the theoretical characterization of its performance, with consequences for, e.g., the understanding of sample efficiency, safety, and robustness. The linear quadratic regulator with unknown dynamics is a fundamental reinforcement learning setting with significant structure in its dynamics and cost function, yet even in this setting there is a gap between the best known regret lower-bound of $\Omega_p(\sqrt{T})$ and the best known upper-bound of $O_p(\sqrt{T}\,\text{polylog}(T))$. The contribution of this paper is to close that gap by establishing a novel regret upper-bound of $O_p(\sqrt{T})$. Our proof is constructive in that it analyzes the regret of a concrete algorithm, and simultaneously establishes an estimation error bound on the dynamics of $O_p(T^{-1/4})$ which is also the first to match the rate of a known lower-bound. The two keys to our improved proof technique are (1) a more precise upper- and lower-bound on the system Gram matrix and (2) a self-bounding argument for the expected estimation error of the optimal controller.
    A Witness Two-Sample Test. (arXiv:2102.05573v3 [cs.LG] UPDATED)
    The Maximum Mean Discrepancy (MMD) has been the state-of-the-art nonparametric test for tackling the two-sample problem. Its statistic is given by the difference in expectations of the witness function, a real-valued function defined as a weighted sum of kernel evaluations on a set of basis points. Typically the kernel is optimized on a training set, and hypothesis testing is performed on a separate test set to avoid overfitting (i.e., control type-I error). That is, the test set is used to simultaneously estimate the expectations and define the basis points, while the training set only serves to select the kernel and is discarded. In this work, we propose to use the training data to also define the weights and the basis points for better data efficiency. We show that 1) the new test is consistent and has a well-controlled type-I error; 2) the optimal witness function is given by a precision-weighted mean in the reproducing kernel Hilbert space associated with the kernel; and 3) the test power of the proposed test is comparable or exceeds that of the MMD and other modern tests, as verified empirically on challenging synthetic and real problems (e.g., Higgs data).
    Self-Initiated Open World Learning for Autonomous AI Agents. (arXiv:2110.11385v2 [cs.AI] UPDATED)
    As more and more AI agents are used in practice, it is time to think about how to make these agents fully autonomous so that they can learn by themselves in a self-motivated and self-supervised manner rather than being retrained periodically on the initiation of human engineers using expanded training data. As the real-world is an open environment with unknowns or novelties, detecting novelties or unknowns, characterizing them, accommodating or adapting to them, gathering ground-truth training data, and incrementally learning the unknowns/novelties are critical to making the agent more and more knowledgeable and powerful over time. The key challenge is how to automate the process so that it is carried out on the agent's own initiative and through its own interactions with humans and the environment. Since an AI agent usually has a performance task, characterizing each novelty becomes critical and necessary so that the agent can formulate an appropriate response to adapt its behavior to accommodate the novelty and to learn from it to improve the agent's adaptation capability and task performance. The process goes continually without termination. This paper proposes a theoretic framework for this learning paradigm to promote the research of building Self-initiated Open world Learning (SOL) agents. An example SOL agent is also described.
    Towards Adversarially Robust Deepfake Detection: An Ensemble Approach. (arXiv:2202.05687v1 [cs.LG])
    Detecting deepfakes is an important problem, but recent work has shown that DNN-based deepfake detectors are brittle against adversarial deepfakes, in which an adversary adds imperceptible perturbations to a deepfake to evade detection. In this work, we show that a modification to the detection strategy in which we replace a single classifier with a carefully chosen ensemble, in which input transformations for each model in the ensemble induces pairwise orthogonal gradients, can significantly improve robustness beyond the de facto solution of adversarial training. We present theoretical results to show that such orthogonal gradients can help thwart a first-order adversary by reducing the dimensionality of the input subspace in which adversarial deepfakes lie. We validate the results empirically by instantiating and evaluating a randomized version of such "orthogonal" ensembles for adversarial deepfake detection and find that these randomized ensembles exhibit significantly higher robustness as deepfake detectors compared to state-of-the-art deepfake detectors against adversarial deepfakes, even those created using strong PGD-500 attacks.
    Unified Group Fairness on Federated Learning. (arXiv:2111.04986v2 [cs.LG] UPDATED)
    Federated learning (FL) has emerged as an important machine learning paradigm where a global model is trained based on the private data from distributed clients. However, most of existing FL algorithms cannot guarantee the performance fairness towards different groups because of data distribution shift over groups. In this paper, we formulate the problem of unified group fairness on FL, where the groups can be formed by clients (including existing clients and newly added clients) and sensitive attribute(s). To solve this problem, we first propose a general fair federated framework. Then we construct a unified group fairness risk from the view of federated uncertainty set with theoretical analyses to guarantee unified group fairness on FL. We also develop an efficient federated optimization algorithm named Federated Mirror Descent Ascent with Momentum Acceleration (FMDA-M) with convergence guarantee. We validate the advantages of the FMDA-M algorithm with various kinds of distribution shift settings in experiments, and the results show that FMDA-M algorithm outperforms the existing fair FL algorithms on unified group fairness.
    Meta-learning with GANs for anomaly detection, with deployment in high-speed rail inspection system. (arXiv:2202.05795v1 [cs.LG])
    Anomaly detection has been an active research area with a wide range of potential applications. Key challenges for anomaly detection in the AI era with big data include lack of prior knowledge of potential anomaly types, highly complex and noisy background in input data, scarce abnormal samples, and imbalanced training dataset. In this work, we propose a meta-learning framework for anomaly detection to deal with these issues. Within this framework, we incorporate the idea of generative adversarial networks (GANs) with appropriate choices of loss functions including structural similarity index measure (SSIM). Experiments with limited labeled data for high-speed rail inspection demonstrate that our meta-learning framework is sharp and robust in identifying anomalies. Our framework has been deployed in five high-speed railways of China since 2021: it has reduced more than 99.7% workload and saved 96.7% inspection time.
    A Newton-type algorithm for federated learning based on incremental Hessian eigenvector sharing. (arXiv:2202.05800v1 [cs.LG])
    There is a growing interest in the decentralized optimization framework that goes under the name of Federated Learning (FL). In particular, much attention is being turned to FL scenarios where the network is strongly heterogeneous in terms of communication resources (e.g., bandwidth) and data distribution. In these cases, communication between local machines (agents) and the central server (Master) is a main consideration. In this work, we present an original communication-constrained Newton-type (NT) algorithm designed to accelerate FL in such heterogeneous scenarios. The algorithm is by design robust to non i.i.d. data distributions, handles heterogeneity of agents' communication resources (CRs), only requires sporadic Hessian computations, and achieves super-linear convergence. This is possible thanks to an incremental strategy, based on a singular value decomposition (SVD) of the local Hessian matrices, which exploits (possibly) outdated second-order information. The proposed solution is thoroughly validated on real datasets by assessing (i) the number of communication rounds required for convergence, (ii) the overall amount of data transmitted and (iii) the number of local Hessian computations required. For all these metrics, the proposed approach shows superior performance against state-of-the art techniques like GIANT and FedNL.
    Hierarchical Risk Parity and Minimum Variance Portfolio Design on NIFTY 50 Stocks. (arXiv:2202.02728v1 [q-fin.PM] CROSS LISTED)
    Portfolio design and optimization have been always an area of research that has attracted a lot of attention from researchers from the finance domain. Designing an optimum portfolio is a complex task since it involves accurate forecasting of future stock returns and risks and making a suitable tradeoff between them. This paper proposes a systematic approach to designing portfolios using two algorithms, the critical line algorithm, and the hierarchical risk parity algorithm on eight sectors of the Indian stock market. While the portfolios are designed using the stock price data from Jan 1, 2016, to Dec 31, 2020, they are tested on the data from Jan 1, 2021, to Aug 26, 2021. The backtesting results of the portfolios indicate while the performance of the CLA algorithm is superior on the training data, the HRP algorithm has outperformed the CLA algorithm on the test data.
    Distributionally Robust Data Join. (arXiv:2202.05797v1 [cs.LG])
    Suppose we are given two datasets: a labeled dataset and unlabeled dataset which also has additional auxiliary features not present in the first dataset. What is the most principled way to use these datasets together to construct a predictor? The answer should depend upon whether these datasets are generated by the same or different distributions over their mutual feature sets, and how similar the test distribution will be to either of those distributions. In many applications, the two datasets will likely follow different distributions, but both may be close to the test distribution. We introduce the problem of building a predictor which minimizes the maximum loss over all probability distributions over the original features, auxiliary features, and binary labels, whose Wasserstein distance is $r_1$ away from the empirical distribution over the labeled dataset and $r_2$ away from that of the unlabeled dataset. This can be thought of as a generalization of distributionally robust optimization (DRO), which allows for two data sources, one of which is unlabeled and may contain auxiliary features.
    On Generalization of Adversarial Imitation Learning and Beyond. (arXiv:2106.10424v3 [cs.LG] UPDATED)
    Despite massive empirical evaluations, one of the fundamental questions in imitation learning is still not fully settled: does AIL (adversarial imitation learning) provably generalize better than BC (behavioral cloning)? We study this open problem with tabular and episodic MDPs. For vanilla AIL that uses the direct maximum likelihood estimation, we provide both negative and positive answers under the known transition setting. For some MDPs, we show that vanilla AIL has a worse sample complexity than BC. The key insight is that the state-action distribution matching principle is weak so that AIL may generalize poorly even on visited states from the expert demonstrations. For another class of MDPs, vanilla AIL is proved to generalize well even on non-visited states. Interestingly, its sample complexity is horizon-free, which provably beats BC by a wide margin. Finally, we establish a framework in the unknown transition scenario, which allows AIL to explore via reward-free exploration strategies. Compared with the best-known online apprenticeship learning algorithm, the resulting algorithm improves the sample complexity and interaction complexity.
    Meta-Interpretive Learning as Metarule Specialisation. (arXiv:2106.07464v6 [cs.LG] UPDATED)
    In Meta-Interpretive Learning (MIL) the metarules, second-order datalog clauses acting as inductive bias, are manually defined by the user. In this work we show that second-order metarules for MIL can be learned by MIL. We define a generality ordering of metarules by $\theta$-subsumption and show that user-defined \emph{sort metarules} are derivable by specialisation of the most-general \emph{matrix metarules} in a language class; and that these matrix metarules are in turn derivable by specialisation of third-order \emph{punch metarules} with variables quantified over the set of atoms and for which only an upper bound on their number of literals need be user-defined. We show that the cardinality of a metarule language is polynomial in the number of literals in punch metarules. We re-frame MIL as metarule specialisation by resolution. We modify the MIL metarule specialisation operator to return new metarules rather than first-order clauses and prove the correctness of the new operator. We implement the new operator as TOIL, a sub-system of the MIL system Louise. Our experiments show that as user-defined sort metarules are progressively replaced by sort metarules learned by TOIL, Louise's predictive accuracy and training times are maintained. We conclude that automatically derived metarules can replace user-defined metarules.
    Analyzing a Caching Model. (arXiv:2112.06989v2 [cs.LG] UPDATED)
    Machine Learning has been successfully applied in systems applications such as memory prefetching and caching, where learned models have been shown to outperform heuristics. However, the lack of understanding the inner workings of these models -- interpretability -- remains a major obstacle for adoption in real-world deployments. Understanding a model's behavior can help system administrators and developers gain confidence in the model, understand risks, and debug unexpected behavior in production. Interpretability for models used in computer systems poses a particular challenge: Unlike ML models trained on images or text, the input domain (e.g., memory access patterns, program counters) is not immediately interpretable. A major challenge is therefore to explain the model in terms of concepts that are approachable to a human practitioner. By analyzing a state-of-the-art caching model, we provide evidence that the model has learned concepts beyond simple statistics that can be leveraged for explanations. Our work provides a first step towards explanability of system ML models and highlights both promises and challenges of this emerging research area.
    SleepPPG-Net: a deep learning algorithm for robust sleep staging from continuous photoplethysmography. (arXiv:2202.05735v1 [cs.LG])
    Introduction: Sleep staging is an essential component in the diagnosis of sleep disorders and management of sleep health. It is traditionally measured in a clinical setting and requires a labor-intensive labeling process. We hypothesize that it is possible to perform robust 4-class sleep staging using the raw photoplethysmography (PPG) time series and modern advances in deep learning (DL). Methods: We used two publicly available sleep databases that included raw PPG recordings, totalling 2,374 patients and 23,055 hours. We developed SleepPPG-Net, a DL model for 4-class sleep staging from the raw PPG time series. SleepPPG-Net was trained end-to-end and consists of a residual convolutional network for automatic feature extraction and a temporal convolutional network to capture long-range contextual information. We benchmarked the performance of SleepPPG-Net against models based on the best-reported state-of-the-art (SOTA) algorithms. Results: When benchmarked on a held-out test set, SleepPPG-Net obtained a median Cohen's Kappa ($\kappa$) score of 0.75 against 0.69 for the best SOTA approach. SleepPPG-Net showed good generalization performance to an external database, obtaining a $\kappa$ score of 0.74 after transfer learning. Perspective: Overall, SleepPPG-Net provides new SOTA performance. In addition, performance is high enough to open the path to the development of wearables that meet the requirements for usage in clinical applications such as the diagnosis and monitoring of obstructive sleep apnea.
    Bernstein Flows for Flexible Posteriors in Variational Bayes. (arXiv:2202.05650v1 [stat.ML])
    Variational inference (VI) is a technique to approximate difficult to compute posteriors by optimization. In contrast to MCMC, VI scales to many observations. In the case of complex posteriors, however, state-of-the-art VI approaches often yield unsatisfactory posterior approximations. This paper presents Bernstein flow variational inference (BF-VI), a robust and easy-to-use method, flexible enough to approximate complex multivariate posteriors. BF-VI combines ideas from normalizing flows and Bernstein polynomial-based transformation models. In benchmark experiments, we compare BF-VI solutions with exact posteriors, MCMC solutions, and state-of-the-art VI methods including normalizing flow based VI. We show for low-dimensional models that BF-VI accurately approximates the true posterior; in higher-dimensional models, BF-VI outperforms other VI methods. Further, we develop with BF-VI a Bayesian model for the semi-structured Melanoma challenge data, combining a CNN model part for image data with an interpretable model part for tabular data, and demonstrate for the first time how the use of VI in semi-structured models.
    SuperCon: Supervised Contrastive Learning for Imbalanced Skin Lesion Classification. (arXiv:2202.05685v1 [cs.CV])
    Convolutional neural networks (CNNs) have achieved great success in skin lesion classification. A balanced dataset is required to train a good model. However, due to the appearance of different skin lesions in practice, severe or even deadliest skin lesion types (e.g., melanoma) naturally have quite small amount represented in a dataset. In that, classification performance degradation occurs widely, it is significantly important to have CNNs that work well on class imbalanced skin lesion image dataset. In this paper, we propose SuperCon, a two-stage training strategy to overcome the class imbalance problem on skin lesion classification. It contains two stages: (i) representation training that tries to learn a feature representation that closely aligned among intra-classes and distantly apart from inter-classes, and (ii) classifier fine-tuning that aims to learn a classifier that correctly predict the label based on the learnt representations. In the experimental evaluation, extensive comparisons have been made among our approach and other existing approaches on skin lesion benchmark datasets. The results show that our two-stage training strategy effectively addresses the class imbalance classification problem, and significantly improves existing works in terms of F1-score and AUC score, resulting in state-of-the-art performance.
    SVD-Embedded Deep Autoencoder for MIMO Communications. (arXiv:2111.02359v2 [cs.IT] UPDATED)
    Using a deep autoencoder (DAE) for end-to-end communication in multiple-input multiple-output (MIMO) systems is a novel concept with significant potential. DAE-aided MIMO has been shown to outperform singular-value decomposition (SVD)-based precoded MIMO in terms of bit error rate (BER). This paper proposes embedding left- and right-singular vectors of the channel matrix into DAE encoder and decoder to further improve the performance of the MIMO DAE. SVDembedded DAE largely outperforms theoretic linear precoding in terms of BER. This is remarkable since it demonstrates that DAEs have significant potential to exceed the limits of current system design by treating the communication system as a single, end-to-end optimization block. Based on the simulation results, at SNR=10dB, the proposed SVD-embedded design can achieve a BER of about $10^{-5}$ and reduce the BER at least 10 times compared with existing DAE without SVD, and up to 18 times compared with theoretical linear precoding. We attribute this to the fact that the proposed DAE can match the input and output as an adaptive modulation structure with finite alphabet input. We also observe that adding residual connections to the DAE further improves the performance.
    Machine Learning for Stock Prediction Based on Fundamental Analysis. (arXiv:2202.05702v1 [q-fin.ST])
    Application of machine learning for stock prediction is attracting a lot of attention in recent years. A large amount of research has been conducted in this area and multiple existing results have shown that machine learning methods could be successfully used toward stock predicting using stocks historical data. Most of these existing approaches have focused on short term prediction using stocks historical price and technical indicators. In this paper, we prepared 22 years worth of stock quarterly financial data and investigated three machine learning algorithms: Feed-forward Neural Network (FNN), Random Forest (RF) and Adaptive Neural Fuzzy Inference System (ANFIS) for stock prediction based on fundamental analysis. In addition, we applied RF based feature selection and bootstrap aggregation in order to improve model performance and aggregate predictions from different models. Our results show that RF model achieves the best prediction results, and feature selection is able to improve test performance of FNN and ANFIS. Moreover, the aggregated model outperforms all baseline models as well as the benchmark DJIA index by an acceptable margin for the test period. Our findings demonstrate that machine learning models could be used to aid fundamental analysts with decision-making regarding stock investment.
    Domain Adversarial Training: A Game Perspective. (arXiv:2202.05352v1 [cs.LG])
    The dominant line of work in domain adaptation has focused on learning invariant representations using domain-adversarial training. In this paper, we interpret this approach from a game theoretical perspective. Defining optimal solutions in domain-adversarial training as a local Nash equilibrium, we show that gradient descent in domain-adversarial training can violate the asymptotic convergence guarantees of the optimizer, oftentimes hindering the transfer performance. Our analysis leads us to replace gradient descent with high-order ODE solvers (i.e., Runge-Kutta), for which we derive asymptotic convergence guarantees. This family of optimizers is significantly more stable and allows more aggressive learning rates, leading to high performance gains when used as a drop-in replacement over standard optimizers. Our experiments show that in conjunction with state-of-the-art domain-adversarial methods, we achieve up to 3.5% improvement with less than of half training iterations. Our optimizers are easy to implement, free of additional parameters, and can be plugged into any domain-adversarial framework.
    CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP. (arXiv:2110.11316v2 [cs.LG] UPDATED)
    CLIP yielded impressive results on zero-shot transfer learning tasks and is considered as a foundation model like BERT or GPT3. CLIP vision models that have a rich representation are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on the particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from explaining away, that is, it focuses too much on few specific features and/or insufficiently extracts the covariance structure in the data. The former problem of focusing on few features only is caused by a saturation of the InfoNCE objective, which is severe for high mutual information. The latter problem of insufficiently exploiting the covariance structure is caused by a deficiency in extracting feature associations and co-occurrences. We introduce "Contrastive Leave One Out Boost" (CLOOB), which uses the InfoLOOB objective and modern Hopfield networks. In contrast to InfoNCE, the InfoLOOB objective (leave one out bound) does not saturate and works well for high mutual information. Modern Hopfield networks, on the other hand, allow to use retrieved embeddings, which have an enriched covariance structure via co-occurrences of stored features. We compare CLOOB to CLIP after pre-training on the Conceptual Captions and the YFCC dataset with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
    Modeling Reservoir Release Using Pseudo-Prospective Learning and Physical Simulations to Predict Water Temperature. (arXiv:2202.05714v1 [cs.LG])
    This paper proposes a new data-driven method for predicting water temperature in stream networks with reservoirs. The water flows released from reservoirs greatly affect the water temperature of downstream river segments. However, the information of released water flow is often not available for many reservoirs, which makes it difficult for data-driven models to capture the impact to downstream river segments. In this paper, we first build a state-aware graph model to represent the interactions amongst streams and reservoirs, and then propose a parallel learning structure to extract the reservoir release information and use it to improve the prediction. In particular, for reservoirs with no available release information, we mimic the water managers' release decision process through a pseudo-prospective learning method, which infers the release information from anticipated water temperature dynamics. For reservoirs with the release information, we leverage a physics-based model to simulate the water release temperature and transfer such information to guide the learning process for other reservoirs. The evaluation for the Delaware River Basin shows that the proposed method brings over 10\% accuracy improvement over existing data-driven models for stream temperature prediction when the release data is not available for any reservoirs. The performance is further improved after we incorporate the release data and physical simulations for a subset of reservoirs.
    Active Privacy-Utility Trade-off Against Inference in Time-Series Data Sharing. (arXiv:2202.05833v1 [cs.IT])
    Internet of things (IoT) devices, such as smart meters, smart speakers and activity monitors, have become highly popular thanks to the services they offer. However, in addition to their many benefits, they raise privacy concerns since they share fine-grained time-series user data with untrusted third parties. In this work, we consider a user releasing her data containing personal information in return of a service from an honest-but-curious service provider (SP). We model user's personal information as two correlated random variables (r.v.'s), one of them, called the secret variable, is to be kept private, while the other, called the useful variable, is to be disclosed for utility. We consider active sequential data release, where at each time step the user chooses from among a finite set of release mechanisms, each revealing some information about the user's personal information, i.e., the true values of the r.v.'s, albeit with different statistics. The user manages data release in an online fashion such that the maximum amount of information is revealed about the latent useful variable as quickly as possible, while the confidence for the sensitive variable is kept below a predefined level. For privacy measure, we consider both the probability of correctly detecting the true value of the secret and the mutual information (MI) between the secret and the released data. We formulate both problems as partially observable Markov decision processes (POMDPs), and numerically solve them by advantage actor-critic (A2C) deep reinforcement learning (DRL). We evaluate the privacy-utility trade-off (PUT) of the proposed policies on both the synthetic data and smoking activity dataset, and show their validity by testing the activity detection accuracy of the SP modeled by a long short-term memory (LSTM) neural network.
    A Characterization of Semi-Supervised Adversarially-Robust PAC Learnability. (arXiv:2202.05420v1 [cs.LG])
    We study the problem of semi-supervised learning of an adversarially-robust predictor in the PAC model, where the learner has access to both labeled and unlabeled examples. The sample complexity in semi-supervised learning has two parameters, the number of labeled examples and the number of unlabeled examples. We consider the complexity measures, $VC_U \leq dim_U \leq VC$ and $VC^*$, where $VC$ is the standard $VC$-dimension, $VC^*$ is its dual, and the other two measures appeared in Montasser et al. (2019). The best sample bound known for robust supervised PAC learning is $O(VC \cdot VC^*)$, and we will compare our sample bounds to $\Lambda$ which is the minimal number of labeled examples required by any robust supervised PAC learning algorithm. Our main results are the following: (1) in the realizable setting it is sufficient to have $O(VC_U)$ labeled examples and $O(\Lambda)$ unlabeled examples. (2) In the agnostic setting, let $\eta$ be the minimal agnostic error. The sample complexity depends on the resulting error rate. If we allow an error of $2\eta+\epsilon$, it is still sufficient to have $O(VC_U)$ labeled examples and $O(\Lambda)$ unlabeled examples. If we insist on having an error $\eta+\epsilon$ then $\Omega(dim_U)$ labeled examples are necessary, as in the supervised case. The above results show that there is a significant benefit in semi-supervised robust learning, as there are hypothesis classes with $VC_U=0$ and $dim_U$ arbitrary large. In supervised learning, having access only to labeled examples requires at least $\Lambda \geq dim_U$ labeled examples. Semi-supervised require only $O(1)$ labeled examples and $O(\Lambda)$ unlabeled examples. A byproduct of our result is that if we assume that the distribution is robustly realizable by a hypothesis class, then with respect to the 0-1 loss we can learn with only $O(VC_U)$ labeled examples, even if the $VC$ is infinite.
    Audio Defect Detection in Music with Deep Networks. (arXiv:2202.05718v1 [cs.SD])
    With increasing amounts of music being digitally transferred from production to distribution, automatic means of determining media quality are needed. Protection mechanisms in digital audio processing tools have not eliminated the need of production entities located downstream the distribution chain to assess audio quality and detect defects inserted further upstream. Such analysis often relies on the received audio and scarce meta-data alone. Deliberate use of artefacts such as clicks in popular music as well as more recent defects stemming from corruption in modern audio encodings call for data-centric and context sensitive solutions for detection. We present a convolutional network architecture following end-to-end encoder decoder configuration to develop detectors for two exemplary audio defects. A click detector is trained and compared to a traditional signal processing method, with a discussion on context sensitivity. Additional post-processing is used for data augmentation and workflow simulation. The ability of our models to capture variance is explored in a detector for artefacts from decompression of corrupted MP3 compressed audio. For both tasks we describe the synthetic generation of artefacts for controlled detector training and evaluation. We evaluate our detectors on the large open-source Free Music Archive (FMA) and genre-specific datasets.
    Mitigating Bias in Calibration Error Estimation. (arXiv:2012.08668v3 [cs.LG] UPDATED)
    For an AI system to be reliable, the confidence it expresses in its decisions must match its accuracy. To assess the degree of match, examples are typically binned by confidence and the per-bin mean confidence and accuracy are compared. Most research in calibration focuses on techniques to reduce this empirical measure of calibration error, ECE_bin. We instead focus on assessing statistical bias in this empirical measure, and we identify better estimators. We propose a framework through which we can compute the bias of a particular estimator for an evaluation data set of a given size. The framework involves synthesizing model outputs that have the same statistics as common neural architectures on popular data sets. We find that binning-based estimators with bins of equal mass (number of instances) have lower bias than estimators with bins of equal width. Our results indicate two reliable calibration-error estimators: the debiased estimator (Brocker, 2012; Ferro and Fricker, 2012) and a method we propose, ECE_sweep, which uses equal-mass bins and chooses the number of bins to be as large as possible while preserving monotonicity in the calibration function. With these estimators, we observe improvements in the effectiveness of recalibration methods and in the detection of model miscalibration.
    A nonlinear diffusion method for semi-supervised learning on hypergraphs. (arXiv:2103.14867v2 [cs.LG] UPDATED)
    Hypergraphs are a common model for multiway relationships in data, and hypergraph semi-supervised learning is the problem of assigning labels to all nodes in a hypergraph, given labels on just a few nodes. Diffusions and label spreading are classical techniques for semi-supervised learning in the graph setting, and there are some standard ways to extend them to hypergraphs. However, these methods are linear models, and do not offer an obvious way of incorporating node features for making predictions. Here, we develop a nonlinear diffusion process on hypergraphs that spreads both features and labels following the hypergraph structure, which can be interpreted as a hypergraph equilibrium network. Even though the process is nonlinear, we show global convergence to a unique limiting point for a broad class of nonlinearities, which is the global optimum of a interpretable, regularized semi-supervised learning loss function. The limiting point serves as a node embedding from which we make predictions with a linear model. Our approach is much more accurate than several hypergraph neural networks, and also takes less time to train.
    Confidence-Aware Multi-Teacher Knowledge Distillation. (arXiv:2201.00007v2 [cs.LG] UPDATED)
    Knowledge distillation is initially introduced to utilize additional supervision from a single teacher model for the student model training. To boost the student performance, some recent variants attempt to exploit diverse knowledge sources from multiple teachers. However, existing studies mainly integrate knowledge from diverse sources by averaging over multiple teacher predictions or combining them using other various label-free strategies, which may mislead student in the presence of low-quality teacher predictions. To tackle this problem, we propose Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD), which adaptively assigns sample-wise reliability for each teacher prediction with the help of ground-truth labels, with those teacher predictions close to one-hot labels assigned large weights. Besides, CA-MKD incorporates intermediate layers to stable the knowledge transfer process. Extensive experiments show that our CA-MKD consistently outperforms all compared state-of-the-art methods across various teacher-student architectures.
    Learning via nonlinear conjugate gradients and depth-varying neural ODEs. (arXiv:2202.05766v1 [cs.LG])
    The inverse problem of supervised reconstruction of depth-variable (time-dependent) parameters in a neural ordinary differential equation (NODE) is considered, that means finding the weights of a residual network with time continuous layers. The NODE is treated as an isolated entity describing the full network as opposed to earlier research, which embedded it between pre- and post-appended layers trained by conventional methods. The proposed parameter reconstruction is done for a general first order differential equation by minimizing a cost functional covering a variety of loss functions and penalty terms. A nonlinear conjugate gradient method (NCG) is derived for the minimization. Mathematical properties are stated for the differential equation and the cost functional. The adjoint problem needed is derived together with a sensitivity problem. The sensitivity problem can estimate changes in the network output under perturbation of the trained parameters. To preserve smoothness during the iterations the Sobolev gradient is calculated and incorporated. As a proof-of-concept, numerical results are included for a NODE and two synthetic datasets, and compared with standard gradient approaches (not based on NODEs). The results show that the proposed method works well for deep learning with infinite numbers of layers, and has built-in stability and smoothness.
    Bounded nonlinear forecasts of partially observed geophysical systems with physics-constrained deep learning. (arXiv:2202.05750v1 [stat.ML])
    The complexity of real-world geophysical systems is often compounded by the fact that the observed measurements depend on hidden variables. These latent variables include unresolved small scales and/or rapidly evolving processes, partially observed couplings, or forcings in coupled systems. This is the case in ocean-atmosphere dynamics, for which unknown interior dynamics can affect surface observations. The identification of computationally-relevant representations of such partially-observed and highly nonlinear systems is thus challenging and often limited to short-term forecast applications. Here, we investigate the physics-constrained learning of implicit dynamical embeddings, leveraging neural ordinary differential equation (NODE) representations. A key objective is to constrain their boundedness, which promotes the generalization of the learned dynamics to arbitrary initial condition. The proposed architecture is implemented within a deep learning framework, and its relevance is demonstrated with respect to state-of-the-art schemes for different case-studies representative of geophysical dynamics.
    Molecule Generation from Input-Attributions over Graph Convolutional Networks. (arXiv:2202.05703v1 [q-bio.BM])
    It is well known that Drug Design is often a costly process both in terms of time and economic effort. While good Quantitative Structure-Activity Relationship models (QSAR) can help predicting molecular properties without the need to synthesize them, it is still required to come up with new molecules to be tested. This is mostly done in lack of tools to determine which modifications are more promising or which aspects of a molecule are more influential for the final activity/property. Here we present an automatic process which involves Graph Convolutional Network models and input-attribution methods to generate new molecules. We also explore the problems of over-optimization and applicability, recognizing them as two important aspects in the practical use of such automatic tools.
    Artificial Neural Network Modeling for Airline Disruption Management. (arXiv:2104.02032v3 [cs.AI] UPDATED)
    Since the 1970s, most airlines have incorporated computerized support for managing disruptions during flight schedule execution. However, existing platforms for airline disruption management (ADM) employ monolithic system design methods that rely on the creation of specific rules and requirements through explicit optimization routines, before a system that meets the specifications is designed. Thus, current platforms for ADM are unable to readily accommodate additional system complexities resulting from the introduction of new capabilities, such as the introduction of unmanned aerial systems (UAS), operations and infrastructure, to the system. To this end, we use historical data on airline scheduling and operations recovery to develop a system of artificial neural networks (ANNs), which describe a predictive transfer function model (PTFM) for promptly estimating the recovery impact of disruption resolutions at separate phases of flight schedule execution during ADM. Furthermore, we provide a modular approach for assessing and executing the PTFM by employing a parallel ensemble method to develop generative routines that amalgamate the system of ANNs. Our modular approach ensures that current industry standards for tardiness in flight schedule execution during ADM are satisfied, while accurately estimating appropriate time-based performance metrics for the separate phases of flight schedule execution.
    Deep artificial neural network for prediction of atrial fibrillation through the analysis of 12-leads standard ECG. (arXiv:2202.05676v1 [eess.SP])
    Atrial Fibrillation (AF) is a heart's arrhythmia which, despite being often asymptomatic, represents an important risk factor for stroke, therefore being able to predict AF at the electrocardiogram exam, would be of great impact on actively targeting patients at high risk. In the present work we use Convolution Neural Networks to analyze ECG and predict Atrial Fibrillation starting from realistic datasets, i.e. considering fewer ECG than other studies and extending the maximal distance between ECG and AF diagnosis. We achieved 75.5% (0.75) AUC firstly increasing our dataset size by a shifting technique and secondarily using the dilation parameter of the convolution neural network. In addition we find that, contrarily to what is commonly used by clinicians reporting AF at the exam, the most informative leads for the task of predicting AF are D1 and avR. Similarly, we find that the most important frequencies to check are in the range of 5-20 Hz. Finally, we develop a net able to manage at the same time the electrocardiographic signal together with the electronic health record, showing that integration between different sources of data is a profitable path. In fact, the 2.8% gain of such net brings us to a 78.6% (std 0.77) AUC. In future works we will deepen both the integration of sources and the reason why we claim avR is the most informative lead.
    Automated Architecture Search for Brain-inspired Hyperdimensional Computing. (arXiv:2202.05827v1 [cs.LG])
    This paper represents the first effort to explore an automated architecture search for hyperdimensional computing (HDC), a type of brain-inspired neural network. Currently, HDC design is largely carried out in an application-specific ad-hoc manner, which significantly limits its application. Furthermore, the approach leads to inferior accuracy and efficiency, which suggests that HDC cannot perform competitively against deep neural networks. Herein, we present a thorough study to formulate an HDC architecture search space. On top of the search space, we apply reinforcement-learning to automatically explore the HDC architectures. The searched HDC architectures show competitive performance on case studies involving a drug discovery dataset and a language recognition task. On the Clintox dataset, which tries to learn features from developed drugs that passed/failed clinical trials for toxicity reasons, the searched HDC architecture obtains the state-of-the-art ROC-AUC scores, which are 0.80% higher than the manually designed HDC and 9.75% higher than conventional neural networks. Similar results are achieved on the language recognition task, with 1.27% higher performance than conventional methods.
    A PDE-Based Analysis of the Symmetric Two-Armed Bernoulli Bandit. (arXiv:2202.05767v1 [cs.LG])
    This work addresses a version of the two-armed Bernoulli bandit problem where the sum of the means of the arms is one (the symmetric two-armed Bernoulli bandit). In a regime where the gap between these means goes to zero and the number of prediction periods approaches infinity, we obtain the leading order terms of the expected regret and pseudoregret for this problem by associating each of them with a solution of a linear parabolic partial differential equation. Our results improve upon the previously known results; specifically we explicitly compute the leading order term of the optimal regret and pseudoregret in three different scaling regimes for the gap. Additionally, we obtain new non-asymptotic bounds for any given time horizon.
    DECORE: Deep Compression with Reinforcement Learning. (arXiv:2106.06091v2 [cs.LG] UPDATED)
    Deep learning has become an increasingly popular and powerful methodology for modern pattern recognition systems. However, many deep neural networks have millions or billions of parameters, making them untenable for real-world applications due to constraints on memory size or latency requirements. As a result, efficient network compression techniques are often required for the widespread adoption of deep learning methods. We present DECORE, a reinforcement learning-based approach to automate the network compression process. DECORE assigns an agent to each channel in the network along with a light policy gradient method to learn which neurons or channels to be kept or removed. Each agent in the network has just one parameter (keep or drop) to learn, which leads to a much faster training process compared to existing approaches. DECORE provides state-of-the-art compression results on various network architectures and various datasets. For example, on the ResNet-110 architecture, it achieves a 64.8% compression and 61.8% FLOPs reduction as compared to the baseline model without any accuracy loss on the CIFAR-10 dataset. It can reduce the size of regular architectures like the VGG network by up to 99% with just a small accuracy drop of 2.28%. For a larger dataset like ImageNet with just 30 epochs of training, it can compress the ResNet-50 architecture by 44.7% and reduce FLOPs by 42.3%, with just a 0.69% drop on Top-5 accuracy of the uncompressed model. We also demonstrate that DECORE can be used to search for compressed network architectures based on various constraints, such as memory and FLOPs.
    The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance. (arXiv:2202.05791v1 [stat.ML])
    We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.
    Using Random Perturbations to Mitigate Adversarial Attacks on Sentiment Analysis Models. (arXiv:2202.05758v1 [cs.CL])
    Attacks on deep learning models are often difficult to identify and therefore are difficult to protect against. This problem is exacerbated by the use of public datasets that typically are not manually inspected before use. In this paper, we offer a solution to this vulnerability by using, during testing, random perturbations such as spelling correction if necessary, substitution by random synonym, or simply dropping the word. These perturbations are applied to random words in random sentences to defend NLP models against adversarial attacks. Our Random Perturbations Defense and Increased Randomness Defense methods are successful in returning attacked models to similar accuracy of models before attacks. The original accuracy of the model used in this work is 80% for sentiment classification. After undergoing attacks, the accuracy drops to accuracy between 0% and 44%. After applying our defense methods, the accuracy of the model is returned to the original accuracy within statistical significance.
    Handling Noisy Labels via One-Step Abductive Multi-Target Learning: An Application to Helicobacter Pylori Segmentation. (arXiv:2011.14956v3 [cs.LG] UPDATED)
    Learning from noisy labels is an important concern because of the lack of accurate ground-truth labels in plenty of real-world scenarios. In practice, various approaches for this concern first make some corrections corresponding to potentially noisy-labeled instances, and then update predictive model with information of the made corrections. However, in specific areas, such as medical histopathology whole slide image analysis (MHWSIA), it is often difficult or even impossible for experts to manually achieve the noisy-free ground-truth labels which leads to labels with complex noise. This situation raises two more difficult problems: 1) the methodology of approaches making corrections corresponding to potentially noisy-labeled instances has limitations due to the complex noise existing in labels; and 2) the appropriate evaluation strategy for validation/testing is unclear because of the great difficulty in collecting the noisy-free ground-truth labels. In this paper, we focus on alleviating these two problems. For the problem 1), we present one-step abductive multi-target learning (OSAMTL) that imposes a one-step logical reasoning upon machine learning via a multi-target learning procedure to constrain the predictions of the learning model to be subject to our prior knowledge about the true target. For the problem 2), we propose a logical assessment formula (LAF) that evaluates the logical rationality of the outputs of an approach by estimating the consistencies between the predictions of the learning model and the logical facts narrated from the results of the one-step logical reasoning of OSAMTL. Applying OSAMTL and LAF to the Helicobacter pylori (H. pylori) segmentation task in MHWSIA, we show that OSAMTL is able to enable the machine learning model achieving logically more rational predictions, which is beyond various state-of-the-art approaches in handling complex noisy labels.
    Positive-Unlabeled Domain Adaptation. (arXiv:2202.05695v1 [cs.LG])
    Domain Adaptation methodologies have shown to effectively generalize from a labeled source domain to a label scarce target domain. Previous research has either focused on unlabeled domain adaptation without any target supervision or semi-supervised domain adaptation with few labeled target examples per class. On the other hand Positive-Unlabeled (PU-) Learning has attracted increasing interest in the weakly supervised learning literature since in quite some real world applications positive labels are much easier to obtain than negative ones. In this work we are the first to introduce the challenge of Positive-Unlabeled Domain Adaptation where we aim to generalise from a fully labeled source domain to a target domain where only positive and unlabeled data is available. We present a novel two-step learning approach to this problem by firstly identifying reliable positive and negative pseudo-labels in the target domain guided by source domain labels and a positive-unlabeled risk estimator. This enables us to use a standard classifier on the target domain in a second step. We validate our approach by running experiments on benchmark datasets for visual object recognition. Furthermore we propose real world examples for our setting and validate our superior performance on parking occupancy data.
    Recovering Stochastic Dynamics via Gaussian Schr\"odinger Bridges. (arXiv:2202.05722v1 [cs.LG])
    We propose a new framework to reconstruct a stochastic process $\left\{\mathbb{P}_{t}: t \in[0, T]\right\}$ using only samples from its marginal distributions, observed at start and end times $0$ and $T$. This reconstruction is useful to infer population dynamics, a crucial challenge, e.g., when modeling the time-evolution of cell populations from single-cell sequencing data. Our general framework encompasses the more specific Schr\"odinger bridge (SB) problem, where $\mathbb{P}_{t}$ represents the evolution of a thermodynamic system at almost equilibrium. Estimating such bridges is notoriously difficult, motivating our proposal for a novel adaptive scheme called the GSBflow. Our goal is to rely on Gaussian approximations of the data to provide the reference stochastic process needed to estimate SB. To that end, we solve the \acs{SB} problem with Gaussian marginals, for which we provide, as a central contribution, a closed-form solution and SDE-representation. We use these formulas to define the reference process used to estimate more complex SBs, and show that this does indeed help with its numerical solution. We obtain notable improvements when reconstructing both synthetic processes and single-cell genomics experiments.
    Cross Domain Few-Shot Learning via Meta Adversarial Training. (arXiv:2202.05713v1 [cs.LG])
    Few-shot relation classification (RC) is one of the critical problems in machine learning. Current research merely focuses on the set-ups that both training and testing are from the same domain. However, in practice, this assumption is not always guaranteed. In this study, we present a novel model that takes into consideration the afore-mentioned cross-domain situation. Not like previous models, we only use the source domain data to train the prototypical networks and test the model on target domain data. A meta-based adversarial training framework (\textbf{MBATF}) is proposed to fine-tune the trained networks for adapting to data from the target domain. Empirical studies confirm the effectiveness of the proposed model.
    A Novel Speech Intelligibility Enhancement Model based on CanonicalCorrelation and Deep Learning. (arXiv:2202.05756v1 [cs.SD])
    Current deep learning (DL) based approaches to speech intelligibility enhancement in noisy environments are often trained to minimise the feature distance between noise-free speech and enhanced speech signals. Despite improving the speech quality, such approaches do not deliver required levels of speech intelligibility in everyday noisy environments . Intelligibility-oriented (I-O) loss functions have recently been developed to train DL approaches for robust speech enhancement. Here, we formulate, for the first time, a novel canonical correlation based I-O loss function to more effectively train DL algorithms. Specifically, we present a canonical-correlation based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model. We carry out comparative simulation experiments to show that our CC-STOI based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions, using objective and subjective evaluation measures for case of both unseen speakers and noises. Ongoing future work is evaluating the proposed approach for design of robust hearing-assistive technology.
    Desiderata for Representation Learning: A Causal Perspective. (arXiv:2109.03795v2 [stat.ML] UPDATED)
    Representation learning constructs low-dimensional representations to summarize essential features of high-dimensional data. This learning problem is often approached by describing various desiderata associated with learned representations; e.g., that they be non-spurious, efficient, or disentangled. It can be challenging, however, to turn these intuitive desiderata into formal criteria that can be measured and enhanced based on observed data. In this paper, we take a causal perspective on representation learning, formalizing non-spuriousness and efficiency (in supervised representation learning) and disentanglement (in unsupervised representation learning) using counterfactual quantities and observable consequences of causal assertions. This yields computable metrics that can be used to assess the degree to which representations satisfy the desiderata of interest and learn non-spurious and disentangled representations from single observational datasets.  ( 2 min )
    Predictive modeling of microbiological seawater quality classification in karst region using cascade model. (arXiv:2202.05664v1 [cs.LG])
    In this paper, an in-depth analysis of Escherichia coli seawater measurements during the bathing season in the city of Rijeka, Croatia was conducted. Submerged sources of groundwater were observed at several measurement locations which could be the cause for increased E. coli values. This specificity of karst terrain is usually not considered during the monitoring process, thus a novel measurement methodology is proposed. A cascade machine learning model is used to predict coastal water quality based on meteorological data, which improves the level of accuracy due to data imbalance resulting from rare occurrences of measurements with reduced water quality. Currently, the cascade model is employed as a filter method, where measurements not classified as excellent quality need to be further analyzed. However, with improvements proposed in the paper, the cascade model could be ultimately used as a standalone method.
    Explainable Machine Learning for Breakdown Prediction in High Gradient RF Cavities. (arXiv:2202.05610v1 [physics.acc-ph])
    Radio Frequency (RF) breakdowns are one of the most prevalent limiting factors in RF cavities for particle accelerators. During a breakdown, field enhancement associated with small deformations on the cavity surface results in electrical arcs. Such arcs lead to beam aborts, reduce machine availability and can cause irreparable damage on the RF cavity surface. In this paper, we propose a machine learning strategy to discover breakdown precursors in CERN's Compact Linear Collider (CLIC) accelerating structures. By interpreting the parameters of the learned models with explainable Artificial Intelligence (AI), we reverse-engineer physical properties for deriving fast, reliable, and simple rule based models. Based on 6 months of historical data and dedicated experiments, our models show fractions of data with high influence on the occurrence of breakdowns. Specifically, it is shown that in many cases a rise of the vacuum pressure is observed before a breakdown is detected with the current interlock sensors.
    Choices, Risks, and Reward Reports: Charting Public Policy for Reinforcement Learning Systems. (arXiv:2202.05716v1 [cs.LG])
    In the long term, reinforcement learning (RL) is considered by many AI theorists to be the most promising path to artificial general intelligence. This places RL practitioners in a position to design systems that have never existed before and lack prior documentation in law and policy. Public agencies could intervene on complex dynamics that were previously too opaque to deliberate about, and long-held policy ambitions would finally be made tractable. In this whitepaper we illustrate this potential and how it might be technically enacted in the domains of energy infrastructure, social media recommender systems, and transportation. Alongside these unprecedented interventions come new forms of risk that exacerbate the harms already generated by standard machine learning tools. We correspondingly present a new typology of risks arising from RL design choices, falling under four categories: scoping the horizon, defining rewards, pruning information, and training multiple agents. Rather than allowing RL systems to unilaterally reshape human domains, policymakers need new mechanisms for the rule of reason, foreseeability, and interoperability that match the risks these systems pose. We argue that criteria for these choices may be drawn from emerging subfields within antitrust, tort, and administrative law. It will then be possible for courts, federal and state agencies, and non-governmental organizations to play more active roles in RL specification and evaluation. Building on the "model cards" and "datasheets" frameworks proposed by Mitchell et al. and Gebru et al., we argue the need for Reward Reports for AI systems. Reward Reports are living documents for proposed RL deployments that demarcate design choices.
    Discovering Quantum Phase Transitions with Fermionic Neural Networks. (arXiv:2202.05183v1 [physics.comp-ph] CROSS LISTED)
    Deep neural networks have been extremely successful as highly accurate wave function ans\"atze for variational Monte Carlo calculations of molecular ground states. We present an extension of one such ansatz, FermiNet, to calculations of the ground states of periodic Hamiltonians, and study the homogeneous electron gas. FermiNet calculations of the ground-state energies of small electron gas systems are in excellent agreement with previous initiator full configuration interaction quantum Monte Carlo and diffusion Monte Carlo calculations. We investigate the spin-polarized homogeneous electron gas and demonstrate that the same neural network architecture is capable of accurately representing both the delocalized Fermi liquid state and the localized Wigner crystal state. The network is given no \emph{a priori} knowledge that a phase transition exists, but converges on the translationally invariant ground state at high density and spontaneously breaks the symmetry to produce the crystalline ground state at low density.  ( 2 min )
    ProMP: Proximal Meta-Policy Search. (arXiv:1810.06784v4 [cs.LG] UPDATED)
    Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm endows efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.  ( 2 min )
    InterpretTime: a new approach for the systematic evaluation of neural-network interpretability in time series classification. (arXiv:2202.05656v1 [cs.LG])
    We present a novel approach to evaluate the performance of interpretability methods for time series classification, and propose a new strategy to assess the similarity between domain experts and machine data interpretation. The novel approach leverages a new family of synthetic datasets and introduces new interpretability evaluation metrics. The approach addresses several common issues encountered in the literature, and clearly depicts how well an interpretability method is capturing neural network's data usage, providing a systematic interpretability evaluation framework. The new methodology highlights the superiority of Shapley Value Sampling and Integrated Gradients for interpretability in time-series classification tasks.  ( 2 min )
    Model Selection for Time Series Forecasting: Empirical Analysis of Different Estimators. (arXiv:2104.00584v2 [stat.ML] UPDATED)
    Evaluating predictive models is a crucial task in predictive analytics. This process is especially challenging with time series data where the observations show temporal dependencies. Several studies have analysed how different performance estimation methods compare with each other for approximating the true loss incurred by a given forecasting model. However, these studies do not address how the estimators behave for model selection: the ability to select the best solution among a set of alternatives. We address this issue and compare a set of estimation methods for model selection in time series forecasting tasks. We attempt to answer two main questions: (i) how often is the best possible model selected by the estimators; and (ii) what is the performance loss when it does not. We empirically found that the accuracy of the estimators for selecting the best solution is low, and the overall forecasting performance loss associated with the model selection process ranges from 1.2% to 2.3%. We also discovered that some factors, such as the sample size, are important in the relative performance of the estimators.  ( 2 min )
    On the Detection of Adaptive Adversarial Attacks in Speaker Verification Systems. (arXiv:2202.05725v1 [cs.CR])
    Speaker verification systems have been widely used in smart phones and Internet of things devices to identify a legitimate user. In recent work, it has been shown that adversarial attacks, such as FAKEBOB, can work effectively against speaker verification systems. The goal of this paper is to design a detector that can distinguish an original audio from an audio contaminated by adversarial attacks. Specifically, our designed detector, called MEH-FEST, calculates the minimum energy in high frequencies from the short-time Fourier transform of an audio and uses it as a detection metric. Through both analysis and experiments, we show that our proposed detector is easy to implement, fast to process an input audio, and effective in determining whether an audio is corrupted by FAKEBOB attacks. The experimental results indicate that the detector is extremely effective: with near zero false positive and false negative rates for detecting FAKEBOB attacks in Gaussian mixture model (GMM) and i-vector speaker verification systems. Moreover, adaptive adversarial attacks against our proposed detector and their countermeasures are discussed and studied, showing the game between attackers and defenders.
    Investigating Power laws in Deep Representation Learning. (arXiv:2202.05808v1 [cs.LG])
    Representation learning that leverages large-scale labelled datasets, is central to recent progress in machine learning. Access to task relevant labels at scale is often scarce or expensive, motivating the need to learn from unlabelled datasets with self-supervised learning (SSL). Such large unlabelled datasets (with data augmentations) often provide a good coverage of the underlying input distribution. However evaluating the representations learned by SSL algorithms still requires task-specific labelled samples in the training pipeline. Additionally, the generalization of task-specific encoding is often sensitive to potential distribution shift. Inspired by recent advances in theoretical machine learning and vision neuroscience, we observe that the eigenspectrum of the empirical feature covariance matrix often follows a power law. For visual representations, we estimate the coefficient of the power law, $\alpha$, across three key attributes which influence representation learning: learning objective (supervised, SimCLR, Barlow Twins and BYOL), network architecture (VGG, ResNet and Vision Transformer), and tasks (object and scene recognition). We observe that under mild conditions, proximity of $\alpha$ to 1, is strongly correlated to the downstream generalization performance. Furthermore, $\alpha \approx 1$ is a strong indicator of robustness to label noise during fine-tuning. Notably, $\alpha$ is computable from the representations without knowledge of any labels, thereby offering a framework to evaluate the quality of representations in unlabelled datasets.  ( 2 min )
    Intuitive Physics Guided Exploration for Sample Efficient Sim2real Transfer. (arXiv:2104.08795v2 [cs.LG] UPDATED)
    Physics-based reinforcement learning tasks can benefit from simplified physics simulators as they potentially allow near-optimal policies to be learned in simulation. However, such simulators require the latent factors (e.g. mass, friction coefficient etc.) of the associated objects and other environment-specific factors (e.g. wind speed, air density etc.) to be accurately specified, without which, it could take considerable additional learning effort to adapt the learned simulation policy to the real environment. As such a complete specification can be impractical, in this paper, we instead, focus on learning task-specific estimates of latent factors which allow the approximation of real world trajectories in an ideal simulation environment. Specifically, we propose two new concepts: a) action grouping - the idea that certain types of actions are closely associated with the estimation of certain latent factors, and; b) partial grounding - the idea that simulation of task-specific dynamics may not need precise estimation of all the latent factors. We first introduce intuitive action groupings based on human physics knowledge and experience, which is then used to design novel strategies for interacting with the real environment. Next, we describe how prior knowledge of a task in a given environment can be used to extract the relative importance of different latent factors, and how this can be used to inform partial grounding, which enables efficient learning of the task in any arbitrary environment. We demonstrate our approach in a range of physics based tasks, and show that it achieves superior performance relative to other baselines, using only a limited number of real-world interactions.  ( 2 min )
    Continual Learning with Invertible Generative Models. (arXiv:2202.05694v1 [cs.LG])
    Catastrophic forgetting (CF) happens whenever a neural network overwrites past knowledge while being trained on new tasks. Common techniques to handle CF include regularization of the weights (using, e.g., their importance on past tasks), and rehearsal strategies, where the network is constantly re-trained on past data. Generative models have also been applied for the latter, in order to have endless sources of data. In this paper, we propose a novel method that combines the strengths of regularization and generative-based rehearsal approaches. Our generative model consists of a normalizing flow (NF), a probabilistic and invertible neural network, trained on the internal embeddings of the network. By keeping a single NF throughout the training process, we show that our memory overhead remains constant. In addition, exploiting the invertibility of the NF, we propose a simple approach to regularize the network's embeddings with respect to past tasks. We show that our method performs favorably with espect to state-of-the-art approaches in the literature, with bounded computational power and memory overheads.  ( 2 min )
    Error-feedback stochastic modeling strategy for time series forecasting with convolutional neural networks. (arXiv:2002.00717v2 [cs.LG] UPDATED)
    Despite the superiority of convolutional neural networks demonstrated in time series modeling and forecasting, it has not been fully explored on the design of the neural network architecture and the tuning of the hyper-parameters. Inspired by the incremental construction strategy for building a random multilayer perceptron, we propose a novel Error-feedback Stochastic Modeling (ESM) strategy to construct a random Convolutional Neural Network (ESM-CNN) for time series forecasting task, which builds the network architecture adaptively. The ESM strategy suggests that random filters and neurons of the error-feedback fully connected layer are incrementally added to steadily compensate the prediction error during the construction process, and then a filter selection strategy is introduced to enable ESM-CNN to extract the different size of temporal features, providing helpful information at each iterative process for the prediction. The performance of ESM-CNN is justified on its prediction accuracy of one-step-ahead and multi-step-ahead forecasting tasks respectively. Comprehensive experiments on both the synthetic and real-world datasets show that the proposed ESM-CNN not only outperforms the state-of-art random neural networks, but also exhibits stronger predictive power and less computing overhead in comparison to trained state-of-art deep neural network models.  ( 2 min )
    Distributed saddle point problems for strongly concave-convex functions. (arXiv:2202.05812v1 [math.OC])
    In this paper, we propose GT-GDA, a distributed optimization method to solve saddle point problems of the form: $\min_{\mathbf{x}} \max_{\mathbf{y}} \{F(\mathbf{x},\mathbf{y}) :=G(\mathbf{x}) + \langle \mathbf{y}, \overline{P} \mathbf{x} \rangle - H(\mathbf{y})\}$, where the functions $G(\cdot)$, $H(\cdot)$, and the the coupling matrix $\overline{P}$ are distributed over a strongly connected network of nodes. GT-GDA is a first-order method that uses gradient tracking to eliminate the dissimilarity caused by heterogeneous data distribution among the nodes. In the most general form, GT-GDA includes a consensus over the local coupling matrices to achieve the optimal (unique) saddle point, however, at the expense of increased communication. To avoid this, we propose a more efficient variant GT-GDA-Lite that does not incur the additional communication and analyze its convergence in various scenarios. We show that GT-GDA converges linearly to the unique saddle point solution when $G(\cdot)$ is smooth and convex, $H(\cdot)$ is smooth and strongly convex, and the global coupling matrix $\overline{P}$ has full column rank. We further characterize the regime under which GT-GDA exhibits a network topology-independent convergence behavior. We next show the linear convergence of GT-GDA to an error around the unique saddle point, which goes to zero when the coupling cost ${\langle \mathbf y, \overline{P} \mathbf x \rangle}$ is common to all nodes, or when $G(\cdot)$ and $H(\cdot)$ are quadratic. Numerical experiments illustrate the convergence properties and importance of GT-GDA and GT-GDA-Lite for several applications.  ( 2 min )
    Graphon-aided Joint Estimation of Multiple Graphs. (arXiv:2202.05686v1 [stat.ML])
    We consider the problem of estimating the topology of multiple networks from nodal observations, where these networks are assumed to be drawn from the same (unknown) random graph model. We adopt a graphon as our random graph model, which is a nonparametric model from which graphs of potentially different sizes can be drawn. The versatility of graphons allows us to tackle the joint inference problem even for the cases where the graphs to be recovered contain different number of nodes and lack precise alignment across the graphs. Our solution is based on combining a maximum likelihood penalty with graphon estimation schemes and can be used to augment existing network inference methods. We validate our proposed approach by comparing its performance against competing methods in synthetic and real-world datasets.  ( 2 min )
    From Unsupervised to Few-shot Graph Anomaly Detection: A Multi-scale Contrastive Learning Approach. (arXiv:2202.05525v1 [cs.LG])
    Anomaly detection from graph data is an important data mining task in many applications such as social networks, finance, and e-commerce. Existing efforts in graph anomaly detection typically only consider the information in a single scale (view), thus inevitably limiting their capability in capturing anomalous patterns in complex graph data. To address this limitation, we propose a novel framework, graph ANomaly dEtection framework with Multi-scale cONtrastive lEarning (ANEMONE in short). By using a graph neural network as a backbone to encode the information from multiple graph scales (views), we learn better representation for nodes in a graph. In maximizing the agreements between instances at both the patch and context levels concurrently, we estimate the anomaly score of each node with a statistical anomaly estimator according to the degree of agreement from multiple perspectives. To further exploit a handful of ground-truth anomalies (few-shot anomalies) that may be collected in real-life applications, we further propose an extended algorithm, ANEMONE-FS, to integrate valuable information in our method. We conduct extensive experiments under purely unsupervised settings and few-shot anomaly detection settings, and we demonstrate that the proposed method ANEMONE and its variant ANEMONE-FS consistently outperform state-of-the-art algorithms on six benchmark datasets.  ( 2 min )
    Towards a Guideline for Evaluation Metrics in Medical Image Segmentation. (arXiv:2202.05273v1 [eess.IV])
    In the last decade, research on artificial intelligence has seen rapid growth with deep learning models, especially in the field of medical image segmentation. Various studies demonstrated that these models have powerful prediction capabilities and achieved similar results as clinicians. However, recent studies revealed that the evaluation in image segmentation studies lacks reliable model performance assessment and showed statistical bias by incorrect metric implementation or usage. Thus, this work provides an overview and interpretation guide on the following metrics for medical image segmentation evaluation in binary as well as multi-class problems: Dice similarity coefficient, Jaccard, Sensitivity, Specificity, Rand index, ROC curves, Cohen's Kappa, and Hausdorff distance. As a summary, we propose a guideline for standardized medical image segmentation evaluation to improve evaluation quality, reproducibility, and comparability in the research field.  ( 2 min )
    Learning Temporal Rules from Noisy Timeseries Data. (arXiv:2202.05403v1 [cs.LG])
    Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused distinct temporal orderings of patient vitals and player movements respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite events labels for supervision. This is done through efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and healthcare datasets where it outperforms the baseline methods for rule discovery.  ( 2 min )
    A Wasserstein GAN for Joint Learning of Inpainting and its Spatial Optimisation. (arXiv:2202.05623v1 [eess.IV])
    Classic image inpainting is a restoration method that reconstructs missing image parts. However, a carefully selected mask of known pixels that yield a high quality inpainting can also act as a sparse image representation. This challenging spatial optimisation problem is essential for practical applications such as compression. So far, it has been almost exclusively addressed by model-based approaches. First attempts with neural networks seem promising, but are tailored towards specific inpainting operators or require postprocessing. To address this issue, we propose the first generative adversarial network for spatial inpainting data optimisation. In contrast to previous approaches, it allows joint training of an inpainting generator and a corresponding mask optimisation network. With a Wasserstein distance, we ensure that our inpainting results accurately reflect the statistics of natural images. This yields significant improvements in visual quality and speed over conventional stochastic models and also outperforms current spatial optimisation networks.  ( 2 min )
    Computational-Statistical Gaps in Reinforcement Learning. (arXiv:2202.05444v1 [cs.LG])
    Reinforcement learning with function approximation has recently achieved tremendous results in applications with large state spaces. This empirical success has motivated a growing body of theoretical work proposing necessary and sufficient conditions under which efficient reinforcement learning is possible. From this line of work, a remarkably simple minimal sufficient condition has emerged for sample efficient reinforcement learning: MDPs with optimal value function $V^*$ and $Q^*$ linear in some known low-dimensional features. In this setting, recent works have designed sample efficient algorithms which require a number of samples polynomial in the feature dimension and independent of the size of state space. They however leave finding computationally efficient algorithms as future work and this is considered a major open problem in the community. In this work, we make progress on this open problem by presenting the first computational lower bound for RL with linear function approximation: unless NP=RP, no randomized polynomial time algorithm exists for deterministic transition MDPs with a constant number of actions and linear optimal value functions. To prove this, we show a reduction from Unique-Sat, where we convert a CNF formula into an MDP with deterministic transitions, constant number of actions and low dimensional linear optimal value functions. This result also exhibits the first computational-statistical gap in reinforcement learning with linear function approximation, as the underlying statistical problem is information-theoretically solvable with a polynomial number of queries, but no computationally efficient algorithm exists unless NP=RP. Finally, we also prove a quasi-polynomial time lower bound under the Randomized Exponential Time Hypothesis.  ( 2 min )
    PARSE: Pairwise Alignment of Representations in Semi-Supervised EEG Learning for Emotion Recognition. (arXiv:2202.05400v1 [cs.LG])
    We propose PARSE, a novel semi-supervised architecture for learning strong EEG representations for emotion recognition. To reduce the potential distribution mismatch between the large amounts of unlabeled data and the limited amount of labeled data, PARSE uses pairwise representation alignment. First, our model performs data augmentation followed by label guessing for large amounts of original and augmented unlabeled data. This is then followed by sharpening of the guessed labels and convex combinations of the unlabeled and labeled data. Finally, representation alignment and emotion classification are performed. To rigorously test our model, we compare PARSE to several state-of-the-art semi-supervised approaches which we implement and adapt for EEG learning. We perform these experiments on four public EEG-based emotion recognition datasets, SEED, SEED-IV, SEED-V and AMIGOS (valence and arousal). The experiments show that our proposed framework achieves the overall best results with varying amounts of limited labeled samples in SEED, SEED-IV and AMIGOS (valence), while approaching the overall best result (reaching the second-best) in SEED-V and AMIGOS (arousal). The analysis shows that our pairwise representation alignment considerably improves the performance by reducing the distribution alignment between unlabeled and labeled data, especially when only 1 sample per class is labeled.  ( 2 min )
    Measuring dissimilarity with diffeomorphism invariance. (arXiv:2202.05614v1 [stat.ML])
    Measures of similarity (or dissimilarity) are a key ingredient to many machine learning algorithms. We introduce DID, a pairwise dissimilarity measure applicable to a wide range of data spaces, which leverages the data's internal structure to be invariant to diffeomorphisms. We prove that DID enjoys properties which make it relevant for theoretical study and practical use. By representing each datum as a function, DID is defined as the solution to an optimization problem in a Reproducing Kernel Hilbert Space and can be expressed in closed-form. In practice, it can be efficiently approximated via Nystr\"om sampling. Empirical experiments support the merits of DID.  ( 2 min )
    Concurrent Training of a Control Policy and a State Estimator for Dynamic and Robust Legged Locomotion. (arXiv:2202.05481v1 [cs.RO])
    In this paper, we propose a locomotion training framework where a control policy and a state estimator are trained concurrently. The framework consists of a policy network which outputs the desired joint positions and a state estimation network which outputs estimates of the robot's states such as the base linear velocity, foot height, and contact probability. We exploit a fast simulation environment to train the networks and the trained networks are transferred to the real robot. The trained policy and state estimator are capable of traversing diverse terrains such as a hill, slippery plate, and bumpy road. We also demonstrate that the learned policy can run at up to 3.75 m/s on normal flat ground and 3.54 m/s on a slippery plate with the coefficient of friction of 0.22.  ( 2 min )
    Long-Time Convergence and Propagation of Chaos for Nonlinear MCMC. (arXiv:2202.05621v1 [stat.ML])
    In this paper, we study the long-time convergence and uniform strong propagation of chaos for a class of nonlinear Markov chains for Markov chain Monte Carlo (MCMC). Our technique is quite simple, making use of recent contraction estimates for linear Markov kernels and basic techniques from Markov theory and analysis. Moreover, the same proof strategy applies to both the long-time convergence and propagation of chaos. We also show, via some experiments, that these nonlinear MCMC techniques are viable for use in real-world high-dimensional inference such as Bayesian neural networks.  ( 2 min )
    Universal Learning Waveform Selection Strategies for Adaptive Target Tracking. (arXiv:2202.05294v1 [cs.IT])
    Online selection of optimal waveforms for target tracking with active sensors has long been a problem of interest. Many conventional solutions utilize an estimation-theoretic interpretation, in which a waveform-specific Cram\'{e}r-Rao lower bound on measurement error is used to select the optimal waveform for each tracking step. However, this approach is only valid in the high SNR regime, and requires a rather restrictive set of assumptions regarding the target motion and measurement models. Further, due to computational concerns, many traditional approaches are limited to near-term, or myopic, optimization, even though radar scenes exhibit strong temporal correlation. More recently, reinforcement learning has been proposed for waveform selection, in which the problem is framed as a Markov decision process (MDP), allowing for long-term planning. However, a major limitation of reinforcement learning is that the memory length of the underlying Markov process is often unknown for realistic target and channel dynamics, and a more general framework is desirable. This work develops a universal sequential waveform selection scheme which asymptotically achieves Bellman optimality in any radar scene which can be modeled as a $U^{\text{th}}$ order Markov process for a finite, but unknown, integer $U$. Our approach is based on well-established tools from the field of universal source coding, where a stationary source is parsed into variable length phrases in order to build a context-tree, which is used as a probabalistic model for the scene's behavior. We show that an algorithm based on a multi-alphabet version of the Context-Tree Weighting (CTW) method can be used to optimally solve a broad class of waveform-agile tracking problems while making minimal assumptions about the environment's behavior.  ( 2 min )
    Conditional Contrastive Learning with Kernel. (arXiv:2202.05458v1 [cs.LG])
    Conditional contrastive learning frameworks consider the conditional sampling procedure that constructs positive or negative data pairs conditioned on specific variables. Fair contrastive learning constructs negative pairs, for example, from the same gender (conditioning on sensitive information), which in turn reduces undesirable information from the learned representations; weakly supervised contrastive learning constructs positive pairs with similar annotative attributes (conditioning on auxiliary information), which in turn are incorporated into the representations. Although conditional contrastive learning enables many applications, the conditional sampling procedure can be challenging if we cannot obtain sufficient data pairs for some values of the conditioning variable. This paper presents Conditional Contrastive Learning with Kernel (CCL-K) that converts existing conditional contrastive objectives into alternative forms that mitigate the insufficient data problem. Instead of sampling data according to the value of the conditioning variable, CCL-K uses the Kernel Conditional Embedding Operator that samples data from all available data and assigns weights to each sampled data given the kernel similarity between the values of the conditioning variable. We conduct experiments using weakly supervised, fair, and hard negatives contrastive learning, showing CCL-K outperforms state-of-the-art baselines.  ( 2 min )
    Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. (arXiv:2202.05306v1 [cs.LG])
    We hypothesize that due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain on the accuracy when the model has access to it in addition to another modality. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.  ( 2 min )
    CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning. (arXiv:2202.05613v1 [cs.LG])
    Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance. Sample re-weighting methods are popularly used to alleviate this data bias issue. Most current methods, however, require to manually pre-specify the weighting schemes as well as their additional hyper-parameters relying on the characteristics of the investigated problem and training data. This makes them fairly hard to be generally applied in practical scenarios, due to their significant complexities and inter-class variations of data bias situations. To address this issue, we propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data. Specifically, by seeing each training class as a separate learning task, our method aims to extract an explicit weighting function with sample loss and task/class feature as input, and sample weight as output, expecting to impose adaptively varying weighting schemes to different sample classes based on their own intrinsic bias characteristics. Synthetic and real data experiments substantiate the capability of our method on achieving proper weighting schemes in various data bias cases, like the class imbalance, feature-independent and dependent label noise scenarios, and more complicated bias scenarios beyond conventional cases. Besides, the task-transferability of the learned weighting scheme is also substantiated, by readily deploying the weighting function learned on relatively smaller-scale CIFAR-10 dataset on much larger-scale full WebVision dataset. A performance gain can be readily achieved compared with previous SOAT ones without additional hyper-parameter tuning and meta gradient descent step. The general availability of our method for multiple robust deep learning issues, including partial-label learning, semi-supervised learning and selective classification, has also been validated.  ( 2 min )
    Efficient Kernel UCB for Contextual Bandits. (arXiv:2202.05638v1 [cs.LG])
    In this paper, we tackle the computational efficiency of kernelized UCB algorithms in contextual bandits. While standard methods require a O(CT^3) complexity where T is the horizon and the constant C is related to optimizing the UCB rule, we propose an efficient contextual algorithm for large-scale problems. Specifically, our method relies on incremental Nystrom approximations of the joint kernel embedding of contexts and actions. This allows us to achieve a complexity of O(CTm^2) where m is the number of Nystrom points. To recover the same regret as the standard kernelized UCB algorithm, m needs to be of order of the effective dimension of the problem, which is at most O(\sqrt(T)) and nearly constant in some cases.  ( 2 min )
    Predicting Fuel Consumption in Power Generation Plants using Machine Learning and Neural Networks. (arXiv:2202.05591v1 [cs.LG])
    The instability of power generation from national grids has led industries (e.g., telecommunication) to rely on plant generators to run their businesses. However, these secondary generators create additional challenges such as fuel leakages in and out of the system and perturbations in the fuel level gauges. Consequently, telecommunication operators have been involved in a constant need for fuel to supply diesel generators. With the increase in fuel prices due to socio-economic factors, excessive fuel consumption and fuel pilferage become a problem, and this affects the smooth run of the network companies. In this work, we compared four machine learning algorithms (i.e. Gradient Boosting, Random Forest, Neural Network, and Lasso) to predict the amount of fuel consumed by a power generation plant. After evaluating the predictive accuracy of these models, the Gradient Boosting model out-perform the other three regressor models with the highest Nash efficiency value of 99.1%.  ( 2 min )
    Factored World Models for Zero-Shot Generalization in Robotic Manipulation. (arXiv:2202.05333v1 [cs.RO])
    World models for environments with many objects face a combinatorial explosion of states: as the number of objects increases, the number of possible arrangements grows exponentially. In this paper, we learn to generalize over robotic pick-and-place tasks using object-factored world models, which combat the combinatorial explosion by ensuring that predictions are equivariant to permutations of objects. Previous object-factored models were limited either by their inability to model actions, or by their inability to plan for complex manipulation tasks. We build on recent contrastive methods for training object-factored world models, which we extend to model continuous robot actions and to accurately predict the physics of robotic pick-and-place. To do so, we use a residual stack of graph neural networks that receive action information at multiple levels in both their node and edge neural networks. Crucially, our learned model can make predictions about tasks not represented in the training data. That is, we demonstrate successful zero-shot generalization to novel tasks, with only a minor decrease in model performance. Moreover, we show that an ensemble of our models can be used to plan for tasks involving up to 12 pick and place actions using heuristic search. We also demonstrate transfer to a physical robot.  ( 2 min )
    Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass. (arXiv:2202.05396v1 [eess.AS])
    It is estimated that around 70 million people worldwide are affected by a speech disorder called stuttering. With recent advances in Automatic Speech Recognition (ASR), voice assistants are increasingly useful in our everyday lives. Many technologies in education, retail, telecommunication and healthcare can now be operated through voice. Unfortunately, these benefits are not accessible for People Who Stutter (PWS). We propose a simple but effective method called 'Detect and Pass' to make modern ASR systems accessible for People Who Stutter in a limited data setting. The algorithm uses a context aware classifier trained on a limited amount of data, to detect acoustic frames that contain stutter. To improve robustness on stuttered speech, this extra information is passed on to the ASR model to be utilized during inference. Our experiments show a reduction of 12.18% to 71.24% in Word Error Rate (WER) across various state of the art ASR systems. Upon varying the threshold of the associated posterior probability of stutter for each stacked frame used in determining low frame rate (LFR) acoustic features, we were able to determine an optimal setting that reduced the WER by 23.93% to 71.67% across different ASR systems.  ( 2 min )
    Personalization Improves Privacy-Accuracy Tradeoffs in Federated Optimization. (arXiv:2202.05318v1 [stat.ML])
    Large-scale machine learning systems often involve data distributed across a collection of users. Federated optimization algorithms leverage this structure by communicating model updates to a central server, rather than entire datasets. In this paper, we study stochastic optimization algorithms for a personalized federated learning setting involving local and global models subject to user-level (joint) differential privacy. While learning a private global model induces a cost of privacy, local learning is perfectly private. We show that coordinating local learning with private centralized learning yields a generically useful and improved tradeoff between accuracy and privacy. We illustrate our theoretical results with experiments on synthetic and real-world datasets.  ( 2 min )
    Hybridization of Capsule and LSTM Networks for unsupervised anomaly detection on multivariate data. (arXiv:2202.05538v1 [cs.LG])
    Deep learning techniques have recently shown promise in the field of anomaly detection, providing a flexible and effective method of modelling systems in comparison to traditional statistical modelling and signal processing-based methods. However, there are a few well publicised issues Neural Networks (NN)s face such as generalisation ability, requiring large volumes of labelled data to be able to train effectively and understanding spatial context in data. This paper introduces a novel NN architecture which hybridises the Long-Short-Term-Memory (LSTM) and Capsule Networks into a single network in a branched input Autoencoder architecture for use on multivariate time series data. The proposed method uses an unsupervised learning technique to overcome the issues with finding large volumes of labelled training data. Experimental results show that without hyperparameter optimisation, using Capsules significantly reduces overfitting and improves the training efficiency. Additionally, results also show that the branched input models can learn multivariate data more consistently with or without Capsules in comparison to the non-branched input models. The proposed model architecture was also tested on an open-source benchmark, where it achieved state-of-the-art performance in outlier detection, and overall performs best over the metrics tested in comparison to current state-of-the art methods.  ( 2 min )
    Controlling Confusion via Generalisation Bounds. (arXiv:2202.05560v1 [stat.ML])
    We establish new generalisation bounds for multiclass classification by abstracting to a more general setting of discretised error types. Extending the PAC-Bayes theory, we are hence able to provide fine-grained bounds on performance for multiclass classification, as well as applications to other learning problems including discretisation of regression losses. Tractable training objectives are derived from the bounds. The bounds are uniform over all weightings of the discretised error types and thus can be used to bound weightings not foreseen at training, including the full confusion matrix in the multiclass classification case.  ( 2 min )
    Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer. (arXiv:2202.05508v1 [cs.CV])
    Text spotting end-to-end methods have recently gained attention in the literature due to the benefits of jointly optimizing the text detection and recognition components. Existing methods usually have a distinct separation between the detection and recognition branches, requiring exact annotations for the two tasks. We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting and the first text spotting framework which may be trained with both fully- and weakly-supervised settings. By learning a single latent representation per word detection, and using a novel loss function based on the Hungarian loss, our method alleviates the need for expensive localization annotations. Trained with only text transcription annotations on real data, our weakly-supervised method achieves competitive performance with previous state-of-the-art fully-supervised methods. When trained in a fully-supervised manner, TextTranSpotter shows state-of-the-art results on multiple benchmarks \footnote {Our code will be publicly available upon publication.  ( 2 min )
    Electricity Consumption Forecasting for Out-of-distribution Time-of-Use Tariffs. (arXiv:2202.05517v1 [cs.LG])
    In electricity markets, retailers or brokers want to maximize profits by allocating tariff profiles to end consumers. One of the objectives of such demand response management is to incentivize the consumers to adjust their consumption so that the overall electricity procurement in the wholesale markets is minimized, e.g. it is desirable that consumers consume less during peak hours when cost of procurement for brokers from wholesale markets are high. We consider a greedy solution to maximize the overall profit for brokers by optimal tariff profile allocation. This in-turn requires forecasting electricity consumption for each user for all tariff profiles. This forecasting problem is challenging compared to standard forecasting problems due to following reasons: i. the number of possible combinations of hourly tariffs is high and retailers may not have considered all combinations in the past resulting in a biased set of tariff profiles tried in the past, ii. the profiles allocated in the past to each user is typically based on certain policy. These reasons violate the standard i.i.d. assumptions, as there is a need to evaluate new tariff profiles on existing customers and historical data is biased by the policies used in the past for tariff allocation. In this work, we consider several scenarios for forecasting and optimization under these conditions. We leverage the underlying structure of how consumers respond to variable tariff rates by comparing tariffs across hours and shifting loads, and propose suitable inductive biases in the design of deep neural network based architectures for forecasting under such scenarios. More specifically, we leverage attention mechanisms and permutation equivariant networks that allow desirable processing of tariff profiles to learn tariff representations that are insensitive to the biases in the data and still representative of the task.  ( 2 min )
    Robust estimation algorithms don't need to know the corruption level. (arXiv:2202.05453v1 [cs.LG])
    Real data are rarely pure. Hence the past half-century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, their vast majority approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This brief note abstracts the complex and pervasive robustness problem into a simple geometric puzzle. It then applies the puzzle's solution to derive a universal meta technique that converts any robust estimation algorithm requiring a tight corruption-level upper bound to achieve its optimal accuracy into one achieving essentially the same accuracy without using any upper bounds.  ( 2 min )
    On One-Bit Quantization. (arXiv:2202.05292v1 [cs.IT])
    We consider the one-bit quantizer that minimizes the mean squared error for a source living in a real Hilbert space. The optimal quantizer is a projection followed by a thresholding operation, and we provide methods for identifying the optimal direction along which to project. As an application of our methods, we characterize the optimal one-bit quantizer for a continuous-time random process that exhibits low-dimensional structure. We numerically show that this optimal quantizer is found by a neural-network-based compressor trained via stochastic gradient descent.  ( 2 min )
    Translation and Rotation Equivariant Normalizing Flow (TRENF) for Optimal Cosmological Analysis. (arXiv:2202.05282v1 [astro-ph.CO])
    Our universe is homogeneous and isotropic, and its perturbations obey translation and rotation symmetry. In this work we develop Translation and Rotation Equivariant Normalizing Flow (TRENF), a generative Normalizing Flow (NF) model which explicitly incorporates these symmetries, defining the data likelihood via a sequence of Fourier space-based convolutions and pixel-wise nonlinear transforms. TRENF gives direct access to the high dimensional data likelihood p(x|y) as a function of the labels y, such as cosmological parameters. In contrast to traditional analyses based on summary statistics, the NF approach has no loss of information since it preserves the full dimensionality of the data. On Gaussian random fields, the TRENF likelihood agrees well with the analytical expression and saturates the Fisher information content in the labels y. On nonlinear cosmological overdensity fields from N-body simulations, TRENF leads to significant improvements in constraining power over the standard power spectrum summary statistic. TRENF is also a generative model of the data, and we show that TRENF samples agree well with the N-body simulations it trained on, and that the inverse mapping of the data agrees well with a Gaussian white noise both visually and on various summary statistics: when this is perfectly achieved the resulting p(x|y) likelihood analysis becomes optimal. Finally, we develop a generalization of this model that can handle effects that break the symmetry of the data, such as the survey mask, which enables likelihood analysis on data without periodic boundaries.  ( 2 min )
    Online Decision Transformer. (arXiv:2202.05607v1 [cs.LG])
    Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are finetuned via taskspecific interactions with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework. Our framework uses sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning. Empirically, we show that ODT is competitive with the state-of-the-art in absolute performance on the D4RL benchmark but shows much more significant gains during the finetuning procedure.  ( 2 min )
    Accountability in an Algorithmic Society: Relationality, Responsibility, and Robustness in Machine Learning. (arXiv:2202.05338v1 [cs.CY])
    In 1996, philosopher Helen Nissenbaum issued a clarion call concerning the erosion of accountability in society due to the ubiquitous delegation of consequential functions to computerized systems. Using the conceptual framing of moral blame, Nissenbaum described four types of barriers to accountability that computerization presented: 1) "many hands," the problem of attributing moral responsibility for outcomes caused by many moral actors; 2) "bugs," a way software developers might shrug off responsibility by suggesting software errors are unavoidable; 3) "computer as scapegoat," shifting blame to computer systems as if they were moral actors; and 4) "ownership without liability," a free pass to the tech industry to deny responsibility for the software they produce. We revisit these four barriers in relation to the recent ascendance of data-driven algorithmic systems--technology often folded under the heading of machine learning (ML) or artificial intelligence (AI)--to uncover the new challenges for accountability that these systems present. We then look ahead to how one might construct and justify a moral, relational framework for holding responsible parties accountable, and argue that the FAccT community is uniquely well-positioned to develop such a framework to weaken the four barriers.  ( 2 min )
    Inference and FDR Control for Simulated Ising Models in High-dimension. (arXiv:2202.05612v1 [stat.ML])
    This paper studies the consistency and statistical inference of simulated Ising models in the high dimensional background. Our estimators are based on the Markov chain Monte Carlo maximum likelihood estimation (MCMC-MLE) method penalized by the Elastic-net. Under mild conditions that ensure a specific convergence rate of MCMC method, the $\ell_{1}$ consistency of Elastic-net-penalized MCMC-MLE is proved. We further propose a decorrelated score test based on the decorrelated score function and prove the asymptotic normality of the score function without the influence of many nuisance parameters under the assumption that accelerates the convergence of the MCMC method. The one-step estimator for a single parameter of interest is purposed by linearizing the decorrelated score function to solve its root, as well as its normality and confidence interval for the true value, therefore, be established. Finally, we use different algorithms to control the false discovery rate (FDR) via traditional p-values and novel e-values.  ( 2 min )
    Reduced order modeling with Barlow Twins self-supervised learning: Navigating the space between linear and nonlinear solution manifolds. (arXiv:2202.05460v1 [cs.CE])
    We propose a unified data-driven reduced order model (ROM) that bridges the performance gap between linear and nonlinear manifold approaches. Deep learning ROM (DL-ROM) using deep-convolutional autoencoders (DC-AE) has been shown to capture nonlinear solution manifolds but fails to perform adequately when linear subspace approaches such as proper orthogonal decomposition (POD) would be optimal. Besides, most DL-ROM models rely on convolutional layers, which might limit its application to only a structured mesh. The proposed framework in this study relies on the combination of an autoencoder (AE) and Barlow Twins (BT) self-supervised learning, where BT maximizes the information content of the embedding with the latent space through a joint embedding architecture. Through a series of benchmark problems of natural convection in porous media, BT-AE performs better than the previous DL-ROM framework by providing comparable results to POD-based approaches for problems where the solution lies within a linear subspace as well as DL-ROM autoencoder-based techniques where the solution lies on a nonlinear manifold; consequently, bridges the gap between linear and nonlinear reduced manifolds. Furthermore, this BT-AE framework can operate on unstructured meshes, which provides flexibility in its application to standard numerical solvers, on-site measurements, experimental data, or a combination of these sources.  ( 2 min )
    Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs. (arXiv:2202.05331v1 [cs.CV])
    Several services for people with visual disabilities have emerged recently due to achievements in Assistive Technologies and Artificial Intelligence areas. Despite the growth in assistive systems availability, there is a lack of services that support specific tasks, such as understanding the image context presented in online content, e.g., webinars. Image captioning techniques and their variants are limited as Assistive Technologies as they do not match the needs of visually impaired people when generating specific descriptions. We propose an approach for generating context of webinar images combining a dense captioning technique with a set of filters, to fit the captions in our domain, and a language model for the abstractive summary task. The results demonstrated that we can produce descriptions with higher interpretability and focused on the relevant information for that group of people by combining image analysis methods and neural language models.  ( 2 min )
    Regularized Q-learning. (arXiv:2202.05404v1 [cs.LG])
    Q-learning is widely used algorithm in reinforcement learning community. Under the lookup table setting, its convergence is well established. However, its behavior is known to be unstable with the linear function approximation case. This paper develops a new Q-learning algorithm that converges when linear function approximation is used. We prove that simply adding an appropriate regularization term ensures convergence of the algorithm. We prove its stability using a recent analysis tool based on switching system models. Moreover, we experimentally show that it converges in environments where Q-learning with linear function approximation has known to diverge. We also provide an error bound on the solution where the algorithm converges.  ( 2 min )
    Similarity learning for wells based on logging data. (arXiv:2202.05583v1 [cs.LG])
    One of the first steps during the investigation of geological objects is the interwell correlation. It provides information on the structure of the objects under study, as it comprises the framework for constructing geological models and assessing hydrocarbon reserves. Today, the detailed interwell correlation relies on manual analysis of well-logging data. Thus, it is time-consuming and of a subjective nature. The essence of the interwell correlation constitutes an assessment of the similarities between geological profiles. There were many attempts to automate the process of interwell correlation by means of rule-based approaches, classic machine learning approaches, and deep learning approaches in the past. However, most approaches are of limited usage and inherent subjectivity of experts. We propose a novel framework to solve the geological profile similarity estimation based on a deep learning model. Our similarity model takes well-logging data as input and provides the similarity of wells as output. The developed framework enables (1) extracting patterns and essential characteristics of geological profiles within the wells and (2) model training following the unsupervised paradigm without the need for manual analysis and interpretation of well-logging data. For model testing, we used two open datasets originating in New Zealand and Norway. Our data-based similarity models provide high performance: the accuracy of our model is $0.926$ compared to $0.787$ for baselines based on the popular gradient boosting approach. With them, an oil\&gas practitioner can improve interwell correlation quality and reduce operation time.  ( 2 min )
    Invariance Principle Meets Out-of-Distribution Generalization on Graphs. (arXiv:2202.05441v1 [cs.LG])
    Despite recent developments in using the invariance principle from causality to enable out-of-distribution (OOD) generalization on Euclidean data, e.g., images, studies on graph data are limited. Different from images, the complex nature of graphs poses unique challenges that thwart the adoption of the invariance principle for OOD generalization. In particular, distribution shifts on graphs can happen at both structure-level and attribute-level, which increases the difficulty of capturing the invariance. Moreover, domain or environment partitions, which are often required by OOD methods developed on Euclidean data, can be expensive to obtain for graphs. Aiming to bridge this gap, we characterize distribution shifts on graphs with causal models, and show that the OOD generalization on graphs with invariance principle is possible by identifying an invariant subgraph for making predictions. We propose a novel framework to explicitly model this process using a contrastive strategy. By contrasting the estimated invariant subgraphs, our framework can provably identify the underlying invariant subgraph under mild assumptions. Experiments across several synthetic and real-world datasets demonstrate the state-of-the-art OOD generalization ability of our method.  ( 2 min )
    Robust, Deep, and Reinforcement Learning for Management of Communication and Power Networks. (arXiv:2202.05395v1 [cs.LG])
    This thesis develops data-driven machine learning algorithms to managing and optimizing the next-generation highly complex cyberphysical systems, which desperately need ground-breaking control, monitoring, and decision making schemes that can guarantee robustness, scalability, and situational awareness. The present thesis first develops principled methods to make generic machine learning models robust against distributional uncertainties and adversarial data. Particular focus will be on parametric models where some training data are being used to learn a parametric model. The developed framework is of high interest especially when training and testing data are drawn from "slightly" different distribution. We then introduce distributionally robust learning frameworks to minimize the worst-case expected loss over a prescribed ambiguity set of training distributions quantified via Wasserstein distance. Later, we build on this robust framework to design robust semi-supervised learning over graph methods. The second part of this thesis aspires to fully unleash the potential of next-generation wired and wireless networks, where we design "smart" network entities using (deep) reinforcement learning approaches. Finally, this thesis enhances the power system operation and control. Our contribution is on sustainable distribution grids with high penetration of renewable sources and demand response programs. To account for unanticipated and rapidly changing renewable generation and load consumption scenarios, we specifically delegate reactive power compensation to both utility-owned control devices (e.g., capacitor banks), as well as smart inverters of distributed generation units with cyber-capabilities.  ( 2 min )
    What Does it Mean for a Language Model to Preserve Privacy?. (arXiv:2202.05520v1 [stat.ML])
    Natural language reflects our private lives and identities, making its privacy concerns as broad as those of real life. Language models lack the ability to understand the context and sensitivity of text, and tend to memorize phrases present in their training sets. An adversary can exploit this tendency to extract training data. Depending on the nature of the content and the context in which this data was collected, this could violate expectations of privacy. Thus there is a growing interest in techniques for training language models that preserve privacy. In this paper, we discuss the mismatch between the narrow assumptions made by popular data protection techniques (data sanitization and differential privacy), and the broadness of natural language and of privacy as a social norm. We argue that existing protection methods cannot guarantee a generic and meaningful notion of privacy for language models. We conclude that language models should be trained on text data which was explicitly produced for public use.  ( 2 min )
    Jigsaw Puzzle: Selective Backdoor Attack to Subvert Malware Classifiers. (arXiv:2202.05470v1 [cs.CR])
    Malware classifiers are subject to training-time exploitation due to the need to regularly retrain using samples collected from the wild. Recent work has demonstrated the feasibility of backdoor attacks against malware classifiers, and yet the stealthiness of such attacks is not well understood. In this paper, we investigate this phenomenon under the clean-label setting (i.e., attackers do not have complete control over the training or labeling process). Empirically, we show that existing backdoor attacks in malware classifiers are still detectable by recent defenses such as MNTD. To improve stealthiness, we propose a new attack, Jigsaw Puzzle (JP), based on the key observation that malware authors have little to no incentive to protect any other authors' malware but their own. As such, Jigsaw Puzzle learns a trigger to complement the latent patterns of the malware author's samples, and activates the backdoor only when the trigger and the latent pattern are pieced together in a sample. We further focus on realizable triggers in the problem space (e.g., software code) using bytecode gadgets broadly harvested from benign software. Our evaluation confirms that Jigsaw Puzzle is effective as a backdoor, remains stealthy against state-of-the-art defenses, and is a threat in realistic settings that depart from reasoning about feature-space only attacks. We conclude by exploring promising approaches to improve backdoor defenses.  ( 2 min )
    A Machine-Learning-Aided Visual Analysis Workflow for Investigating Air Pollution Data. (arXiv:2202.05413v1 [cs.LG])
    Analyzing air pollution data is challenging as there are various analysis focuses from different aspects: feature (what), space (where), and time (when). As in most geospatial analysis problems, besides high-dimensional features, the temporal and spatial dependencies of air pollution induce the complexity of performing analysis. Machine learning methods, such as dimensionality reduction, can extract and summarize important information of the data to lift the burden of understanding such a complicated environment. In this paper, we present a methodology that utilizes multiple machine learning methods to uniformly explore these aspects. With this methodology, we develop a visual analytic system that supports a flexible analysis workflow, allowing domain experts to freely explore different aspects based on their analysis needs. We demonstrate the capability of our system and analysis workflow supporting a variety of analysis tasks with multiple use cases.  ( 2 min )
    Fast and Robust Sparsity Learning over Networks: A Decentralized Surrogate Median Regression Approach. (arXiv:2202.05498v1 [stat.ML])
    Decentralized sparsity learning has attracted a significant amount of attention recently due to its rapidly growing applications. To obtain the robust and sparse estimators, a natural idea is to adopt the non-smooth median loss combined with a $\ell_1$ sparsity regularizer. However, most of the existing methods suffer from slow convergence performance caused by the {\em double} non-smooth objective. To accelerate the computation, in this paper, we proposed a decentralized surrogate median regression (deSMR) method for efficiently solving the decentralized sparsity learning problem. We show that our proposed algorithm enjoys a linear convergence rate with a simple implementation. We also investigate the statistical guarantee, and it shows that our proposed estimator achieves a near-oracle convergence rate without any restriction on the number of network nodes. Moreover, we establish the theoretical results for sparse support recovery. Thorough numerical experiments and real data study are provided to demonstrate the effectiveness of our method.  ( 2 min )
    Shuffle Private Linear Contextual Bandits. (arXiv:2202.05567v1 [cs.LG])
    Differential privacy (DP) has been recently introduced to linear contextual bandits to formally address the privacy concerns in its associated personalized services to participating users (e.g., recommendations). Prior work largely focus on two trust models of DP: the central model, where a central server is responsible for protecting users sensitive data, and the (stronger) local model, where information needs to be protected directly on user side. However, there remains a fundamental gap in the utility achieved by learning algorithms under these two privacy models, e.g., $\tilde{O}(\sqrt{T})$ regret in the central model as compared to $\tilde{O}(T^{3/4})$ regret in the local model, if all users are unique within a learning horizon $T$. In this work, we aim to achieve a stronger model of trust than the central model, while suffering a smaller regret than the local model by considering recently popular shuffle model of privacy. We propose a general algorithmic framework for linear contextual bandits under the shuffle trust model, where there exists a trusted shuffler in between users and the central server, that randomly permutes a batch of users data before sending those to the server. We then instantiate this framework with two specific shuffle protocols: one relying on privacy amplification of local mechanisms, and another incorporating a protocol for summing vectors and matrices of bounded norms. We prove that both these instantiations lead to regret guarantees that significantly improve on that of the local model, and can potentially be of the order $\tilde{O}(T^{3/5})$ if all users are unique. We also verify this regret behavior with simulations on synthetic data. Finally, under the practical scenario of non-unique users, we show that the regret of our shuffle private algorithm scale as $\tilde{O}(T^{2/3})$, which matches that the central model could achieve in this case.  ( 2 min )
    Cyclical Curriculum Learning. (arXiv:2202.05531v1 [cs.LG])
    Artificial neural networks (ANN) are inspired by human learning. However, unlike human education, classical ANN does not use a curriculum. Curriculum Learning (CL) refers to the process of ANN training in which examples are used in a meaningful order. When using CL, training begins with a subset of the dataset and new samples are added throughout the training, or training begins with the entire dataset and the number of samples used is reduced. With these changes in training dataset size, better results can be obtained with curriculum, anti-curriculum, or random-curriculum methods than the vanilla method. However, a generally efficient CL method for various architectures and data sets is not found. In this paper, we propose cyclical curriculum learning (CCL), in which the data size used during training changes cyclically rather than simply increasing or decreasing. Instead of using only the vanilla method or only the curriculum method, using both methods cyclically like in CCL provides more successful results. We tested the method on 18 different data sets and 15 architectures in image and text classification tasks and obtained more successful results than no-CL and existing CL methods. We also have shown theoretically that it is less erroneous to apply CL and vanilla cyclically instead of using only CL or only vanilla method. The code of Cyclical Curriculum is available at https://github.com/CyclicalCurriculum/Cyclical-Curriculum.  ( 2 min )
    Neural Architecture Search for Energy Efficient Always-on Audio Models. (arXiv:2202.05397v1 [eess.AS])
    Mobile and edge computing devices for always-on audio classification require energy-efficient neural network architectures. We present a neural architecture search (NAS) that optimizes accuracy, energy efficiency and memory usage. The search is run on Vizier, a black-box optimization service. We present a search strategy that uses both Bayesian and regularized evolutionary search with particle swarms, and employs early-stopping to reduce the computational burden. The search returns architectures for a sound-event classification dataset based upon AudioSet with similar accuracy to MobileNetV1/V2 implementations but with an order of magnitude less energy per inference and a much smaller memory footprint.  ( 2 min )
    Noise Augmentation Is All You Need For FGSM Fast Adversarial Training: Catastrophic Overfitting And Robust Overfitting Require Different Augmentation. (arXiv:2202.05488v1 [cs.LG])
    Adversarial training (AT) and its variants are the most effective approaches for obtaining adversarially robust models. A unique characteristic of AT is that an inner maximization problem needs to be solved repeatedly before the model weights can be updated, which makes the training slow. FGSM AT significantly improves its efficiency but it fails when the step size grows. The SOTA GradAlign makes FGSM AT compatible with a higher step size, however, its regularization on input gradient makes it 3 to 4 times slower than FGSM AT. Our proposed NoiseAug removes the extra computation overhead by directly regularizing on the input itself. The key contribution of this work lies in an empirical finding that single-step FGSM AT is not as hard as suggested in the past line of work: noise augmentation is all you need for (FGSM) fast AT. Towards understanding the success of our NoiseAug, we perform an extensive analysis and find that mitigating Catastrophic Overfitting (CO) and Robust Overfitting (RO) need different augmentations. Instead of more samples caused by data augmentation, we identify what makes NoiseAug effective for preventing CO might lie in its improved local linearity.  ( 2 min )
    The Shapley Value in Machine Learning. (arXiv:2202.05594v1 [cs.LG])
    Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research.  ( 2 min )
    Trust in AI: Interpretability is not necessary or sufficient, while black-box interaction is necessary and sufficient. (arXiv:2202.05302v1 [cs.LG])
    The problem of human trust in artificial intelligence is one of the most fundamental problems in applied machine learning. Our processes for evaluating AI trustworthiness have substantial ramifications for ML's impact on science, health, and humanity, yet confusion surrounds foundational concepts. What does it mean to trust an AI, and how do humans assess AI trustworthiness? What are the mechanisms for building trustworthy AI? And what is the role of interpretable ML in trust? Here, we draw from statistical learning theory and sociological lenses on human-automation trust to motivate an AI-as-tool framework, which distinguishes human-AI trust from human-AI-human trust. Evaluating an AI's contractual trustworthiness involves predicting future model behavior using behavior certificates (BCs) that aggregate behavioral evidence from diverse sources including empirical out-of-distribution and out-of-task evaluation and theoretical proofs linking model architecture to behavior. We clarify the role of interpretability in trust with a ladder of model access. Interpretability (level 3) is not necessary or even sufficient for trust, while the ability to run a black-box model at-will (level 2) is necessary and sufficient. While interpretability can offer benefits for trust, it can also incur costs. We clarify ways interpretability can contribute to trust, while questioning the perceived centrality of interpretability to trust in popular discourse. How can we empower people with tools to evaluate trust? Instead of trying to understand how a model works, we argue for understanding how a model behaves. Instead of opening up black boxes, we should create more behavior certificates that are more correct, relevant, and understandable. We discuss how to build trusted and trustworthy AI responsibly.  ( 2 min )
    Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks. (arXiv:2202.05510v1 [cs.LG])
    Understanding implicit bias of gradient descent has been an important goal in machine learning research. Unfortunately, even for a single-neuron ReLU network, it recently proved impossible to characterize the implicit regularization with the square loss by an explicit function of the norm of model parameters. In order to close the gap between the existing theory and the intriguing empirical behavior of ReLU networks, here we examine the gradient flow dynamics in the parameter space when training single-neuron ReLU networks. Specifically, we discover implicit bias in terms of support vectors in ReLU networks, which play a key role in why and how ReLU networks generalize well. Moreover, we analyze gradient flows with respect to the magnitude of the norm of initialization, and show the impact of the norm in gradient dynamics. Lastly, under some conditions, we prove that the norm of the learned weight strictly increases on the gradient flow.  ( 2 min )
    Coded ResNeXt: a network for designing disentangled information paths. (arXiv:2202.05343v1 [cs.CV])
    To avoid treating neural networks as highly complex black boxes, the deep learning research community has tried to build interpretable models allowing humans to understand the decisions taken by the model. Unfortunately, the focus is mostly on manipulating only the very high-level features associated with the last layers. In this work, we look at neural network architectures for classification in a more general way and introduce an algorithm which defines before the training the paths of the network through which the per-class information flows. We show that using our algorithm we can extract a lighter single-purpose binary classifier for a particular class by removing the parameters that do not participate in the predefined information path of that class, which is approximately 60% of the total parameters. Notably, leveraging coding theory to design the information paths enables us to use intermediate network layers for making early predictions without having to evaluate the full network. We demonstrate that a slightly modified ResNeXt model, trained with our algorithm, can achieve higher classification accuracy on CIFAR-10/100 and ImageNet than the original ResNeXt, while having all the aforementioned properties.  ( 2 min )
    Including Facial Expressions in Contextual Embeddings for Sign Language Generation. (arXiv:2202.05383v1 [cs.CL])
    State-of-the-art sign language generation frameworks lack expressivity and naturalness which is the result of only focusing manual signs, neglecting the affective, grammatical and semantic functions of facial expressions. The purpose of this work is to augment semantic representation of sign language through grounding facial expressions. We study the effect of modeling the relationship between text, gloss, and facial expressions on the performance of the sign generation systems. In particular, we propose a Dual Encoder Transformer able to generate manual signs as well as facial expressions by capturing the similarities and differences found in text and sign gloss annotation. We take into consideration the role of facial muscle activity to express intensities of manual signs by being the first to employ facial action units in sign language generation. We perform a series of experiments showing that our proposed model improves the quality of automatically generated sign language.  ( 2 min )
    Development and Validation of an AI-Driven Model for the La Rance Tidal Barrage: A Generalisable Case Study. (arXiv:2202.05347v1 [cs.LG])
    In this work, an AI-Driven (autonomous) model representation of the La Rance tidal barrage was developed using novel parametrisation and Deep Reinforcement Learning (DRL) techniques. Our model results were validated with experimental measurements, yielding the first Tidal Range Structure (TRS) model validated against a constructed tidal barrage and made available to academics. In order to proper model La Rance, parametrisation methodologies were developed for simulating (i) turbines (in pumping and power generation modes), (ii) transition ramp functions (for opening and closing hydraulic structures) and (iii) equivalent lagoon wetted area. Furthermore, an updated DRL method was implemented for optimising the operation of the hydraulic structures that compose La Rance. The achieved objective of this work was to verify the capabilities of an AI-Driven TRS model to appropriately predict (i) turbine power and (ii) lagoon water level variations. In addition, the observed operational strategy and yearly energy output of our AI-Driven model appeared to be comparable with those reported for the La Rance tidal barrage. The outcomes of this work (developed methodologies and DRL implementations) are generalisable and can be applied to other TRS projects. Furthermore, this work provided insights which allow for more realistic simulation of TRS operation, enabled through our AI-Driven model.  ( 2 min )
    Dual Task Framework for Debiasing Persona-grounded Dialogue Dataset. (arXiv:2202.05435v1 [cs.CL])
    This paper introduces a simple yet effective data-centric approach for the task of improving persona-conditioned dialogue agents. Prior model-centric approaches unquestioningly depend on the raw crowdsourced benchmark datasets such as Persona-Chat. In contrast, we aim to fix annotation artifacts in benchmarking, which is orthogonally applicable to any dialogue model. Specifically, we augment relevant personas to improve dialogue dataset/agent, by leveraging the primal-dual structure of the two tasks, predicting dialogue responses and personas based on each other. Experiments on Persona-Chat show that our approach outperforms pre-trained LMs by an 11.7 point gain in terms of accuracy.  ( 2 min )
    Scale-free Unconstrained Online Learning for Curved Losses. (arXiv:2202.05630v1 [cs.LG])
    A sequence of works in unconstrained online convex optimisation have investigated the possibility of adapting simultaneously to the norm $U$ of the comparator and the maximum norm $G$ of the gradients. In full generality, matching upper and lower bounds are known which show that this comes at the unavoidable cost of an additive $G U^3$, which is not needed when either $G$ or $U$ is known in advance. Surprisingly, recent results by Kempka et al. (2019) show that no such price for adaptivity is needed in the specific case of $1$-Lipschitz losses like the hinge loss. We follow up on this observation by showing that there is in fact never a price to pay for adaptivity if we specialise to any of the other common supervised online learning losses: our results cover log loss, (linear and non-parametric) logistic regression, square loss prediction, and (linear and non-parametric) least-squares regression. We also fill in several gaps in the literature by providing matching lower bounds with an explicit dependence on $U$. In all cases we obtain scale-free algorithms, which are suitably invariant under rescaling of the data. Our general goal is to establish achievable rates without concern for computational efficiency, but for linear logistic regression we also provide an adaptive method that is as efficient as the recent non-adaptive algorithm by Agarwal et al. (2021).  ( 2 min )
    Posterior Consistency for Bayesian Relevance Vector Machines. (arXiv:2202.05422v1 [stat.ML])
    Statistical modeling and inference problems with sample sizes substantially smaller than the number of available covariates are challenging. Chakraborty et al. (2012) did a full hierarchical Bayesian analysis of nonlinear regression in such situations using relevance vector machines based on reproducing kernel Hilbert space (RKHS). But they did not provide any theoretical properties associated with their procedure. The present paper revisits their problem, introduces a new class of global-local priors different from theirs, and provides results on posterior consistency as well as posterior contraction rates  ( 2 min )
    Minimax Regret Optimization for Robust Machine Learning under Distribution Shift. (arXiv:2202.05436v1 [cs.LG])
    In this paper, we consider learning scenarios where the learned model is evaluated under an unknown test distribution which potentially differs from the training distribution (i.e. distribution shift). The learner has access to a family of weight functions such that the test distribution is a reweighting of the training distribution under one of these functions, a setting typically studied under the name of Distributionally Robust Optimization (DRO). We consider the problem of deriving regret bounds in the classical learning theory setting, and require that the resulting regret bounds hold uniformly for all potential test distributions. We show that the DRO formulation does not guarantee uniformly small regret under distribution shift. We instead propose an alternative method called Minimax Regret Optimization (MRO), and show that under suitable conditions this method achieves uniformly low regret across all test distributions. We also adapt our technique to have stronger guarantees when the test distributions are heterogeneous in their similarity to the training data. Given the widespead optimization of worst case risks in current approaches to robust machine learning, we believe that MRO can be a strong alternative to address distribution shift scenarios.  ( 2 min )
    Achieving Minimax Rates in Pool-Based Batch Active Learning. (arXiv:2202.05448v1 [cs.LG])
    We consider a batch active learning scenario where the learner adaptively issues batches of points to a labeling oracle. Sampling labels in batches is highly desirable in practice due to the smaller number of interactive rounds with the labeling oracle (often human beings). However, batch active learning typically pays the price of a reduced adaptivity, leading to suboptimal results. In this paper we propose a solution which requires a careful trade off between the informativeness of the queried points and their diversity. We theoretically investigate batch active learning in the practically relevant scenario where the unlabeled pool of data is available beforehand (pool-based active learning). We analyze a novel stage-wise greedy algorithm and show that, as a function of the label complexity, the excess risk of this algorithm operating in the realizable setting for which we prove matches the known minimax rates in standard statistical learning settings. Our results also exhibit a mild dependence on the batch size. These are the first theoretical results that employ careful trade offs between informativeness and diversity to rigorously quantify the statistical performance of batch active learning in the pool-based scenario.  ( 2 min )
    On change of measure inequalities for $f$-divergences. (arXiv:2202.05568v1 [stat.ML])
    We propose new change of measure inequalities based on $f$-divergences (of which the Kullback-Leibler divergence is a particular case). Our strategy relies on combining the Legendre transform of $f$-divergences and the Young-Fenchel inequality. By exploiting these new change of measure inequalities, we derive new PAC-Bayesian generalisation bounds with a complexity involving $f$-divergences, and holding in mostly unchartered settings (such as heavy-tailed losses). We instantiate our results for the most popular $f$-divergences.  ( 2 min )
    Do People Engage Cognitively with AI? Impact of AI Assistance on Incidental Learning. (arXiv:2202.05402v1 [cs.HC])
    When people receive advice while making difficult decisions, they often make better decisions in the moment and also increase their knowledge in the process. However, such incidental learning can only occur when people cognitively engage with the information they receive and process this information thoughtfully. How do people process the information and advice they receive from AI, and do they engage with it deeply enough to enable learning? To answer these questions, we conducted three experiments in which individuals were asked to make nutritional decisions and received simulated AI recommendations and explanations. In the first experiment, we found that when people were presented with both a recommendation and an explanation before making their choice, they made better decisions than they did when they received no such help, but they did not learn. In the second experiment, participants first made their own choice, and only then saw a recommendation and an explanation from AI; this condition also resulted in improved decisions, but no learning. However, in our third experiment, participants were presented with just an AI explanation but no recommendation and had to arrive at their own decision. This condition led to both more accurate decisions and learning gains. We hypothesize that learning gains in this condition were due to deeper engagement with explanations needed to arrive at the decisions. This work provides some of the most direct evidence to date that it may not be sufficient to include explanations together with AI-generated recommendation to ensure that people engage carefully with the AI-provided information. This work also presents one technique that enables incidental learning and, by implication, can help people process AI recommendations and explanations more carefully.  ( 2 min )
    Understanding Curriculum Learning in Policy Optimization for Solving Combinatorial Optimization Problems. (arXiv:2202.05423v1 [cs.LG])
    Over the recent years, reinforcement learning (RL) has shown impressive performance in finding strategic solutions for game environments, and recently starts to show promising results in solving combinatorial optimization (CO) problems, inparticular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study on why RL helps is still at its early stage. This paper presents the first systematic study on policy optimization methods for solving CO problems. We show that CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical combinatorial problem, Secretary Problem, we formally prove that distribution shift is reduced exponentially with curriculum learning. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on Secretary Problem and Online Knapsack to empirically verify our findings.  ( 2 min )
    A Lightweight, Efficient and Explainable-by-Design Convolutional Neural Network for Internet Traffic Classification. (arXiv:2202.05535v1 [cs.LG])
    Traffic classification, i.e. the identification of the type of applications flowing in a network, is a strategic task for numerous activities (e.g., intrusion detection, routing). This task faces some critical challenges that current deep learning approaches do not address. The design of current approaches do not take into consideration the fact that networking hardware (e.g., routers) often runs with limited computational resources. Further, they do not meet the need for faithful explainability highlighted by regulatory bodies. Finally, these traffic classifiers are evaluated on small datasets which fail to reflect the diversity of applications in real commercial settings. Therefore, this paper introduces a Lightweight, Efficient and eXplainable-by-design convolutional neural network (LEXNet) for Internet traffic classification, which relies on a new residual block (for lightweight and efficiency purposes) and prototype layer (for explainability). Based on a commercial-grade dataset, our evaluation shows that LEXNet succeeds to maintain the same accuracy as the best performing state-of-the-art neural network, while providing the additional features previously mentioned. Moreover, we demonstrate that LEXNet significantly reduces the model size and inference time compared to the state-of-the-art neural networks with explainability-by-design and post hoc explainability methods. Finally, we illustrate the explainability feature of our approach, which stems from the communication of detected application prototypes to the end-user, and we highlight the faithfulness of LEXNet explanations through a comparison with post hoc methods.  ( 2 min )
    ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning. (arXiv:2202.05451v1 [cs.CV])
    Recent research that applies Transformer-based architectures to image captioning has resulted in state-of-the-art image captioning performance, capitalising on the success of Transformers on natural language tasks. Unfortunately, though these models work well, one major flaw is their large model sizes. To this end, we present three parameter reduction methods for image captioning Transformers: Radix Encoding, cross-layer parameter sharing, and attention parameter sharing. By combining these methods, our proposed ACORT models have 3.7x to 21.6x fewer parameters than the baseline model without compromising test performance. Results on the MS-COCO dataset demonstrate that our ACORT models are competitive against baselines and SOTA approaches, with CIDEr score >=126. Finally, we present qualitative results and ablation studies to demonstrate the efficacy of the proposed changes further. Code and pre-trained models are publicly available at https://github.com/jiahuei/sparse-image-captioning.  ( 2 min )
    DDoS-UNet: Incorporating temporal information using Dynamic Dual-channel UNet for enhancing super-resolution of dynamic MRI. (arXiv:2202.05355v1 [physics.med-ph])
    Magnetic resonance imaging (MRI) provides high spatial resolution and excellent soft-tissue contrast without using harmful ionising radiation. Dynamic MRI is an essential tool for interventions to visualise movements or changes of the target organ. However, such MRI acquisition with high temporal resolution suffers from limited spatial resolution - also known as the spatio-temporal trade-off of dynamic MRI. Several approaches, including deep learning based super-resolution approaches, have been proposed to mitigate this trade-off. Nevertheless, such an approach typically aims to super-resolve each time-point separately, treating them as individual volumes. This research addresses the problem by creating a deep learning model which attempts to learn both spatial and temporal relationships. A modified 3D UNet model, DDoS-UNet, is proposed - which takes the low-resolution volume of the current time-point along with a prior image volume. Initially, the network is supplied with a static high-resolution planning scan as the prior image along with the low-resolution input to super-resolve the first time-point. Then it continues step-wise by using the super-resolved time-points as the prior image while super-resolving the subsequent time-points. The model performance was tested with 3D dynamic data that was undersampled to different in-plane levels. The proposed network achieved an average SSIM value of 0.951$\pm$0.017 while reconstructing the lowest resolution data (i.e. only 4\% of the k-space acquired) - which could result in a theoretical acceleration factor of 25. The proposed approach can be used to reduce the required scan-time while achieving high spatial resolution.  ( 2 min )
    A Survey on Programmatic Weak Supervision. (arXiv:2202.05433v1 [cs.LG])
    Labeling training data has become one of the major roadblocks to using machine learning. Among various weak supervision paradigms, programmatic weak supervision (PWS) has achieved remarkable success in easing the manual labeling bottleneck by programmatically synthesizing training labels from multiple potentially noisy supervision sources. This paper presents a comprehensive survey of recent advances in PWS. In particular, we give a brief introduction of the PWS learning paradigm, and review representative approaches for each component within PWS's learning workflow. In addition, we discuss complementary learning paradigms for tackling limited labeled data scenarios and how these related approaches can be used in conjunction with PWS. Finally, we identify several critical challenges that remain under-explored in the area to hopefully inspire future research directions in the field.  ( 2 min )
    Dynamic Background Subtraction by Generative Neural Networks. (arXiv:2202.05336v1 [eess.IV])
    Background subtraction is a significant task in computer vision and an essential step for many real world applications. One of the challenges for background subtraction methods is dynamic background, which constitute stochastic movements in some parts of the background. In this paper, we have proposed a new background subtraction method, called DBSGen, which uses two generative neural networks, one for dynamic motion removal and another for background generation. At the end, the foreground moving objects are obtained by a pixel-wise distance threshold based on a dynamic entropy map. The proposed method has a unified framework that can be optimized in an end-to-end and unsupervised fashion. The performance of the method is evaluated over dynamic background sequences and it outperforms most of state-of-the-art methods. Our code is publicly available at https://github.com/FatemeBahri/DBSGen.  ( 2 min )
  • Open

    InterpretTime: a new approach for the systematic evaluation of neural-network interpretability in time series classification. (arXiv:2202.05656v1 [cs.LG])
    We present a novel approach to evaluate the performance of interpretability methods for time series classification, and propose a new strategy to assess the similarity between domain experts and machine data interpretation. The novel approach leverages a new family of synthetic datasets and introduces new interpretability evaluation metrics. The approach addresses several common issues encountered in the literature, and clearly depicts how well an interpretability method is capturing neural network's data usage, providing a systematic interpretability evaluation framework. The new methodology highlights the superiority of Shapley Value Sampling and Integrated Gradients for interpretability in time-series classification tasks.  ( 2 min )
    Bernstein Flows for Flexible Posteriors in Variational Bayes. (arXiv:2202.05650v1 [stat.ML])
    Variational inference (VI) is a technique to approximate difficult to compute posteriors by optimization. In contrast to MCMC, VI scales to many observations. In the case of complex posteriors, however, state-of-the-art VI approaches often yield unsatisfactory posterior approximations. This paper presents Bernstein flow variational inference (BF-VI), a robust and easy-to-use method, flexible enough to approximate complex multivariate posteriors. BF-VI combines ideas from normalizing flows and Bernstein polynomial-based transformation models. In benchmark experiments, we compare BF-VI solutions with exact posteriors, MCMC solutions, and state-of-the-art VI methods including normalizing flow based VI. We show for low-dimensional models that BF-VI accurately approximates the true posterior; in higher-dimensional models, BF-VI outperforms other VI methods. Further, we develop with BF-VI a Bayesian model for the semi-structured Melanoma challenge data, combining a CNN model part for image data with an interpretable model part for tabular data, and demonstrate for the first time how the use of VI in semi-structured models.  ( 2 min )
    Minimax Regret Optimization for Robust Machine Learning under Distribution Shift. (arXiv:2202.05436v1 [cs.LG])
    In this paper, we consider learning scenarios where the learned model is evaluated under an unknown test distribution which potentially differs from the training distribution (i.e. distribution shift). The learner has access to a family of weight functions such that the test distribution is a reweighting of the training distribution under one of these functions, a setting typically studied under the name of Distributionally Robust Optimization (DRO). We consider the problem of deriving regret bounds in the classical learning theory setting, and require that the resulting regret bounds hold uniformly for all potential test distributions. We show that the DRO formulation does not guarantee uniformly small regret under distribution shift. We instead propose an alternative method called Minimax Regret Optimization (MRO), and show that under suitable conditions this method achieves uniformly low regret across all test distributions. We also adapt our technique to have stronger guarantees when the test distributions are heterogeneous in their similarity to the training data. Given the widespead optimization of worst case risks in current approaches to robust machine learning, we believe that MRO can be a strong alternative to address distribution shift scenarios.  ( 2 min )
    Personalization Improves Privacy-Accuracy Tradeoffs in Federated Optimization. (arXiv:2202.05318v1 [stat.ML])
    Large-scale machine learning systems often involve data distributed across a collection of users. Federated optimization algorithms leverage this structure by communicating model updates to a central server, rather than entire datasets. In this paper, we study stochastic optimization algorithms for a personalized federated learning setting involving local and global models subject to user-level (joint) differential privacy. While learning a private global model induces a cost of privacy, local learning is perfectly private. We show that coordinating local learning with private centralized learning yields a generically useful and improved tradeoff between accuracy and privacy. We illustrate our theoretical results with experiments on synthetic and real-world datasets.  ( 2 min )
    What Does it Mean for a Language Model to Preserve Privacy?. (arXiv:2202.05520v1 [stat.ML])
    Natural language reflects our private lives and identities, making its privacy concerns as broad as those of real life. Language models lack the ability to understand the context and sensitivity of text, and tend to memorize phrases present in their training sets. An adversary can exploit this tendency to extract training data. Depending on the nature of the content and the context in which this data was collected, this could violate expectations of privacy. Thus there is a growing interest in techniques for training language models that preserve privacy. In this paper, we discuss the mismatch between the narrow assumptions made by popular data protection techniques (data sanitization and differential privacy), and the broadness of natural language and of privacy as a social norm. We argue that existing protection methods cannot guarantee a generic and meaningful notion of privacy for language models. We conclude that language models should be trained on text data which was explicitly produced for public use.  ( 2 min )
    On change of measure inequalities for $f$-divergences. (arXiv:2202.05568v1 [stat.ML])
    We propose new change of measure inequalities based on $f$-divergences (of which the Kullback-Leibler divergence is a particular case). Our strategy relies on combining the Legendre transform of $f$-divergences and the Young-Fenchel inequality. By exploiting these new change of measure inequalities, we derive new PAC-Bayesian generalisation bounds with a complexity involving $f$-divergences, and holding in mostly unchartered settings (such as heavy-tailed losses). We instantiate our results for the most popular $f$-divergences.  ( 2 min )
    Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks. (arXiv:2107.07455v3 [cs.LG] UPDATED)
    There has been significant research done on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined developing standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation and robustness has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer significant challenges involving regression and discrete or continuous structured prediction. Thus, given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary. This will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselines. In this work, we propose the Shifts Dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, with each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, "in-the-wild" distributional shifts and pose interesting challenges with respect to uncertainty estimation. In this work we provide a description of the dataset and baseline results for all tasks.  ( 2 min )
    Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks. (arXiv:2202.05510v1 [cs.LG])
    Understanding implicit bias of gradient descent has been an important goal in machine learning research. Unfortunately, even for a single-neuron ReLU network, it recently proved impossible to characterize the implicit regularization with the square loss by an explicit function of the norm of model parameters. In order to close the gap between the existing theory and the intriguing empirical behavior of ReLU networks, here we examine the gradient flow dynamics in the parameter space when training single-neuron ReLU networks. Specifically, we discover implicit bias in terms of support vectors in ReLU networks, which play a key role in why and how ReLU networks generalize well. Moreover, we analyze gradient flows with respect to the magnitude of the norm of initialization, and show the impact of the norm in gradient dynamics. Lastly, under some conditions, we prove that the norm of the learned weight strictly increases on the gradient flow.  ( 2 min )
    Distributed saddle point problems for strongly concave-convex functions. (arXiv:2202.05812v1 [math.OC])
    In this paper, we propose GT-GDA, a distributed optimization method to solve saddle point problems of the form: $\min_{\mathbf{x}} \max_{\mathbf{y}} \{F(\mathbf{x},\mathbf{y}) :=G(\mathbf{x}) + \langle \mathbf{y}, \overline{P} \mathbf{x} \rangle - H(\mathbf{y})\}$, where the functions $G(\cdot)$, $H(\cdot)$, and the the coupling matrix $\overline{P}$ are distributed over a strongly connected network of nodes. GT-GDA is a first-order method that uses gradient tracking to eliminate the dissimilarity caused by heterogeneous data distribution among the nodes. In the most general form, GT-GDA includes a consensus over the local coupling matrices to achieve the optimal (unique) saddle point, however, at the expense of increased communication. To avoid this, we propose a more efficient variant GT-GDA-Lite that does not incur the additional communication and analyze its convergence in various scenarios. We show that GT-GDA converges linearly to the unique saddle point solution when $G(\cdot)$ is smooth and convex, $H(\cdot)$ is smooth and strongly convex, and the global coupling matrix $\overline{P}$ has full column rank. We further characterize the regime under which GT-GDA exhibits a network topology-independent convergence behavior. We next show the linear convergence of GT-GDA to an error around the unique saddle point, which goes to zero when the coupling cost ${\langle \mathbf y, \overline{P} \mathbf x \rangle}$ is common to all nodes, or when $G(\cdot)$ and $H(\cdot)$ are quadratic. Numerical experiments illustrate the convergence properties and importance of GT-GDA and GT-GDA-Lite for several applications.  ( 2 min )
    Stochastic Subgradient Descent on a Generic Definable Function Converges to a Minimizer. (arXiv:2109.02455v3 [math.OC] UPDATED)
    It was previously shown by Davis and Drusvyatskiy that every Clarke critical point of a generic, semialgebraic (and more generally definable in an o-minimal structure), weakly convex function is lying on an active manifold and is either a local minimum or an active strict saddle. In the first part of this work, we show that when the weak convexity assumption fails a third type of point appears: a sharply repulsive critical point. Moreover, we show that the corresponding active manifolds satisfy the Verdier and the angle conditions which were introduced by us in our previous work. In the second part of this work, we show that, under a density-like assumption on the perturbation sequence, the stochastic subgradient descent (SGD) avoids sharply repulsive critical points with probability one. We show that such a density-like assumption could be obtained upon adding a small random perturbation (e.g. a nondegenerate Gaussian) at each iteration of the algorithm. These results, combined with our previous work on the avoidance of active strict saddles, show that the SGD on a generic definable (e.g. semialgebraic) function converges to a local minimum.
    Computational-Statistical Gaps in Reinforcement Learning. (arXiv:2202.05444v1 [cs.LG])
    Reinforcement learning with function approximation has recently achieved tremendous results in applications with large state spaces. This empirical success has motivated a growing body of theoretical work proposing necessary and sufficient conditions under which efficient reinforcement learning is possible. From this line of work, a remarkably simple minimal sufficient condition has emerged for sample efficient reinforcement learning: MDPs with optimal value function $V^*$ and $Q^*$ linear in some known low-dimensional features. In this setting, recent works have designed sample efficient algorithms which require a number of samples polynomial in the feature dimension and independent of the size of state space. They however leave finding computationally efficient algorithms as future work and this is considered a major open problem in the community. In this work, we make progress on this open problem by presenting the first computational lower bound for RL with linear function approximation: unless NP=RP, no randomized polynomial time algorithm exists for deterministic transition MDPs with a constant number of actions and linear optimal value functions. To prove this, we show a reduction from Unique-Sat, where we convert a CNF formula into an MDP with deterministic transitions, constant number of actions and low dimensional linear optimal value functions. This result also exhibits the first computational-statistical gap in reinforcement learning with linear function approximation, as the underlying statistical problem is information-theoretically solvable with a polynomial number of queries, but no computationally efficient algorithm exists unless NP=RP. Finally, we also prove a quasi-polynomial time lower bound under the Randomized Exponential Time Hypothesis.
    Universal Learning Waveform Selection Strategies for Adaptive Target Tracking. (arXiv:2202.05294v1 [cs.IT])
    Online selection of optimal waveforms for target tracking with active sensors has long been a problem of interest. Many conventional solutions utilize an estimation-theoretic interpretation, in which a waveform-specific Cram\'{e}r-Rao lower bound on measurement error is used to select the optimal waveform for each tracking step. However, this approach is only valid in the high SNR regime, and requires a rather restrictive set of assumptions regarding the target motion and measurement models. Further, due to computational concerns, many traditional approaches are limited to near-term, or myopic, optimization, even though radar scenes exhibit strong temporal correlation. More recently, reinforcement learning has been proposed for waveform selection, in which the problem is framed as a Markov decision process (MDP), allowing for long-term planning. However, a major limitation of reinforcement learning is that the memory length of the underlying Markov process is often unknown for realistic target and channel dynamics, and a more general framework is desirable. This work develops a universal sequential waveform selection scheme which asymptotically achieves Bellman optimality in any radar scene which can be modeled as a $U^{\text{th}}$ order Markov process for a finite, but unknown, integer $U$. Our approach is based on well-established tools from the field of universal source coding, where a stationary source is parsed into variable length phrases in order to build a context-tree, which is used as a probabalistic model for the scene's behavior. We show that an algorithm based on a multi-alphabet version of the Context-Tree Weighting (CTW) method can be used to optimally solve a broad class of waveform-agile tracking problems while making minimal assumptions about the environment's behavior.
    Desiderata for Representation Learning: A Causal Perspective. (arXiv:2109.03795v2 [stat.ML] UPDATED)
    Representation learning constructs low-dimensional representations to summarize essential features of high-dimensional data. This learning problem is often approached by describing various desiderata associated with learned representations; e.g., that they be non-spurious, efficient, or disentangled. It can be challenging, however, to turn these intuitive desiderata into formal criteria that can be measured and enhanced based on observed data. In this paper, we take a causal perspective on representation learning, formalizing non-spuriousness and efficiency (in supervised representation learning) and disentanglement (in unsupervised representation learning) using counterfactual quantities and observable consequences of causal assertions. This yields computable metrics that can be used to assess the degree to which representations satisfy the desiderata of interest and learn non-spurious and disentangled representations from single observational datasets.
    Mitigating Bias in Calibration Error Estimation. (arXiv:2012.08668v3 [cs.LG] UPDATED)
    For an AI system to be reliable, the confidence it expresses in its decisions must match its accuracy. To assess the degree of match, examples are typically binned by confidence and the per-bin mean confidence and accuracy are compared. Most research in calibration focuses on techniques to reduce this empirical measure of calibration error, ECE_bin. We instead focus on assessing statistical bias in this empirical measure, and we identify better estimators. We propose a framework through which we can compute the bias of a particular estimator for an evaluation data set of a given size. The framework involves synthesizing model outputs that have the same statistics as common neural architectures on popular data sets. We find that binning-based estimators with bins of equal mass (number of instances) have lower bias than estimators with bins of equal width. Our results indicate two reliable calibration-error estimators: the debiased estimator (Brocker, 2012; Ferro and Fricker, 2012) and a method we propose, ECE_sweep, which uses equal-mass bins and chooses the number of bins to be as large as possible while preserving monotonicity in the calibration function. With these estimators, we observe improvements in the effectiveness of recalibration methods and in the detection of model miscalibration.
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v4 [cs.LG] UPDATED)
    The use of pessimism, when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in the hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    Fitting Sparse Markov Models to Categorical Time Series Using Regularization. (arXiv:2202.05485v1 [stat.ME])
    The major problem of fitting a higher order Markov model is the exponentially growing number of parameters. The most popular approach is to use a Variable Length Markov Chain (VLMC), which determines relevant contexts (recent pasts) of variable orders and form a context tree. A more general approach is called Sparse Markov Model (SMM), where all possible histories of order $m$ form a partition so that the transition probability vectors are identical for the histories belonging to a particular group. We develop an elegant method of fitting SMM using convex clustering, which involves regularization. The regularization parameter is selected using BIC criterion. Theoretical results demonstrate the model selection consistency of our method for large sample size. Extensive simulation studies under different set-up have been presented to measure the performance of our method. We apply this method to classify genome sequences, obtained from individuals affected by different viruses.
    Continual Learning with Invertible Generative Models. (arXiv:2202.05694v1 [cs.LG])
    Catastrophic forgetting (CF) happens whenever a neural network overwrites past knowledge while being trained on new tasks. Common techniques to handle CF include regularization of the weights (using, e.g., their importance on past tasks), and rehearsal strategies, where the network is constantly re-trained on past data. Generative models have also been applied for the latter, in order to have endless sources of data. In this paper, we propose a novel method that combines the strengths of regularization and generative-based rehearsal approaches. Our generative model consists of a normalizing flow (NF), a probabilistic and invertible neural network, trained on the internal embeddings of the network. By keeping a single NF throughout the training process, we show that our memory overhead remains constant. In addition, exploiting the invertibility of the NF, we propose a simple approach to regularize the network's embeddings with respect to past tasks. We show that our method performs favorably with espect to state-of-the-art approaches in the literature, with bounded computational power and memory overheads.
    Robust estimation algorithms don't need to know the corruption level. (arXiv:2202.05453v1 [cs.LG])
    Real data are rarely pure. Hence the past half-century has seen great interest in robust estimation algorithms that perform well even when part of the data is corrupt. However, their vast majority approach optimal accuracy only when given a tight upper bound on the fraction of corrupt data. Such bounds are not available in practice, resulting in weak guarantees and often poor performance. This brief note abstracts the complex and pervasive robustness problem into a simple geometric puzzle. It then applies the puzzle's solution to derive a universal meta technique that converts any robust estimation algorithm requiring a tight corruption-level upper bound to achieve its optimal accuracy into one achieving essentially the same accuracy without using any upper bounds.
    Graphon-aided Joint Estimation of Multiple Graphs. (arXiv:2202.05686v1 [stat.ML])
    We consider the problem of estimating the topology of multiple networks from nodal observations, where these networks are assumed to be drawn from the same (unknown) random graph model. We adopt a graphon as our random graph model, which is a nonparametric model from which graphs of potentially different sizes can be drawn. The versatility of graphons allows us to tackle the joint inference problem even for the cases where the graphs to be recovered contain different number of nodes and lack precise alignment across the graphs. Our solution is based on combining a maximum likelihood penalty with graphon estimation schemes and can be used to augment existing network inference methods. We validate our proposed approach by comparing its performance against competing methods in synthetic and real-world datasets.  ( 2 min )
    A PDE-Based Analysis of the Symmetric Two-Armed Bernoulli Bandit. (arXiv:2202.05767v1 [cs.LG])
    This work addresses a version of the two-armed Bernoulli bandit problem where the sum of the means of the arms is one (the symmetric two-armed Bernoulli bandit). In a regime where the gap between these means goes to zero and the number of prediction periods approaches infinity, we obtain the leading order terms of the expected regret and pseudoregret for this problem by associating each of them with a solution of a linear parabolic partial differential equation. Our results improve upon the previously known results; specifically we explicitly compute the leading order term of the optimal regret and pseudoregret in three different scaling regimes for the gap. Additionally, we obtain new non-asymptotic bounds for any given time horizon.  ( 2 min )
    Active Privacy-Utility Trade-off Against Inference in Time-Series Data Sharing. (arXiv:2202.05833v1 [cs.IT])
    Internet of things (IoT) devices, such as smart meters, smart speakers and activity monitors, have become highly popular thanks to the services they offer. However, in addition to their many benefits, they raise privacy concerns since they share fine-grained time-series user data with untrusted third parties. In this work, we consider a user releasing her data containing personal information in return of a service from an honest-but-curious service provider (SP). We model user's personal information as two correlated random variables (r.v.'s), one of them, called the secret variable, is to be kept private, while the other, called the useful variable, is to be disclosed for utility. We consider active sequential data release, where at each time step the user chooses from among a finite set of release mechanisms, each revealing some information about the user's personal information, i.e., the true values of the r.v.'s, albeit with different statistics. The user manages data release in an online fashion such that the maximum amount of information is revealed about the latent useful variable as quickly as possible, while the confidence for the sensitive variable is kept below a predefined level. For privacy measure, we consider both the probability of correctly detecting the true value of the secret and the mutual information (MI) between the secret and the released data. We formulate both problems as partially observable Markov decision processes (POMDPs), and numerically solve them by advantage actor-critic (A2C) deep reinforcement learning (DRL). We evaluate the privacy-utility trade-off (PUT) of the proposed policies on both the synthetic data and smoking activity dataset, and show their validity by testing the activity detection accuracy of the SP modeled by a long short-term memory (LSTM) neural network.  ( 2 min )
    Measuring dissimilarity with diffeomorphism invariance. (arXiv:2202.05614v1 [stat.ML])
    Measures of similarity (or dissimilarity) are a key ingredient to many machine learning algorithms. We introduce DID, a pairwise dissimilarity measure applicable to a wide range of data spaces, which leverages the data's internal structure to be invariant to diffeomorphisms. We prove that DID enjoys properties which make it relevant for theoretical study and practical use. By representing each datum as a function, DID is defined as the solution to an optimization problem in a Reproducing Kernel Hilbert Space and can be expressed in closed-form. In practice, it can be efficiently approximated via Nystr\"om sampling. Empirical experiments support the merits of DID.  ( 2 min )
    Long-Time Convergence and Propagation of Chaos for Nonlinear MCMC. (arXiv:2202.05621v1 [stat.ML])
    In this paper, we study the long-time convergence and uniform strong propagation of chaos for a class of nonlinear Markov chains for Markov chain Monte Carlo (MCMC). Our technique is quite simple, making use of recent contraction estimates for linear Markov kernels and basic techniques from Markov theory and analysis. Moreover, the same proof strategy applies to both the long-time convergence and propagation of chaos. We also show, via some experiments, that these nonlinear MCMC techniques are viable for use in real-world high-dimensional inference such as Bayesian neural networks.  ( 2 min )
    A Field of Experts Prior for Adapting Neural Networks at Test Time. (arXiv:2202.05271v1 [cs.CV])
    Performance of convolutional neural networks (CNNs) in image analysis tasks is often marred in the presence of acquisition-related distribution shifts between training and test images. Recently, it has been proposed to tackle this problem by fine-tuning trained CNNs for each test image. Such test-time-adaptation (TTA) is a promising and practical strategy for improving robustness to distribution shifts as it requires neither data sharing between institutions nor annotating additional data. Previous TTA methods use a helper model to increase similarity between outputs and/or features extracted from a test image with those of the training images. Such helpers, which are typically modeled using CNNs, can be task-specific and themselves vulnerable to distribution shifts in their inputs. To overcome these problems, we propose to carry out TTA by matching the feature distributions of test and training images, as modelled by a field-of-experts (FoE) prior. FoEs model complicated probability distributions as products of many simpler expert distributions. We use 1D marginal distributions of a trained task CNN's features as experts in the FoE model. Further, we compute principal components of patches of the task CNN's features, and consider the distributions of PCA loadings as additional experts. We validate the method on 5 MRI segmentation tasks (healthy tissues in 4 anatomical regions and lesions in 1 one anatomy), using data from 17 clinics, and on a MRI registration task, using data from 3 clinics. We find that the proposed FoE-based TTA is generically applicable in multiple tasks, and outperforms all previous TTA methods for lesion segmentation. For healthy tissue segmentation, the proposed method outperforms other task-agnostic methods, but a previous TTA method which is specifically designed for segmentation performs the best for most of the tested datasets. Our code is publicly available.  ( 2 min )
    The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance. (arXiv:2202.05791v1 [stat.ML])
    We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.  ( 2 min )
    Achieving Minimax Rates in Pool-Based Batch Active Learning. (arXiv:2202.05448v1 [cs.LG])
    We consider a batch active learning scenario where the learner adaptively issues batches of points to a labeling oracle. Sampling labels in batches is highly desirable in practice due to the smaller number of interactive rounds with the labeling oracle (often human beings). However, batch active learning typically pays the price of a reduced adaptivity, leading to suboptimal results. In this paper we propose a solution which requires a careful trade off between the informativeness of the queried points and their diversity. We theoretically investigate batch active learning in the practically relevant scenario where the unlabeled pool of data is available beforehand (pool-based active learning). We analyze a novel stage-wise greedy algorithm and show that, as a function of the label complexity, the excess risk of this algorithm operating in the realizable setting for which we prove matches the known minimax rates in standard statistical learning settings. Our results also exhibit a mild dependence on the batch size. These are the first theoretical results that employ careful trade offs between informativeness and diversity to rigorously quantify the statistical performance of batch active learning in the pool-based scenario.  ( 2 min )
    Fast and Robust Sparsity Learning over Networks: A Decentralized Surrogate Median Regression Approach. (arXiv:2202.05498v1 [stat.ML])
    Decentralized sparsity learning has attracted a significant amount of attention recently due to its rapidly growing applications. To obtain the robust and sparse estimators, a natural idea is to adopt the non-smooth median loss combined with a $\ell_1$ sparsity regularizer. However, most of the existing methods suffer from slow convergence performance caused by the {\em double} non-smooth objective. To accelerate the computation, in this paper, we proposed a decentralized surrogate median regression (deSMR) method for efficiently solving the decentralized sparsity learning problem. We show that our proposed algorithm enjoys a linear convergence rate with a simple implementation. We also investigate the statistical guarantee, and it shows that our proposed estimator achieves a near-oracle convergence rate without any restriction on the number of network nodes. Moreover, we establish the theoretical results for sparse support recovery. Thorough numerical experiments and real data study are provided to demonstrate the effectiveness of our method.  ( 2 min )
    Predicting Out-of-Distribution Error with the Projection Norm. (arXiv:2202.05834v1 [cs.LG])
    We propose a metric -- Projection Norm -- to predict a model's performance on out-of-distribution (OOD) data without access to ground truth labels. Projection Norm first uses model predictions to pseudo-label test samples and then trains a new model on the pseudo-labels. The more the new model's parameters differ from an in-distribution model, the greater the predicted OOD error. Empirically, our approach outperforms existing methods on both image and text classification tasks and across different network architectures. Theoretically, we connect our approach to a bound on the test error for overparameterized linear models. Furthermore, we find that Projection Norm is the only approach that achieves non-trivial detection performance on adversarial examples. Our code is available at https://github.com/yaodongyu/ProjNorm.  ( 2 min )
    EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits. (arXiv:2110.03177v4 [cs.LG] UPDATED)
    In this paper, we propose a novel neural-based exploration strategy in contextual bandits, EE-Net, distinct from the standard UCB-based and TS-based approaches. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of works explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural-based bandit algorithms have been proposed, where a neural network is assigned to learn the underlying reward function and TS or UCB are adapted for exploration. In this paper, we propose "EE-Net", a neural-based bandit approach with a novel exploration strategy. In addition to utilizing a neural network (Exploitation network) to learn the reward function, EE-Net adopts another neural network (Exploration network) to adaptively learn potential gains compared to currently estimated reward for making explorations. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net can achieve $\mathcal{O}(\sqrt{T\log T})$ regret, which is tighter than existing state-of-the-art neural bandit algorithms ($\mathcal{O}(\sqrt{T}\log T)$ for both UCB-based and TS-based). Through extensive experiments on four real-world datasets, we show that EE-Net outperforms existing linear and neural bandit approaches.  ( 2 min )
    SoK: Certified Robustness for Deep Neural Networks. (arXiv:2009.04131v6 [cs.LG] UPDATED)
    Great advances in deep neural networks (DNNs) have led to state-of-the-art performance on a wide range of tasks. However, recent studies have shown that DNNs are vulnerable to adversarial attacks, which have brought great concerns when deploying these models to safety-critical applications such as autonomous driving. Different defense approaches have been proposed against adversarial attacks, including: a) empirical defenses, which usually can be adaptively attacked again without providing robustness certification; and b) certifiably robust approaches which consist of robustness verification providing the lower bound of robust accuracy against any attacks under certain conditions and corresponding robust training approaches. In this paper, we systematize the certifiably robust approaches and related practical and theoretical implications and findings. We also provide the first comprehensive benchmark on existing robustness verification and training approaches on different datasets. In particular, we 1) provide a taxonomy for the robustness verification and training approaches, as well as summarize the methodologies for representative algorithms, 2) reveal the characteristics, strengths, limitations, and fundamental connections among these approaches, 3) discuss current research progresses, theoretical barriers, main challenges, and future directions for certifiably robust approaches for DNNs, and 4) provide an open-sourced unified platform to evaluate over 20 representative certifiably robust approaches for a wide range of DNNs.  ( 2 min )
    A Characterization of Semi-Supervised Adversarially-Robust PAC Learnability. (arXiv:2202.05420v1 [cs.LG])
    We study the problem of semi-supervised learning of an adversarially-robust predictor in the PAC model, where the learner has access to both labeled and unlabeled examples. The sample complexity in semi-supervised learning has two parameters, the number of labeled examples and the number of unlabeled examples. We consider the complexity measures, $VC_U \leq dim_U \leq VC$ and $VC^*$, where $VC$ is the standard $VC$-dimension, $VC^*$ is its dual, and the other two measures appeared in Montasser et al. (2019). The best sample bound known for robust supervised PAC learning is $O(VC \cdot VC^*)$, and we will compare our sample bounds to $\Lambda$ which is the minimal number of labeled examples required by any robust supervised PAC learning algorithm. Our main results are the following: (1) in the realizable setting it is sufficient to have $O(VC_U)$ labeled examples and $O(\Lambda)$ unlabeled examples. (2) In the agnostic setting, let $\eta$ be the minimal agnostic error. The sample complexity depends on the resulting error rate. If we allow an error of $2\eta+\epsilon$, it is still sufficient to have $O(VC_U)$ labeled examples and $O(\Lambda)$ unlabeled examples. If we insist on having an error $\eta+\epsilon$ then $\Omega(dim_U)$ labeled examples are necessary, as in the supervised case. The above results show that there is a significant benefit in semi-supervised robust learning, as there are hypothesis classes with $VC_U=0$ and $dim_U$ arbitrary large. In supervised learning, having access only to labeled examples requires at least $\Lambda \geq dim_U$ labeled examples. Semi-supervised require only $O(1)$ labeled examples and $O(\Lambda)$ unlabeled examples. A byproduct of our result is that if we assume that the distribution is robustly realizable by a hypothesis class, then with respect to the 0-1 loss we can learn with only $O(VC_U)$ labeled examples, even if the $VC$ is infinite.  ( 2 min )
    Inference and FDR Control for Simulated Ising Models in High-dimension. (arXiv:2202.05612v1 [stat.ML])
    This paper studies the consistency and statistical inference of simulated Ising models in the high dimensional background. Our estimators are based on the Markov chain Monte Carlo maximum likelihood estimation (MCMC-MLE) method penalized by the Elastic-net. Under mild conditions that ensure a specific convergence rate of MCMC method, the $\ell_{1}$ consistency of Elastic-net-penalized MCMC-MLE is proved. We further propose a decorrelated score test based on the decorrelated score function and prove the asymptotic normality of the score function without the influence of many nuisance parameters under the assumption that accelerates the convergence of the MCMC method. The one-step estimator for a single parameter of interest is purposed by linearizing the decorrelated score function to solve its root, as well as its normality and confidence interval for the true value, therefore, be established. Finally, we use different algorithms to control the false discovery rate (FDR) via traditional p-values and novel e-values.  ( 2 min )
    Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning. (arXiv:2106.04895v2 [cs.LG] UPDATED)
    Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp offline reduction algorithm -- which simply executes $\mu$ and runs offline policy optimization on the collected dataset -- that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment. This implies that -- perhaps surprisingly -- the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step.  ( 2 min )
    Inference of Multiscale Gaussian Graphical Model. (arXiv:2202.05775v1 [stat.ML])
    Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, psychometry. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of GGM is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method allowing to simultaneously infer a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused type lasso penalty. Results on real and synthetic data are presented.  ( 2 min )
    ProMP: Proximal Meta-Policy Search. (arXiv:1810.06784v4 [cs.LG] UPDATED)
    Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm endows efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.  ( 2 min )
    A Witness Two-Sample Test. (arXiv:2102.05573v3 [cs.LG] UPDATED)
    The Maximum Mean Discrepancy (MMD) has been the state-of-the-art nonparametric test for tackling the two-sample problem. Its statistic is given by the difference in expectations of the witness function, a real-valued function defined as a weighted sum of kernel evaluations on a set of basis points. Typically the kernel is optimized on a training set, and hypothesis testing is performed on a separate test set to avoid overfitting (i.e., control type-I error). That is, the test set is used to simultaneously estimate the expectations and define the basis points, while the training set only serves to select the kernel and is discarded. In this work, we propose to use the training data to also define the weights and the basis points for better data efficiency. We show that 1) the new test is consistent and has a well-controlled type-I error; 2) the optimal witness function is given by a precision-weighted mean in the reproducing kernel Hilbert space associated with the kernel; and 3) the test power of the proposed test is comparable or exceeds that of the MMD and other modern tests, as verified empirically on challenging synthetic and real problems (e.g., Higgs data).  ( 2 min )
    TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors. (arXiv:2012.08456v2 [cs.RO] UPDATED)
    Simulators perform an important role in prototyping, debugging, and benchmarking new advances in robotics and learning for control. Although many physics engines exist, some aspects of the real world are harder than others to simulate. One of the aspects that have so far eluded accurate simulation is touch sensing. To address this gap, we present TACTO - a fast, flexible, and open-source simulator for vision-based tactile sensors. This simulator allows to render realistic high-resolution touch readings at hundreds of frames per second, and can be easily configured to simulate different vision-based tactile sensors, including DIGIT and OmniTact. In this paper, we detail the principles that drove the implementation of TACTO and how they are reflected in its architecture. We demonstrate TACTO on a perceptual task, by learning to predict grasp stability using touch from 1 million grasps, and on a marble manipulation control task. Moreover, we provide a proof-of-concept that TACTO can be successfully used for Sim2Real applications. We believe that TACTO is a step towards the widespread adoption of touch sensing in robotic applications, and to enable machine learning practitioners interested in multi-modal learning and control. TACTO is open-source at https://github.com/facebookresearch/tacto.  ( 2 min )
    A Single-Timescale Method for Stochastic Bilevel Optimization. (arXiv:2102.04671v3 [math.OC] UPDATED)
    Stochastic bilevel optimization generalizes the classic stochastic optimization from the minimization of a single objective to the minimization of an objective function that depends the solution of another optimization problem. Recently, stochastic bilevel optimization is regaining popularity in emerging machine learning applications such as hyper-parameter optimization and model-agnostic meta learning. To solve this class of stochastic optimization problems, existing methods require either double-loop or two-timescale updates, which are sometimes less efficient. This paper develops a new optimization method for a class of stochastic bilevel problems that we term Single-Timescale stochAstic BiLevEl optimization (STABLE) method. STABLE runs in a single loop fashion, and uses a single-timescale update with a fixed batch size. To achieve an $\epsilon$-stationary point of the bilevel problem, STABLE requires ${\cal O}(\epsilon^{-2})$ samples in total; and to achieve an $\epsilon$-optimal solution in the strongly convex case, STABLE requires ${\cal O}(\epsilon^{-1})$ samples. To the best of our knowledge, this is the first bilevel optimization algorithm achieving the same order of sample complexity as the stochastic gradient descent method for the single-level stochastic optimization.  ( 2 min )
    Error-feedback stochastic modeling strategy for time series forecasting with convolutional neural networks. (arXiv:2002.00717v2 [cs.LG] UPDATED)
    Despite the superiority of convolutional neural networks demonstrated in time series modeling and forecasting, it has not been fully explored on the design of the neural network architecture and the tuning of the hyper-parameters. Inspired by the incremental construction strategy for building a random multilayer perceptron, we propose a novel Error-feedback Stochastic Modeling (ESM) strategy to construct a random Convolutional Neural Network (ESM-CNN) for time series forecasting task, which builds the network architecture adaptively. The ESM strategy suggests that random filters and neurons of the error-feedback fully connected layer are incrementally added to steadily compensate the prediction error during the construction process, and then a filter selection strategy is introduced to enable ESM-CNN to extract the different size of temporal features, providing helpful information at each iterative process for the prediction. The performance of ESM-CNN is justified on its prediction accuracy of one-step-ahead and multi-step-ahead forecasting tasks respectively. Comprehensive experiments on both the synthetic and real-world datasets show that the proposed ESM-CNN not only outperforms the state-of-art random neural networks, but also exhibits stronger predictive power and less computing overhead in comparison to trained state-of-art deep neural network models.  ( 2 min )
    Bounded nonlinear forecasts of partially observed geophysical systems with physics-constrained deep learning. (arXiv:2202.05750v1 [stat.ML])
    The complexity of real-world geophysical systems is often compounded by the fact that the observed measurements depend on hidden variables. These latent variables include unresolved small scales and/or rapidly evolving processes, partially observed couplings, or forcings in coupled systems. This is the case in ocean-atmosphere dynamics, for which unknown interior dynamics can affect surface observations. The identification of computationally-relevant representations of such partially-observed and highly nonlinear systems is thus challenging and often limited to short-term forecast applications. Here, we investigate the physics-constrained learning of implicit dynamical embeddings, leveraging neural ordinary differential equation (NODE) representations. A key objective is to constrain their boundedness, which promotes the generalization of the learned dynamics to arbitrary initial condition. The proposed architecture is implemented within a deep learning framework, and its relevance is demonstrated with respect to state-of-the-art schemes for different case-studies representative of geophysical dynamics.  ( 2 min )
    Controlling Confusion via Generalisation Bounds. (arXiv:2202.05560v1 [stat.ML])
    We establish new generalisation bounds for multiclass classification by abstracting to a more general setting of discretised error types. Extending the PAC-Bayes theory, we are hence able to provide fine-grained bounds on performance for multiclass classification, as well as applications to other learning problems including discretisation of regression losses. Tractable training objectives are derived from the bounds. The bounds are uniform over all weightings of the discretised error types and thus can be used to bound weightings not foreseen at training, including the full confusion matrix in the multiclass classification case.  ( 2 min )
    Matrix-wise $\ell_0$-constrained Sparse Nonnegative Least Squares. (arXiv:2011.11066v3 [cs.LG] UPDATED)
    Nonnegative least squares problems with multiple right-hand sides (MNNLS) arise in models that rely on additive linear combinations. In particular, they are at the core of most nonnegative matrix factorization algorithms and have many applications. The nonnegativity constraint is known to naturally favor sparsity, that is, solutions with few non-zero entries. However, it is often useful to further enhance this sparsity, as it improves the interpretability of the results and helps reducing noise, which leads to the sparse MNNLS problem. In this paper, as opposed to most previous works that enforce sparsity column- or row-wise, we first introduce a novel formulation for sparse MNNLS, with a matrix-wise sparsity constraint. Then, we present a two-step algorithm to tackle this problem. The first step divides sparse MNNLS in subproblems, one per column of the original problem. It then uses different algorithms to produce, either exactly or approximately, a Pareto front for each subproblem, that is, to produce a set of solutions representing different tradeoffs between reconstruction error and sparsity. The second step selects solutions among these Pareto fronts in order to build a sparsity-constrained matrix that minimizes the reconstruction error. We perform experiments on facial and hyperspectral images, and we show that our proposed two-step approach provides more accurate results than state-of-the-art sparse coding heuristics applied both column-wise and globally.  ( 2 min )
    Model Selection for Time Series Forecasting: Empirical Analysis of Different Estimators. (arXiv:2104.00584v2 [stat.ML] UPDATED)
    Evaluating predictive models is a crucial task in predictive analytics. This process is especially challenging with time series data where the observations show temporal dependencies. Several studies have analysed how different performance estimation methods compare with each other for approximating the true loss incurred by a given forecasting model. However, these studies do not address how the estimators behave for model selection: the ability to select the best solution among a set of alternatives. We address this issue and compare a set of estimation methods for model selection in time series forecasting tasks. We attempt to answer two main questions: (i) how often is the best possible model selected by the estimators; and (ii) what is the performance loss when it does not. We empirically found that the accuracy of the estimators for selecting the best solution is low, and the overall forecasting performance loss associated with the model selection process ranges from 1.2% to 2.3%. We also discovered that some factors, such as the sample size, are important in the relative performance of the estimators.  ( 2 min )
    Posterior Consistency for Bayesian Relevance Vector Machines. (arXiv:2202.05422v1 [stat.ML])
    Statistical modeling and inference problems with sample sizes substantially smaller than the number of available covariates are challenging. Chakraborty et al. (2012) did a full hierarchical Bayesian analysis of nonlinear regression in such situations using relevance vector machines based on reproducing kernel Hilbert space (RKHS). But they did not provide any theoretical properties associated with their procedure. The present paper revisits their problem, introduces a new class of global-local priors different from theirs, and provides results on posterior consistency as well as posterior contraction rates  ( 2 min )
    Pairwise Fairness for Ordinal Regression. (arXiv:2105.03153v2 [stat.ML] UPDATED)
    We initiate the study of fairness for ordinal regression. We adapt two fairness notions previously considered in fair ranking and propose a strategy for training a predictor that is approximately fair according to either notion. Our predictor has the form of a threshold model, composed of a scoring function and a set of thresholds, and our strategy is based on a reduction to fair binary classification for learning the scoring function and local search for choosing the thresholds. We provide generalization guarantees on the error and fairness violation of our predictor, and we illustrate the effectiveness of our approach in extensive experiments.  ( 2 min )

  • Open

    [D] What would a future Deep Learning theory look like?
    I have seen that many people here share my feeling that deep learning research is often just throwing as many different architectures and hyperparameters as possible and publishing the best output of it. The fact that most people doesnt even report statistical tests made on repetitions of these experiments seem to confirm that. Many seem to suggest that the field is simply too recent, so there was simply no time for a reasonable theory that explains with scientific validity why something works that way. I have an impression that if I decided to apply DL to a completely different field there is simply nothing that I could learn that would give me an intuition if applying X technique with Y hyperparameters would work until I perform that experiment. But what I wonder is, will there ever be such a theory? What if neural networks are simply statistical monsters with very few pockets of causality that can be found and understood by humans? This is actually my impression, but I am just a noobie msc student that knows too little. And if we are going to have such a theory, how would it look like? which kind of answers would it likely answer? Is there a trail of research that you believe is promising in this sense? submitted by /u/TheManveru [link] [comments]  ( 1 min )
    [D] MachineLearning Extrapolation / Prediction
    Hello there, Please correct me if I am wrong. ML is good at intrapolating. It can work in a specific area of available datapoints. If I move out of that area of datapoints that I used to train my ML algorithm I have to extrapolate and get bad results. E.g. dataset of engine degradation over one year. But now my system runs longer. Trained ML has only one year of data available. Is there something I could do to extrapolate? Any info, resources and experiences would help. submitted by /u/HumbleLegacy [link] [comments]  ( 1 min )
    [P] Mappable - Text to API in 30 Min.
    ​ Hello! Mappable (https://mappable.ai/) is a new company/project aimed at making it much easier to go from raw text to an API which can predict things. It is based on two core ideas: ​ Bulk Labelling - Often, your data has some level of redundancy and labeling examples one at a time is inefficient. Mappable lets you annotate on a UMAP graph, so you can label hundreds of datapoints very quickly. Search and Recommendation - When you are creating a text classification dataset, typically you have an idea of what key words and phrases are relevant. Searching your dataset to find and label them makes sense, and provides an intuitive interface for less technical folks to help you do it ("programmatic" labelling also does this, but is less accessible to those without programming knowledge). Mappable also has two interesting features: Open Data Collaboration - If your dataset is public, anyone can help you annotate it! Non-owners' annotations will be surfaced in the UI for you to either accept or reject in bulk. Automatic Model APIs - Mappable provides a model training loop which uses the data annotated in a project. When these models are finished training, they are deployed automatically and are available via API. Currently whilst Mappable is in Beta, this functionality is free, but it will always be free (up to some API limit/data size) for public projects. Eventually there will be a tiered pricing level based on API usage and/or the number of private projects/collaborators. Here is an example dataset of survey responses: https://mappable.ai/5GyAj6CKnk2E8x67SjgcpF I'm always interested in feedback from potential users - if this is you, please join this discord: https://discord.gg/NdEdBNWTx4 submitted by /u/MarkusDeNeutoy [link] [comments]  ( 1 min )
    [D] Problems with interpretability research + blog post about recent NeurIPS paper
    I recently read a NeurIPS paper arguing that neural networks find class information in patterns that are nonsensical to humans. I don't agree with the perspective that the authors present in the paper, and I wrote a blog post about it. Here's a link: https://sprin.xyz/articles/overinterpretation.html The authors of the NeurIPS paper find tiny nonsensical "sufficient input subsets" of images, meaning that when they remove all but 5% of the pixels in an ImageNet image, the classifiers will still predict the original class with high confidence (see Figure 1 in my blog). They claim that within those remaining 5% of pixels, there is sufficient evidence for the classifier to base its prediction. In my article I argue that this is not true. Rather, by following their algorithm to remove pixels, you introduce what are essentially adversarial patterns in the image. I think this highlights a hard-to-solve problem with a lot of interpretability research: when you run an algorithm to interpret a deep learning model, it's really hard to tell if you are actually learning about the deep learning model or if what you see is a result of an artifact of the interpretability algorithm itself. I'm wondering what you all think? submitted by /u/jspr_ml [link] [comments]  ( 1 min )
    [D] Alternatives to clustering to represent hand shapes
    I am looking into representations of body part shapes, in particular a representation of the hand shape. A first step is pose estimation, giving typically 63 values (21 3D joint coordinates). What I want is a representation derived from that, that models the shape of the hands. Preferably I would avoid having to collect tons of data, something that can be derived from those joint coordinates would be perfect. Something like the amount of bending in every finger, but less rule based. I've encountered interesting models like MANO (based on SMPL) and STAR, but I don't think they fit my use case. Of course, a possibility would be to perform clustering of a collection of pose data and looking at the cluster centroids to see if they map to any discernible hand shapes. Alternatively, I could look at PCA of the coordinates. I am sure that there has been someone before me who has tried this stuff, but my literature search has turned up empty (though I've found tons of interesting projects). Anyone who has experience with this? submitted by /u/RaptorDotCpp [link] [comments]  ( 1 min )
    [D] How are you leveraging foundation models?
    Andreessen Horowitz posted a write-up last week on how companies and researchers are using foundation models such as GPT, BERT, CLIP, etc -- and some of the risks/roadblocks. How is everyone here using leveraging foundation models pre-trained on large-scale data? Anything beyond typical fine-tuning going on out there? What are the latest views on model compression/distillation, MOE or other approaches designed to reduce computational requirements? Curious if people think open efforts like EleutherAI will be able to keep up with the model/data scale happening in the big AI labs (where it seems training 150+b param models (even 200+b param models) seems increasingly frequent) submitted by /u/hyperia_ai [link] [comments]  ( 1 min )
    [P] Database for AI: Visualize, version-control & explore image, video and audio datasets
    submitted by /u/davidbun [link] [comments]  ( 3 min )
    [D] Setting up a server for ML/DL research for small group?
    I currently have a server with one Tesla V100S which I would like to use for ML/DL research for a group of researchers and students (no more than 12 people). My experience so far has been limited to local development in my own (Windows) machine. Data in filesystem and that kind of thing, so I'm punching a bit above my weight here. My background is more from the maths side than the CS side, so I've never had to deal with much stuff from below the level of my local python installation (colleagues have roughly the same backgrounds). That being said, I apologize if these are irritatingly simple matters. I'd like for this server to be able to be used for a variety of ML/DL tasks. Not just training models but also having the capability for whatever data visualization/analysis/processing tasks c…  ( 3 min )
    [R] CLIPasso: Semantically-Aware Object Sketching
    submitted by /u/hardmaru [link] [comments]  ( 1 min )
    [R] CfP EvoRL @ GECCO 2022
    CALL FOR PAPERS EvoRL 2022 Evolutionary Reinforcement Learning workshop at GECCO 2022, July 9-13, Boston, USA In recent years reinforcement learning (RL) has received a lot of attention thanks to its performance and ability to address complex tasks. At the same time, multiple recent papers, notably work from OpenAI, have shown that evolution strategies (ES) can be competitive with standard RL algorithms on some problems while being simpler and more scalable. Similar results were obtained by researchers from Uber, this time using a gradient-free genetic algorithm (GA) to train deep neural networks on complex control tasks. Moreover, recent research in the field of evolutionary algorithms (EA) has led to the development of algorithms like Novelty Search and Quality Diversity, capable of…  ( 2 min )
    [R] ChemicalX: A Deep Learning Library for Drug Pair Scoring
    In this paper, we introduce ChemicalX, a PyTorch-based deep learning library designed for providing a range of state of the art models to solve the drug pair scoring task. The primary objective of the library is to make deep drug pair scoring models accessible to machine learning researchers and practitioners in a streamlined framework.The design of ChemicalX reuses existing high level model training utilities, geometric deep learning, and deep chemistry layers from the PyTorch ecosystem. Our system provides neural network layers, custom pair scoring architectures, data loaders, and batch iterators for end users. We showcase these features with example code snippets and case studies to highlight the characteristics of ChemicalX. A range of experiments on real world drug-drug interaction, polypharmacy side effect, and combination synergy prediction tasks demonstrate that the models available in ChemicalX are effective at solving the pair scoring task. Finally, we show that ChemicalX could be used to train and score machine learning models on large drug pair datasets with hundreds of thousands of compounds on commodity hardware. Code: https://github.com/AstraZeneca/chemicalx Paper: https://arxiv.org/abs/2202.05240 submitted by /u/benitorosenberg [link] [comments]  ( 1 min )
    [D] In the dataset with geospatial information (house price prediction, etc.). Is it possible to get the estimate location of expensive area?
    Say I want to predict a house price of one country, after training a model, I noticed that the house price is getting higher when it's closer to one landmark, and lower when it's getting further. Turn out that the center of that area is a big department store, how can I extract such information in the similar fashion automatically? Is it possible to automatically find and create a heatmap of "valuable" area. (and vise versa, "not valuable" area, like having high criminal rate, bad rumors, etc.) Can we get a feature importance of the "location" as a one variable (since x and y is a separate variable, if you use conventional feature importance from Linear regression or Decision Tree, we will get a sperate importance for x and y, but I want to treat them as one variable, can we do that?) (also, please suggest resources or keyword for this kind of problem) Thank you very much. submitted by /u/aviisu [link] [comments]  ( 2 min )
    [R][P] Projected GANs Converge Faster, Pokemon + Hugging Face Spaces Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [News] Now get all official and unofficial code implementations of any AI/ML papers as you're browsing DuckDuckGo, Reddit, Google, Scholar, Arxiv, Twitter and more!
    submitted by /u/fullerhouse570 [link] [comments]  ( 1 min )
  • Open

    Research advances technology of AI assistance for anesthesiologists
    A new deep-learning algorithm trained to optimize doses of propofol to maintain unconsciousness during general anesthesia could augment patient monitoring.  ( 5 min )
  • Open

    IoT in Intelligent Transportation Systems Anchors Smart Traffic for Smart Cities
    Internet of things (IoT) holds an enormous promise in making urban transport systems smarter in terms of safety, energy-saving, ecologically favorable, and efficiency. The efficiency of optimizing transportation in real-time is the key pillar of successful deployment of IoT. Ecosystem to Develop Pivoted on Expanding Use Cases Several developed nations notably Singapore, the U.S., and… Read More »IoT in Intelligent Transportation Systems Anchors Smart Traffic for Smart Cities The post IoT in Intelligent Transportation Systems Anchors Smart Traffic for Smart Cities appeared first on Data Science Central.  ( 2 min )
    Data Management Value Realization Journey Map
    [Editorial Note: Now THIS is data storytelling! – Kurt Cagle] I originally published the Big Data Storymap in January 2013 as a way of communicating the key factors to a successful Big Data and Data Science initiative; to provide a creative view of the “data to value” journey by leveraging the Big Data Business Model… Read More »Data Management Value Realization Journey Map The post Data Management Value Realization Journey Map appeared first on Data Science Central.  ( 5 min )
    Enabling Salesforce Data Backup and Recovery
    Salesforce offers a very comprehensive and flexible model for data security at various levels. Salesforce also has many tools for data sharing and data management, which will not only allow fully secured data access to the users at various levels but also a high level of security. Salesforce also has many native and third-party tools… Read More »Enabling Salesforce Data Backup and Recovery The post Enabling Salesforce Data Backup and Recovery appeared first on Data Science Central.  ( 5 min )
    Kapxy – A Kafka Utility for Topic Lineage
    The Acceldata Engineering team sought a way to identify Kafka Producer-Topic-Consumer relationship metrics, which resulted in us building our own Kafka utility, named Kapxy. The following provides insight into our journey, and it explains how we identified the need for the utility, and how we developed it.  We’ll also demonstrate how it’s being used in… Read More »Kapxy – A Kafka Utility for Topic Lineage The post Kapxy – A Kafka Utility for Topic Lineage appeared first on Data Science Central.  ( 5 min )
    An Overview of Blockchain Technology and its Functionality
    As blockchain innovation continues to grow and become user-friendly, the onus is on every individual who wants to learn this emerging technology to prepare for the future. If one is new to blockchain, then this is the right platform to good foundational knowledge. In this article, one will learn about blockchain technology and the way… Read More »An Overview of Blockchain Technology and its Functionality The post An Overview of Blockchain Technology and its Functionality appeared first on Data Science Central.  ( 4 min )
    Crypto: The New Wild West
    Cryptocurrency scams are costing investors billions. Most are new takes on old con man techniques. Why you shouldn’t invest in crypto. As we move towards the metaverse, the scammers are moving right alongside us. Thanks in part to the lack of formal rules governing trading, scammers are bilking billions of dollars out of unsuspecting investors.… Read More »Crypto: The New Wild West The post Crypto: The New Wild West appeared first on Data Science Central.  ( 4 min )
    Enterprise 5G AI and IoT applications
    Yesterday, in our class, we had a discussion on the significance of 5G applications Its easy to get mixed up about this topic – so here is a very simple way you can think of 5G applications  Essentially, 5G is about low latency and high bandwidth Low latency basically enables many Internet of Things  (IoT)… Read More »Enterprise 5G AI and IoT applications The post Enterprise 5G AI and IoT applications appeared first on Data Science Central.  ( 2 min )
    A Handy Guide for Discerning Conversion Rate Optimization
    Website traffic is a fundamental resource for every digital business. Yes, for obvious reasons – because there is more traffic, the conversion rate is better. If you want to get more and more customers to associate with your venture, you have to ensure enough traffic on your web portal. Along with this, traffic is essential… Read More »A Handy Guide for Discerning Conversion Rate Optimization The post A Handy Guide for Discerning Conversion Rate Optimization appeared first on Data Science Central.  ( 5 min )
  • Open

    Automate a shared bikes and scooters classification model with Amazon SageMaker Autopilot
    Amazon SageMaker Autopilot makes it possible for organizations to quickly build and deploy an end-to-end machine learning (ML) model and inference pipeline with just a few lines of code or even without any code at all with Amazon SageMaker Studio. Autopilot offloads the heavy lifting of configuring infrastructure and the time it takes to build […]  ( 10 min )
  • Open

    Last Week in AI: New way for AI to see in 3D, AI helps translating ancient languages, lawmakers pressure Feds to stop using Clearview AI, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    How to learn resolution in predicate logic math?
    This is not really a homework question so don't bother answering them. It is more of a guidance problem. This is what I find the hardest out of all topics.. Unfortunately, this topic is a fixed 10 marks question in our 80 marks exam. Comes every time. The types of questions that I need to deal with my exams are like this-: john likes all kinds of food. apples are food chicken is food anything anyone eats and isn't killed by is food bill eats peanuts and is still alive. sue eats everything bill eats. prove that john like peanuts using resolution. PS since I failed it once I am even more idk what to say...i didn't get a word, hesitant or scared or afraid to learn this topic. Learning this would help me pass this exam of artificiail intelligence submitted by /u/kalokeshmarelimai [link] [comments]  ( 3 min )
    CheXstray: Real-time Multi-Modal Data Concordance for Drift Detection in Medical Imaging AI
    submitted by /u/doctanonymous [link] [comments]
    Humans and AI: Problem finders and problem solvers
    submitted by /u/bendee983 [link] [comments]
    Alibaba Open-Sources ‘KNAS’: An Automated Machine Learning Algorithm That Evaluates Given Architectures Without Training
    Architecture evaluation is a systematic approach for identifying flaws and dangers in architectural designs. The evaluation process is ideally performed before they are implemented. Typically, neural architecture search (NAS) systems are used for architectural evaluation. Neural architecture search (NAS) is an AutoML branch that aims to find the best deep-learning model architecture for a task. The systems achieve this by finding an architecture that will achieve the best performance metric on the given task dataset and search space of possible architectures. However, this usually necessitates training each proposed model completely on the dataset, which takes a long time. Continue Reading Paper: http://proceedings.mlr.press/v139/xu21m/xu21m.pdf Github: https://github.com/Jingjing-NLP/KNAS submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Images similar reminiscent of my dreams created on Artbreeder (Generative Adversarial Networks)
    submitted by /u/rg1213 [link] [comments]
    MuZero’s first step from research into the real world
    submitted by /u/nick7566 [link] [comments]
    Is the reward enough to achieve Agi?
    Deep Mind says that only reinforcement learning by reward is sufficient to achieve agi, but I think it may not be. Even in the case of human beings, don't you get a salary or performance pay to motivate you? ​ However, this does not mean that people do not necessarily move only with such external rewards. Own beliefs Sometimes intrinsic motivations, such as curiosity, are more important. ​ In the case of Einstein and Farrellman, who proved the Poincare conjecture, I think that intrinsic motivation was much more important. Intellectual curiosity and intrinsic motivation to know the truth without needing money or fame was more important. ​ Wouldn't it be necessary to give artificial intelligence such intrinsic motivations as intellectual curiosity, a sense of achievement, and a sense of joy to become human? ​ Or should this be dismissed as simply the chemical action of dopamine secretion in the brain? But that doesn't mean that all the great geniuses in the history of mankind have achieved such achievements only through external rewards. ​ Is it possible to motivate AI intrinsically? submitted by /u/sunghyunlim [link] [comments]  ( 2 min )
    In the Latest AI Research Google Explains How It Taps the Full Potential of Datacenter Machine Learning Accelerators with Platform Aware Neural Architecture Search (NAS)
    Datacenter accelerators are pieces of hardware that are specifically built to process visual data. It’s a physical device or software program that boosts a computer’s overall performance. Continuous advancements in creating and delivering data center (DC) machine learning (ML) accelerators, such as TPUs and GPUs, have proven crucial for scaling up contemporary ML models and applications. These upgraded accelerators’ ultimate performance (e.g., FLOPs) is orders of magnitude higher than that of standard computing systems. However, there is a rapidly widening gap between the potential peak performance supplied by state-of-the-art hardware and the actual achievable performance when ML models run on these kinds of hardware. Continue Reading Paper: https://openaccess.thecvf.com/content/CVPR2021/papers/Li\_Searching\_for\_Fast\_Model\_Families\_on\_Datacenter\_Accelerators\_CVPR\_2021\_paper.pdf submitted by /u/ai-lover [link] [comments]  ( 1 min )
  • Open

    Do you guys think HER (Hindsight experience buffer) is helpful for dense reward functions?
    submitted by /u/Rich_Beautiful [link] [comments]  ( 1 min )
    "Online Decision Transformer", Zheng et al 2022 {FB}
    submitted by /u/gwern [link] [comments]
    [Q] - Any advice for a deep RL internship interview !
    Hi, I possess an interview with a power utility Company (that generates and distributes electricity). The position is about using Deep RL to forecast ‏‏‎ the load. I have some experience with Deep RL for controlling robots but I am not sure how to use RL to predict the electricity load/consumption. Do you please have any advice? ​ Thanks submitted by /u/goldenkoala499 [link] [comments]  ( 2 min )
    CfP EvoRL @ GECCO 2022
    CALL FOR PAPERS EvoRL 2022 Evolutionary Reinforcement Learning workshop at GECCO 2022, July 9-13, Boston, USA In recent years reinforcement learning (RL) has received a lot of attention thanks to its performance and ability to address complex tasks. At the same time, multiple recent papers, notably work from OpenAI, have shown that evolution strategies (ES) can be competitive with standard RL algorithms on some problems while being simpler and more scalable. Similar results were obtained by researchers from Uber, this time using a gradient-free genetic algorithm (GA) to train deep neural networks on complex control tasks. Moreover, recent research in the field of evolutionary algorithms (EA) has led to the development of algorithms like Novelty Search and Quality Diversity, capable of…  ( 2 min )
    Where to get the current SOTA results for gym tasks?
    I am currently trying to see if my SAC algo works, by running it on the vanilla Gym's MountainCarContinous-v1 environment. Where can I find the state of the art sample complexity / results by the best algorithm for this task or any of the standard gym task in general? submitted by /u/goldfishjy [link] [comments]  ( 1 min )
    when does over-estimation occur?
    Hi, I know 2 reasons about when over-estimation of Q function occur. max operation using Noisy Q network Updating network over and over with same data Do you know more reasons about the occurrence of over-estimation? Very Thanks. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
    Is this field worth it anymore?
    I am a student of ML but am relatively new to RL. I was thinking of getting into this field but am not so sure . According to this popular post, in RL observations are free (so you can sample millions of them) and reward information is always available (for all actions). This view also relates to Moravec's paradox, the observation that computers generally find it very hard to do things that humans find easy to do (such as perceive a visual scene and react to it) and find it very easy to do things that humans find very hard to do (such as multiply two 3 digit numbers). On this view, the reason deep RL is a dead end is, as the above post says, it starts by knowing nothing, and learns to solve a complicated problem from scratch using millions of observations. In contrast, a child learns to solve the problem using a couple of minutes of adjustments, because he is able to relate it to other problems he already knows how to solve. So, the RL view of behavior as being shaped by rewards is fundamentally limited, because it ignores the role of past experiences in creating state representations. However, since Moravec's paradox was enunciated in the 1990s, there is considerable improvement in artificial systems' ability to solve more generalized human-like tasks, e.g. speech recognition, face recognition etc. now than Moravec would ever have considered possible. So the validity of Moravec's paradox is up for debate. What is it's implication on the design of RL algorithms? Also, what are your views on the above quora post? Any help, whether for or against, would be appreciated. Thanks. submitted by /u/_7th_Element_ [link] [comments]  ( 5 min )
    MCTS
    How are Monte Carlo Tree searches integrated with RL? They seem to be their own self learning thing and I can’t find any understandable articles of doing this. I know that AlphaZero used neural networks along with them but the original paper is beyond my level of understanding submitted by /u/Bobingstern [link] [comments]  ( 1 min )
    Is the reward enough to achieve Agi?
    Deep Mind says that only reinforcement learning by reward is sufficient to achieve agi, but I think it may not be. Even in the case of human beings, don't you get a salary or performance pay to motivate you? ​ However, this does not mean that people do not necessarily move only with such external rewards. Own beliefs Sometimes intrinsic motivations, such as curiosity, are more important. ​ In the case of Einstein and Farrellman, who proved the Poincare conjecture, I think that intrinsic motivation was much more important. Intellectual curiosity and intrinsic motivation to know the truth without needing money or fame was more important. ​ Wouldn't it be necessary to give artificial intelligence such intrinsic motivations as intellectual curiosity, a sense of achievement, and a sense of joy to become human? ​ Or should this be dismissed as simply the chemical action of dopamine secretion in the brain? But that doesn't mean that all the great geniuses in the history of mankind have achieved such achievements only through external rewards. ​ Is it possible to motivate AI intrinsically? submitted by /u/sunghyunlim [link] [comments]  ( 2 min )
    Application-specific question about hyperparameter optimization for agent
    Let's say an RL agent is deployed(statically; no exploration or learning) on a real dynamic system. From time to time, due to the non-stationarity of the system, the agent is being trained on a data-driven model of the real dynamic system(assume the real system to be a physics-based simulator for now). During agent training, the data-driven model is simulated over a period of N days into the future (implying the data-driven model is forecasting state transitions in the future where it has not been obviously trained). The issue is, as might be obvious from the description, that while the agent is trying to learn how to behave optimally in the future, there is an amount of uncertainty w.r.t. how accurate the data-driven model is describing the future state transitions. This is because the da…  ( 2 min )
  • Open

    New Levels Unlocked: Africa’s Game Developers Reach Toward the Next Generation
    Looking for a challenge? Try maneuvering a Kenyan minibus through traffic or dropping seed balls on deforested landscapes. Or download Africa’s Legends and battle through fiendishly difficult puzzles with Ghana’s Ananse or Nigeria’s Oya by your side. Games like these are connecting with a hyper-connected African youth population that’s growing fast. Africa is the youngest Read article > The post New Levels Unlocked: Africa’s Game Developers Reach Toward the Next Generation  appeared first on The Official NVIDIA Blog.  ( 4 min )
  • Open

    Backpropagation in Convnets
    Hello, This is a short tutorial paper I wrote explaining how backpropagation is performed in a convolutional neural network, especially through the convolution and max pooling operations. It would be great to have any feedback on the same! Thanks! submitted by /u/codeandfire [link] [comments]
  • Open

    Best Arm Identification with a Fixed Budget under a Small Gap. (arXiv:2201.04469v4 [stat.ML] UPDATED)
    We consider the fixed-budget best arm identification problem in the multi-armed bandit problem. One of the main interests in this field is to derive a tight lower bound on the probability of misidentifying the best arm and to develop a strategy whose performance guarantee matches the lower bound. However, it has long been an open problem when the optimal allocation ratio of arm draws is unknown. In this paper, we provide an answer for this problem under which the gap between the expected rewards is small. First, we derive a tight problem-dependent lower bound, which characterizes the optimal allocation ratio that depends on the gap of the expected rewards and the Fisher information of the bandit model. Then, we propose the "RS-AIPW" strategy, which consists of the randomized sampling (RS) rule using the estimated optimal allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed strategy is optimal in the sense that the performance guarantee achieves the derived lower bound under a small gap. In the course of the analysis, we present a novel large deviation bound for martingales.  ( 2 min )
    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. (arXiv:2201.04234v2 [cs.LG] UPDATED)
    Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works.  ( 2 min )
  • Open

    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. (arXiv:2201.04234v2 [cs.LG] UPDATED)
    Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works.  ( 2 min )
    Best Arm Identification with a Fixed Budget under a Small Gap. (arXiv:2201.04469v4 [stat.ML] UPDATED)
    We consider the fixed-budget best arm identification problem in the multi-armed bandit problem. One of the main interests in this field is to derive a tight lower bound on the probability of misidentifying the best arm and to develop a strategy whose performance guarantee matches the lower bound. However, it has long been an open problem when the optimal allocation ratio of arm draws is unknown. In this paper, we provide an answer for this problem under which the gap between the expected rewards is small. First, we derive a tight problem-dependent lower bound, which characterizes the optimal allocation ratio that depends on the gap of the expected rewards and the Fisher information of the bandit model. Then, we propose the "RS-AIPW" strategy, which consists of the randomized sampling (RS) rule using the estimated optimal allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed strategy is optimal in the sense that the performance guarantee achieves the derived lower bound under a small gap. In the course of the analysis, we present a novel large deviation bound for martingales.  ( 2 min )

  • Open

    [P] New "Machine Learning Simplified" book
    submitted by /u/5x12 [link] [comments]  ( 1 min )
    [P] 20K+ Arxiv ML Papers Vectorised, Cluster Application and Projector
    In recent years, the number of research papers have grown tremendously. New areas are popping up everyday but it is not exactly clear which areas are emerging or which interesting new area has just surfaced up. As a result, I decided to cluster together 20k+ interesting machine learning papers that were recently surfaced up (scraped sometime in November-December in 2020) to see which fields could be discovered I may not already know about. You can check out the applications here: Cluster Application ​ Preview of cluster application showing clear reinforcement learning and multi-task learning clusters Preview of cluster application showing clear reinforcement learning and m Research2Vec Projector Projector Visualisation Method I created the vectors using a fine-tuned version of Sen…  ( 2 min )
    [D] What is the latest consensus regarding small batch vs large batch?
    I search up on internet including this subreddit, a find many contradictory opinions. What I do now to start with a small batch size (e.g. 8), when the early stop policy trigger, I double the batch size and then train again. Repeat until the batch size overflow the GPU memory. My intuition is because small batch size give a noisier gradient, it allows the model to "explore" and then as the model become more mature, the larger batch size would "stablise" the model. Is my intuition wrong? Is there a defacto way to choice batch size in 2022? submitted by /u/RavlaAlvar [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 131
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 2 Week 12 Week 22 Week 32 Week 42 Week 52 Week 62 Week 72 Week 82 Week 92 Week 102 Week 112 Week 122 Week 3 Week 13 Week 23 Week 33 Week 43 Week 53 Week 63 Week 73 Week 83 Week 93 Week 103 Week 113 Week 123 Week 4 Week 14 Week 24 Week 34 Week 44 Week 54 Week 64 Week 74 Week 84 Week 94 Week 104 Week 114 Week 124 Week 5 Week 15 Week 25 Week 35 Week 45 Week 55 Week 65 Week 75 Week 85 Week 95 Week 105 Week 115 Week 125 Week 6 Week 16 Week 26 Week 36 Week 46 Week 56 Week 66 Week 76 Week 86 Week 96 Week 106 Week 116 Week 126 Week 7 Week 17 Week 27 Week 37 Week 47 Week 57 Week 67 Week 77 Week 87 Week 97 Week 107 Week 117 Week 127 Week 8 Week 18 Week 28 Week 38 Week 48 Week 58 Week 68 Week 78 Week 88 Week 98 Week 108 Week 118 Week 128 Week 9 Week 19 Week 29 Week 39 Week 49 Week 59 Week 69 Week 79 Week 89 Week 99 Week 109 Week 119 Week 129 Week 10 Week 20 Week 30 Week 40 Week 50 Week 60 Week 70 Week 80 Week 90 Week 100 Week 110 Week 120 Week 130 Most upvoted papers two weeks ago: /u/ConditionalDew: https://machinelearning.apple.com/ /u/ArminBazzaa: Patches are all you need /u/themiro: The Annotated S4 Besides that, there are no rules, have fun. submitted by /u/ML_WAYR_bot [link] [comments]  ( 1 min )
    [P] Stylegan Vintage-Style Portraits
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 2 min )
    [D] Which floating-point format do you use most often for model training or inference work?
    View Poll submitted by /u/DevGamerLB [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [D] Are there ways to get GPU computing powers for my own research?
    I am not affiliated with any universities and located in Russia. I've been working on single image depth estimation using visual transformers, point of my interest is merging dataset strategy(using multiple dataset for training), on this subject there has been some interesting papers. Second, and main point of my interest is using synthetic data for training. ​ I have collected reasonable starting amount of such data(depth and color data, around 300 gb), planning to grow dataset further to 4 tb. The problem is training setup, currently I have none. Transformers require large amount of memory to fit in gpu(24 gb ideal), and for reasonable training time at least 2-3 such GPUs. I've looked at pricings, with bloated gpu prices I have no money to create needed rig, neither aws and google tpu cloud are options(both are starting from 2k$ per month). The purpose of this post is to ask if I have any other viable options that I dont know about? submitted by /u/InfiniteLife2 [link] [comments]  ( 2 min )
    [D] What are some good experiments to run a Federated Learning model that does Regression/Classification?
    *run on (sorry for typo) So I am looking to take some observations using the aforementioned model. So far I have graphed how federated accuracy changes with varying the number of client devices, number of data points, and number of epochs of communication. Additionally, I have also dabbled a bit in differential privacy and tweaked the value of epsilon to see how it affects accuracy and convergence in a federated classification setting. Now I am given an image dataset (CIFAR-10) as well as some linear regression datasets. I don't want to repeat exactly the same experiments as I mentioned above. So is there anything else I can attempt? submitted by /u/mavericknathan1 [link] [comments]  ( 1 min )
    [P] C++ Machine Learning Library Built From Scratch by a 16-Year-Old High Schooler
    Hello r/MachineLearning! In this post, I will be explaining why I decided to create a machine learning library in C++ from scratch. If you are interested in taking a closer look at it, the GitHub repository is available here: https://github.com/novak-99/MLPP. To give some background, the library is over 13.0K lines of code and incorporates topics from statistics, linear algebra, numerical analysis, and of course, machine learning and deep learning. I have started working on the library since I was 15. Quite honestly, the main reason why I started this work is simply because C++ is my language of choice. The language is efficient and is good for fast execution. When I began looking over the implementations of various machine learning algorithms, I noticed that most, if not all of the imp…  ( 6 min )
    [R] MIT Computational Sensorimotor Learning (CSL) Seminar
    Hi, r/MachineLearning, r/robotics, r/reinforcementlearning, we are running CSL seminar at MIT. The CSL seminar meets once every few weeks and its topics span RL, robotics, deep learning, and related fields. For each topic, we try to invite the best experts as guest speakers (UPenn/Stanford/Columbia/Google/CMU/etc). The seminar time is Mondays from 11:00-12:00 ET. All the talks will be live-streamed on YouTube. Click here to subscribe to our seminar calendar, where you can find the time and the live youtube link for each talk. You can also subscribe to our YouTube channel for the recorded talks from CSL Seminar. Our first talk will be given by Dinesh Jayaraman from the University of Pennsylvania from 11am to 12pm (ET) on Feb 14th. Dinesh will talk about Scaling Visual RL for Robotics. The talk will be live-streamed on Youtube (https://youtu.be/Yucd3px5Og4). submitted by /u/taochenshh [link] [comments]  ( 1 min )
    [D] Progress in neural code generation.
    I wanted to get the opinion of people in this subreddit about neural code generation. What do you think is the most exciting paper in neural code generation right now? Does it really have any promise now or in the future? submitted by /u/frankhart98 [link] [comments]  ( 1 min )
    [P] Building towards GPT-3 of computer vision
    Hey all! We're building towards a GPT3 level moment in computer vision, and here's our v0 - https://youtu.be/P7zcc8iZ0YA This v0 runs on 13B parameters, with 18B and 34B model iterations coming in the pipeline. Access to the model is gated as of now to help us monitor scale, you can sign up at - https://banana-dev.typeform.com/carrot submitted by /u/immune_star [link] [comments]  ( 1 min )
  • Open

    Can AI help me write my science journal manuscript?
    I saw an ad for hyperwrite which apparently helps someone write blog and content for regular media. Is there an AI that i could feed, say, 30 articles from Nature of Neuroscience and JACS, and then have it help me hammer out a dissertation or manuscript? submitted by /u/Low_Ad_6357 [link] [comments]
    AI for generating stripes based off an existing complex pattern
    Sorry if this isn't the right place to ask this and also for the wall of text. So I have an admittedly very niche request. Here's the context: If you have ever heard of doctor who, you may know of the fourth doctor and his extremely long scarf. I'm a self described "expert" on the various scarves that were worn. I'm focusing on the first one here, that debuted in season 12 of classic who. One of the things that is often said about the origin of the scarf is that it was shortened before recording Tom Baker's first season. The costume designer Jim Acheson and Tom himself have both said that it was roughly 20 feet long when it was first given to them by Begonia Pope, the person Jim hired to knit the scarf. In the first episode, you can clearly see the scarf is no more than 12 feet long. …  ( 2 min )
    Is it possible to create a smart AI to run an exploit or payload?
    Hello everyone, I think we are all familiar with google home & alexa and Siri (which i have set up with lights, thermostats, video, chromecast, etc etc) I never realized how AI could be so useful, everyone in my home Deff takes it for granted though. No one touches a light switch and doesn't have to walk downstairs in the middle of the night to do almost anything. all it takes is a verbal command ( i have spoiled them ) My main questions are, Is it possible to create a AI system from ground-up to execute payloads and exploits written in the format that is designed to use. I would like an AI (like how JARVIS is in iron man or in video games ) I want basic things such as, Okay AI execute exploit 3, responding to me am i sure i want to do that (lets say exploit 3 destroys every piece of information available to all connected systems) and in an emergency i'd just want to say "Okay AI Execute Exploit 3, talks back, i accept the terms and conditions, and it progressively deletes ALL INFO and all my code. wipes everything and sensitive information back to stock or nothing left at all." That's honestly a safety feature i want and to hook it up to my computers uniformly. Is this possible? Too expensive? Or science fiction. Thanks.. I'd start with basic Python code. Is any of this possible to a plausible degree? submitted by /u/RedditAdmin25 [link] [comments]  ( 1 min )
    A new AI Institute is being formed - to help save humanity - and co-evolve humanity and AI
    Sorry we don't have a site yet. Some of my friends and I are putting together a video driven AI organization (a for-profit) - that is actively developing some cutting edge AI, Intelligent Agent, and NLP work, basically implementing all our NLP and AI tools to better humanity and society. My co-founder is a video editing genius, who has produced incredible commercials for Target, Best Buy, and Prince, you name it - he's been around for 30 years telling stories via video. We are working on telling the story of the co-evolution of AI and society, and how with the right direction, we can arrive at a future that is 'good'. and less dystopian. We're looking for some gurus, Masters/PhD students who want their work to get highlighted. Help with crowdfunding, digital marketing, and most of all, h…  ( 2 min )
    Can A Machine Be Truly Concious [With Dr Mark Solms]
    submitted by /u/JHAMBFP [link] [comments]
    Swimbots are colorful wiggly creatures with two goals in their (artificial) life: To consume energy and mate/mutate with other Swimbots...
    submitted by /u/cantonbecker [link] [comments]
    Instagram bots
    Any idea on how to automatice instagrams dms to send to people for a certain amount of followers? Coding with Python or whatever thanks submitted by /u/Kaytor_ [link] [comments]
    AI bias harms over a third of businesses, 81% want more regulation
    submitted by /u/AnnalieseWhorton [link] [comments]
    Can we share books, stories, games, etc? I've been wanting to share a great story.
    An Artificial Intelligence awakens on Earth to find the planet decimated by a malevolent alien race. With all life eradicated, the AI builds itself a body built for the stars...and for revenge. It's called CHRYSALIS and it's by DUST. It's a narrated story. ​ Anyone else have any hidden gems out there related to Artificial Intelligence? submitted by /u/SlowCrates [link] [comments]  ( 1 min )
    Artificial Nightmares : Skull Storm || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]  ( 1 min )
    Building a Complete OCR Engine From Scratch In Python
    submitted by /u/VikasOjha666 [link] [comments]
    Scale to human level
    half of human's cortex = 8 billions neurons number of neuron's type = 1000 -> 1 population = 8 millions neurons number of neuron's patche = 1000 (neuron's patches are patches on dentrite and soma where synapses converge to represent a feature) ​ Full connections between 2 neuron's types = 8m * 8m * 1000 (patches) = 64 quadrillions parameters For 1000 neuron's types = 64 quintillions parameters Sparse modele like Sparse Mixture of Experts limit the generalization because the network can't make all usefull connections. And a recent paper show that routing helps until a certain size (1 trillions parameters). ​ So it's not possible with our technologie. IMHO, the best solution, at least in this decade, is to do what the nature do, grow the neural network (fire together wire together). It's strange that there is not more paper on growing network. submitted by /u/christ1666 [link] [comments]  ( 1 min )
    using AI to language learning
    Who are some world-leading experts in using ML/AI technology for education, especially with language learning (Intelligent Computer assisted language learning, ICALL)? submitted by /u/Beginner2Bishop [link] [comments]
    Researchers Introduce ‘Alpa’: A Compiler System for Distributed Deep Learning, Built on Top of Machine Learning Parallelization Approaches, That Automatically Generates Parallel Execution Plans Covering All Data, Operator, and Pipeline Parallelisms
    Training extremely massive deep learning (DL) models on clusters of high-performance accelerators necessitate significant engineering effort for model definition and training cluster environment specifications and typically necessitates tuning a complex combination of data, operator, and pipeline parallelization approach for individual network operators. Automating the parallelization of large-scale models could speed up DL development and deployment, but due to the intricate structures involved, it remains a difficult undertaking. To address this problem, a group of researchers from UC Berkeley, Amazon Web Services, Google, Shanghai Jiao Tong University, and Duke University proposed Alpa, a compiler system for distributed DL on GPU clusters that can automatically generate parallelization plans that match or outperform hand-tuned model-parallel training systems, even on the models for which they were designed. Continue Reading Paper: https://arxiv.org/pdf/2201.12023.pdf https://preview.redd.it/7k1vcv5icjh81.png?width=1486&format=png&auto=webp&s=0ace8db68dfa01b2c0067d1700f45adf389e8287 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    So I made some art and processed it with Google's Cloud Vision API. The AI appears to give me contact info for an AI expert. Proof of Sentience? Glitch? Original file in comments.
    submitted by /u/throwR8 [link] [comments]  ( 1 min )
    Writing an episode of Star Wars using A.I
    submitted by /u/tomjedi9 [link] [comments]
  • Open

    Why JSON Users Should Learn Turtle
    The Semantic Web has garnered a reputation for complexity among both Javascript and Python developers, primarily because, well, it’s not JSON, and JSON has become the data language of the web. Why learn some obscure language when JSON is perfectly capable of describing everything, right? Well, sort of. The problem that JSON faces, is actually… Read More »Why JSON Users Should Learn Turtle The post Why JSON Users Should Learn Turtle appeared first on Data Science Central.  ( 12 min )
    Is old data strategy drowning new data strategy?
    Ted Gioia, author of Music: A Subversive History, made an excellent point in an article in The Atlantic titled “Is Old Music Killing New Music?” posted in January 2022: “I learned the danger of excessive caution long ago, when I consulted for huge Fortune 500 companies. The single biggest problem I encountered—shared by virtually every… Read More »Is old data strategy drowning new data strategy? The post Is old data strategy drowning new data strategy? appeared first on Data Science Central.  ( 5 min )
  • Open

    Does anyone have expierence with machine learning in UE5?
    I'm looking for a reinforcment training library in UE 5 submitted by /u/GermanyDoesntExist [link] [comments]  ( 1 min )
    Disappointing Results in Mujoco
    I recently installed Mujoco, and I decided to run some of the provided models first. In the cloth simulation, I noticed something worrying : the cloth appears to enter through an obstacle for a fraction of a second. You can clearly see it in this screenshot : https://imgur.com/a/Uq4rTlp. As I'm trying to create an environment to train a highly dynamic robot, should I use another simulator or is this nothing to worry about? submitted by /u/SirFlamenco [link] [comments]  ( 1 min )
  • Open

    How does the way neural networks store organized data differ from that of the human brain?
    I'm not certain, but I'm under the impression that artificial neural networks store data in organized structures that are easily accessible for input purposes. However, the human brain is made entirely out of human neural networks. I'm not certain whether the structures are entirely analogous, or if ANN's are just called that for illustration purposes. Maybe they are, however, intended to be analogs. My question is this: How does the brain both interpret and store information in the form of neural networks? Isn't the main purpose of an artificial neural network to take some form of data and process it, and then reorganize its interpretations into decisions or products? Do the portions of the human brain which store data possess a different organizational makeup than the portions of the brain which process it? submitted by /u/ToastedUranium [link] [comments]  ( 1 min )
    NN AI playing Flappy Bird using the game screen
    submitted by /u/hotcodist [link] [comments]

  • Open

    How AI is Changing Chemical Discovery
    submitted by /u/regalalgorithm [link] [comments]
    What do you think of this course?
    https://www.udacity.com/course/ai-programming-python-nanodegree--nd089 It would be great to know what you think of it you've taken the course. But you can also check out the syllabus. Thanks submitted by /u/Otis43 [link] [comments]
    What is the best way to get into AI?
    So I am very interested in AI, specifically AI Robots and machine learning. So what is the best way to get into this? submitted by /u/Dan_The_PaniniMan [link] [comments]  ( 1 min )
    In A Latest Computer Vision Research, Researchers Introduce ‘JoJoGAN’: An AI Method With One-Shot Face Stylization
    A style mapper applies a preset style to the photos it receives (for example, taking faces to cartoons). In a recent study, researchers from the University of Illinois at Urbana-Champaign introduce JoJoGAN as a straightforward approach for learning a style mapper from a single sample of the style. The technique allows an inexperienced user, for example, to supply a style sample and then apply that style to their image of choice. The team discusses its method in the context of face photographs because stylizing faces is so appealing to inexperienced users; nonetheless, the concept may be applied to any image. A procedure for learning a style mapper should be simple to use, produce compelling and high-quality results, require only one style reference but accept and benefit from more, allow users to control how much style is transferred, and allow more sophisticated users to control which aspects of the style are transferred in order to be useful. Researchers show that the technique achieves these objectives using both qualitative and quantitative evidence. Continue Reading Paper: https://arxiv.org/pdf/2112.11641.pdf Github: https://github.com/mchong6/JoJoGAN https://i.redd.it/osv7g16exfh81.gif submitted by /u/ai-lover [link] [comments]  ( 1 min )
    fun?
    submitted by /u/SOLIDSNAKETOM [link] [comments]
    Is Partial or Semi-Conscious Artificial General Intelligence Possible?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    From a few images to a 3D model with AI!
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
    The Best Machine Learning Courses (2022)
    submitted by /u/Jan_Prince [link] [comments]
    📢 New Course on TensorFlow and Keras by OpenCV
    submitted by /u/spmallick [link] [comments]
    Best Data Science Courses Datacamp to learn
    DataCamp is an excellent and popular online learning platform to build your skills and future in data science. DataCamp offers courses in data science, Python, SQL, R programming, big data, machine learning, deep learning, and applied finance. So, we have collected the Best Data Science Courses Data Camp that makes it easy for you to know the list. submitted by /u/sivasiriyapureddy [link] [comments]  ( 1 min )
    How stupid/intelligent is youtube at recommending videos ?
    I don't use autoplay much but I'm quite disappointed in youtube's lack of ability to present a better home page to me. I think it mostly digs me deeper into the sort of things I already watched, it practically has no ability to guess what else I might like to watch. Editing: Furthermore, it forgets about the videos you watched before too fast and bases the home page on only very recently watched videos, which further narrows the type of the content. submitted by /u/ounurrs [link] [comments]  ( 3 min )
  • Open

    [D] For vision we have RenNet50 Cifar and ImageNet as baselines. Is there any equivalent model/benchmark for Transformers?
    In my research I have been mainly using convnets on vision tasks. ResNet50 (+ Imagenet and Cifar datasets) are a work horse and a great baseline for computer vision research. Crucially, it is easy to train them on a single GPU. Is there an equivalent transformer+dataset combo that can be trained on a single GPU? submitted by /u/sia_rezaei [link] [comments]  ( 1 min )
    [D] Software Engineers for grad labs
    Hello people of Reddit! Me and a few MSc students (all in AI) were wondering if the following business idea is something that would be needed in research labs that use ML (and not only pure ML, but also maybe life sciences and/or computational biology): software engineering services offered to these labs, something along the lines of we get the specs for an algorithm and such, and we implement everything from data pipeline, algorithm, deployment etc., in a maintainable and reproducible way. Since none of us had contact with PhD labs before, we figured this place is as good as any to ask such a question. Thank you all very much for your time! submitted by /u/AlexIsEpic24 [link] [comments]  ( 2 min )
    [D] 3D point cloud smoothing
    For interior of apartment / dwelling units, I have scanned them using LiDAR assisted app on iPad Pro. I have collected the files in most standard formats (PLY , OBJ+ MTL etc). Is there a nice way to smooth the noisy surfaces for preprocessing to give a uniform texture? Any pointers & feedback much appreciated. submitted by /u/mlbloke [link] [comments]  ( 1 min )
    [R] Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild
    submitted by /u/synonymous1964 [link] [comments]  ( 1 min )
    [Project]Object Localization without being given labels
    Hello people! So I have a problem where given a set of images captured in grocery stores(shelf image), and a set of close up images of products in those stores(product image) For every product image, find the location of that product in all shelf images in which it appears. For every shelf image, locate all products and assign the name from given set of product images. neither of them are labelled. I feel I need to create vector that represents each proudct by doing an encoder decoder model and generating latent space representations. however they ALSO want me to locate the product in the shelf, with a bounding box so in a shelf image, if the shelf has 20 products they want eg : shelf_image 32 - product_id_21 (known purely using vector similarity) - bounding box any ideas on how I can do this? Thanks! Im confident of the vector creation part, but not sure how to use that to use a shelf image and create classification AND bounding boxes TIA! product image https://preview.redd.it/rjf3gr3fseh81.jpg?width=2272&format=pjpg&auto=webp&s=8fd8e050672b06d042fca489c9e906806ceee3bd shelf image submitted by /u/Integral_humanist [link] [comments]  ( 1 min )
    [P] Deep Reinforcement Learning algorithm completing Tekken Tag Tournament at highest difficulty level
    submitted by /u/DIAMBRA_AIArena [link] [comments]  ( 3 min )
    [D] How do you help normal users access your Models / Prototypes ?
    As a Data Scientist I build many prototypes in my spare time. There are several ways to pitch or share my work. Could be writing blogs, sharing on social network, publishing the models in a marketplace like huggingface, aws, gcp etc..,. But I feel my work is not reaching actual users who are in need. It lacks UI for them to interact. The other big problem I see is distribution and visibility. There is no marketplace like playstore / iOS appstore for web apps. I would like know what are the ways and also difficulties faced by machine learning developers to build an app/prototype that a user can interact through browser. submitted by /u/akhilgod [link] [comments]  ( 2 min )
  • Open

    A Bayesian approach to pricing
    Suppose you want to determine how to price a product and you initially don’t know what the market is willing to pay. This post outlines some of the things you might think about, and how Bayesian modeling might help. This post is not the final word on the subject, or even my final word on […] A Bayesian approach to pricing first appeared on John D. Cook.  ( 4 min )
  • Open

    Bad performance on test dataset
    Hello! I'm using actor-critic algorithm in finance and on my train dataset in every episode agent improves himself (i just use this to see if he is actually learning) but on test set he performes worser and worser. I'm thinking maybe it is because i'm feeding him with data in same order everytime but does anybody have some other ideas? ​ Thank you for your help. submitted by /u/morzevavagina [link] [comments]  ( 1 min )
    Quick question in regards to "Average-reward formulation" in Policy Gradient
    In the paper "Policy Gradient Method for Reinforcement Learning with Function Approximation" https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf, the authors denoted that there are two ways of formulating the agent's objective which are 1) "average-reward formulation" and 2) "designated start state formulation". I am familiar with the 2), but 1) with Q-function formulation does not seem to be clear to me. Does anyone know where this formulation came from? Or the exact meaning of this formulation? Why is the expected reward per step \rho(\pi) is deducted from the reward? And also, why is there no discounting factor on this Q-function formulation? I could not really find any source related to this formulation. submitted by /u/g6ssgs [link] [comments]  ( 1 min )
    Proof of policy induce in Maximum casual entropy RL
    Hi, I'm curious about how maximizing total reward and causal entropy of policy can lead pi = exp(Q - V), where v = ln(sum(exp Q)). Where can I find an intuitive proof about that. That used in large field in RL I guess like AIRL or RLSP. Very Thanks. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 1 min )
    "On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning", Vischer et al 2021 (BC is easier to learn than RL & prunes better)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "EvoJAX: Hardware-Accelerated Neuroevolution", Tang et al 2022 {G} (PGPE on TPU pods)
    submitted by /u/gwern [link] [comments]
    "Accelerated Quality-Diversity for Robotics through Massive Parallelism", Lim et al 2022 (MAP-Elites on TPU pods)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    📢 New Course on TensorFlow and Keras by OpenCV
    submitted by /u/spmallick [link] [comments]
    Block-NeRF: Scalable Large Scene Neural View Synthesis - Waymo
    submitted by /u/nickb [link] [comments]
  • Open

    Autonomous Cars- The Future of Vehicular Technology
    The world is moving at a faster pace with the adoption of 5G and data-driven technology. Artificial Intelligence is introduced to the world as a friend and humans are ensuring every possible way of handing over the tough tasks to machines. The idea of self-driving vehicles has attracted significant attention as well as investments worldwide.… Read More »Autonomous Cars- The Future of Vehicular Technology The post Autonomous Cars- The Future of Vehicular Technology appeared first on Data Science Central.  ( 2 min )
  • Open

    Mean Nystr\"om Embeddings for Adaptive Compressive Learning. (arXiv:2110.10996v2 [stat.ML] UPDATED)
    Compressive learning is an approach to efficient large scale learning based on sketching an entire dataset to a single mean embedding (the sketch), i.e. a vector of generalized moments. The learning task is then approximately solved as an inverse problem using an adapted parametric model. Previous works in this context have focused on sketches obtained by averaging random features, that while universal can be poorly adapted to the problem at hand. In this paper, we propose and study the idea of performing sketching based on data-dependent Nystr\"om approximation. From a theoretical perspective we prove that the excess risk can be controlled under a geometric assumption relating the parametric model used to learn from the sketch and the covariance operator associated to the task at hand. Empirically, we show for k-means clustering and Gaussian modeling that for a fixed sketch size, Nystr\"om sketches indeed outperform those built with random features.  ( 2 min )
    Deep Learning Image Recognition for Non-images. (arXiv:2106.14350v2 [cs.LG] UPDATED)
    Powerful deep learning algorithms open an opportunity for solving non-image Machine Learning (ML) problems by transforming these problems to into the image recognition problems. The CPC-R algorithm presented in this chapter converts non-image data into images by visualizing non-image data. Then deep learning CNN algorithms solve the learning problems on these images. The design of the CPC-R algorithm allows preserving all high-dimensional information in 2-D images. The use of pair values mapping instead of single value mapping used in the alternative approaches allows encoding each n-D point with 2 times fewer visual elements. The attributes of an n-D point are divided into pairs of its values and each pair is visualized as 2-D points in the same 2-D Cartesian coordinates. Next, grey scale or color intensity values are assigned to each pair to encode the order of pairs. This is resulted in the heatmap image. The computational experiments with CPC-R are conducted for different CNN architectures, and methods to optimize the CPC-R images showing that the combined CPC-R and deep learning CNN algorithms are able to solve non-image ML problems reaching high accuracy on the benchmark datasets. This chapter expands our prior work by adding more experiments to test accuracy of classification, exploring saliency and informativeness of discovered features to test their interpretability, and generalizing the approach.  ( 2 min )
    Disentanglement Analysis with Partial Information Decomposition. (arXiv:2108.13753v2 [stat.ML] UPDATED)
    We propose a framework to analyze how multivariate representations disentangle ground-truth generative factors. A quantitative analysis of disentanglement has been based on metrics designed to compare how one variable explains each generative factor. Current metrics, however, may fail to detect entanglement that involves more than two variables, e.g., representations that duplicate and rotate generative factors in high dimensional spaces. In this work, we establish a framework to analyze information sharing in a multivariate representation with Partial Information Decomposition and propose a new disentanglement metric. This framework enables us to understand disentanglement in terms of uniqueness, redundancy, and synergy. We develop an experimental protocol to assess how increasingly entangled representations are evaluated with each metric and confirm that the proposed metric correctly responds to entanglement. Through experiments on variational autoencoders, we find that models with similar disentanglement scores have a variety of characteristics in entanglement, for each of which a distinct strategy may be required to obtain a disentangled representation.  ( 2 min )
    Model-Value Inconsistency as a Signal for Epistemic Uncertainty. (arXiv:2112.04153v2 [cs.LG] UPDATED)
    Using a model of the environment and a value function, an agent can construct many estimates of a state's value, by unrolling the model for different lengths and bootstrapping with its value function. Our key insight is that one can treat this set of value estimates as a type of ensemble, which we call an \emph{implicit value ensemble} (IVE). Consequently, the discrepancy between these estimates can be used as a proxy for the agent's epistemic uncertainty; we term this signal \emph{model-value inconsistency} or \emph{self-inconsistency} for short. Unlike prior work which estimates uncertainty by training an ensemble of many models and/or value functions, this approach requires only the single model and value function which are already being learned in most model-based reinforcement learning algorithms. We provide empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful (i) as a signal for exploration, (ii) for acting safely under distribution shifts, and (iii) for robustifying value-based planning with a learned model.  ( 2 min )
    Learning in Restless Bandits under Exogenous Global Markov Process. (arXiv:2112.09484v2 [cs.LG] UPDATED)
    We consider an extension to the restless multi-armed bandit (RMAB) problem with unknown arm dynamics, where an unknown exogenous global Markov process governs the rewards distribution of each arm. Under each global state, the rewards process of each arm evolves according to an unknown Markovian rule, which is non-identical among different arms. At each time, a player chooses an arm out of $N$ arms to play, and receives a random reward from a finite set of reward states. The arms are restless, that is, their local state evolves regardless of the player's actions. Motivated by recent studies on related RMAB settings, the regret is defined as the reward loss with respect to a player that knows the dynamics of the problem, and plays at each time $t$ the arm that maximizes the expected immediate value. The objective is to develop an arm-selection policy that minimizes the regret. To that end, we develop the Learning under Exogenous Markov Process (LEMP) algorithm. We analyze LEMP theoretically and establish a finite-sample bound on the regret. We show that LEMP achieves a logarithmic regret order with time. We further analyze LEMP numerically and present simulation results that support the theoretical findings and demonstrate that LEMP significantly outperforms alternative algorithms.  ( 2 min )
    A Functional Operator for Model Uncertainty Quantification in the RKHS. (arXiv:2109.10888v3 [cs.LG] UPDATED)
    We propose a framework for single-shot predictive uncertainty quantification of a neural network that replaces the conventional Bayesian notion of weight probability density function (PDF) with a functional defined on the model weights in a reproducing kernel Hilbert space (RKHS). The resulting RKHS based analysis yields a potential field based interpretation of the model weight PDF, which allows us to use a perturbation theory based approach in the RKHS to formulate a moment decomposition problem over the model weight-output uncertainty. We show that the extracted moments from this approach automatically decompose the weight PDF around the local neighborhood of the specified model output and determine with great sensitivity the local heterogeneity and anisotropy of the weight PDF around a given model prediction output. Consequently, these functional moments provide much sharper estimates of model predictive uncertainty than the central stochastic moments characterized by Bayesian and ensemble methods. We demonstrate this experimentally by evaluating the error detection capability of the model uncertainty quantification methods on test data that has undergone a covariate shift away from the training PDF learned by the model. We find our proposed measure for uncertainty quantification to be significantly more precise and better calibrated than baseline methods on various benchmark datasets, while also being much faster to compute.  ( 2 min )
    BEV-SGD: Best Effort Voting SGD for Analog Aggregation Based Federated Learning against Byzantine Attackers. (arXiv:2110.09660v2 [cs.LG] UPDATED)
    As a promising distributed learning technology, analog aggregation based federated learning over the air (FLOA) provides high communication efficiency and privacy provisioning under the edge computing paradigm. When all edge devices (workers) simultaneously upload their local updates to the parameter server (PS) through commonly shared time-frequency resources, the PS obtains the averaged update only rather than the individual local ones. While such a concurrent transmission and aggregation scheme reduces the latency and communication costs, it unfortunately renders FLOA vulnerable to Byzantine attacks. Aiming at Byzantine-resilient FLOA, this paper starts from analyzing the channel inversion (CI) mechanism that is widely used for power control in FLOA. Our theoretical analysis indicates that although CI can achieve good learning performance in the benign scenarios, it fails to work well with limited defensive capability against Byzantine attacks. Then, we propose a novel scheme called the best effort voting (BEV) power control policy that is integrated with stochastic gradient descent (SGD). Our BEV-SGD enhances the robustness of FLOA to Byzantine attacks, by allowing all the workers to send their local updates at their maximum transmit power. Under worst-case attacks, we derive the expected convergence rates of FLOA with CI and BEV power control policies, respectively. The rate comparison reveals that our BEV-SGD outperforms its counterpart with CI in terms of better convergence behavior, which is verified by experimental simulations.  ( 2 min )
    Lower Bounds for Differentially Private ERM: Unconstrained and Non-Euclidean. (arXiv:2105.13637v2 [cs.LG] UPDATED)
    We consider the lower bounds of differentially private empirical risk minimization (DP-ERM) for convex functions in constrained/unconstrained cases with respect to the general $\ell_p$ norm beyond the $\ell_2$ norm considered by most of the previous works. We provide a simple black-box reduction approach which can generalize lower bounds in constrained case to unconstrained case. For $(\epsilon,\delta)$-DP, we achieve $\Omega(\frac{\sqrt{d \log(1/\delta)}}{\epsilon n})$ lower bounds for both constrained and unconstrained cases and any $\ell_p$ geometry where $p\geq 1$ by introducing a novel biased mean property for fingerprinting codes, where $n$ is the size of the data-set and $d$ is the dimension.  ( 2 min )
    Parallel Successive Learning for Dynamic Distributed Model Training over Heterogeneous Wireless Networks. (arXiv:2202.02947v2 [cs.LG] UPDATED)
    Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices, via iterative local updates (at devices) and global aggregations (at the server). In this paper, we develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions: (i) Network, allowing decentralized cooperation among the devices via device-to-device (D2D) communications. (ii) Heterogeneity, interpreted at three levels: (ii-a) Learning: PSL considers heterogeneous number of stochastic gradient descent iterations with different mini-batch sizes at the devices; (ii-b) Data: PSL presumes a dynamic environment with data arrival and departure, where the distributions of local datasets evolve over time, captured via a new metric for model/concept drift. (ii-c) Device: PSL considers devices with different computation and communication capabilities. (iii) Proximity, where devices have different distances to each other and the access point. PSL considers the realistic scenario where global aggregations are conducted with idle times in-between them for resource efficiency improvements, and incorporates data dispersion and model dispersion with local model condensation into FedL. Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning. We then propose network-aware dynamic model tracking to optimize the model learning vs. resource efficiency tradeoff, which we show is an NP-hard signomial programming problem. We finally solve this problem through proposing a general optimization solver. Our numerical results reveal new findings on the interdependencies between the idle times in-between the global aggregations, model/concept drift, and D2D cooperation configuration.  ( 2 min )
    BLOOM-Net: Blockwise Optimization for Masking Networks Toward Scalable and Efficient Speech Enhancement. (arXiv:2111.09372v2 [eess.AS] UPDATED)
    In this paper, we present a blockwise optimization method for masking-based networks (BLOOM-Net) for training scalable speech enhancement networks. Here, we design our network with a residual learning scheme and train the internal separator blocks sequentially to obtain a scalable masking-based deep neural network for speech enhancement. Its scalability lets it dynamically adjust the run-time complexity depending on the test time environment. To this end, we modularize our models in that they can flexibly accommodate varying needs for enhancement performance and constraints on the resources, incurring minimal memory or training overhead due to the added scalability. Our experiments on speech enhancement demonstrate that the proposed blockwise optimization method achieves the desired scalability with only a slight performance degradation compared to corresponding models trained end-to-end.  ( 2 min )
    SUPA: A Lightweight Diagnostic Simulator for Machine Learning in Particle Physics. (arXiv:2202.05012v1 [physics.data-an])
    Deep learning methods have gained popularity in high energy physics for fast modeling of particle showers in detectors. Detailed simulation frameworks such as the gold standard Geant4 are computationally intensive, and current deep generative architectures work on discretized, lower resolution versions of the detailed simulation. The development of models that work at higher spatial resolutions is currently hindered by the complexity of the full simulation data, and by the lack of simpler, more interpretable benchmarks. Our contribution is SUPA, the SUrrogate PArticle propagation simulator, an algorithm and software package for generating data by simulating simplified particle propagation, scattering and shower development in matter. The generation is extremely fast and easy to use compared to Geant4, but still exhibits the key characteristics and challenges of the detailed simulation. We support this claim experimentally by showing that performance of generative models on data from our simulator reflects the performance on a dataset generated with Geant4. The proposed simulator generates thousands of particle showers per second on a desktop machine, a speed up of up to 6 orders of magnitudes over Geant4, and stores detailed geometric information about the shower propagation. SUPA provides much greater flexibility for setting initial conditions and defining multiple benchmarks for the development of models. Moreover, interpreting particle showers as point clouds creates a connection to geometric machine learning and provides challenging and fundamentally new datasets for the field. The code for SUPA is available at https://github.com/itsdaniele/SUPA.  ( 2 min )
    Training Stable Graph Neural Networks Through Constrained Learning. (arXiv:2110.03576v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNN) rely on graph convolutions to learn features from network data. GNNs are stable to different types of perturbations of the underlying graph, a property that they inherit from graph filters. In this paper we leverage the stability property of GNNs as a typing point in order to seek for representations that are stable within a distribution. We propose a novel constrained learning approach by imposing a constraint on the stability condition of the GNN within a perturbation of choice. We showcase our framework in real world data, corroborating that we are able to obtain more stable representations while not compromising the overall accuracy of the predictor.  ( 2 min )
    Controlling the Complexity and Lipschitz Constant improves polynomial nets. (arXiv:2202.05068v1 [cs.LG])
    While the class of Polynomial Nets demonstrates comparable performance to neural networks (NN), it currently has neither theoretical generalization characterization nor robustness guarantees. To this end, we derive new complexity bounds for the set of Coupled CP-Decomposition (CCP) and Nested Coupled CP-decomposition (NCP) models of Polynomial Nets in terms of the $\ell_\infty$-operator-norm and the $\ell_2$-operator norm. In addition, we derive bounds on the Lipschitz constant for both models to establish a theoretical certificate for their robustness. The theoretical results enable us to propose a principled regularization scheme that we also evaluate experimentally in six datasets and show that it improves the accuracy as well as the robustness of the models to adversarial perturbations. We showcase how this regularization can be combined with adversarial training, resulting in further improvements.  ( 2 min )
    DeepCENT: Prediction of Censored Event Time via Deep Learning. (arXiv:2202.05155v1 [cs.LG])
    With the rapid advances of deep learning, many computational methods have been developed to analyze nonlinear and complex right censored data via deep learning approaches. However, the majority of the methods focus on predicting survival function or hazard function rather than predicting a single valued time to an event. In this paper, we propose a novel method, DeepCENT, to directly predict the individual time to an event. It utilizes the deep learning framework with an innovative loss function that combines the mean square error and the concordance index. Most importantly, DeepCENT can handle competing risks, where one type of event precludes the other types of events from being observed. The validity and advantage of DeepCENT were evaluated using simulation studies and illustrated with three publicly available cancer data sets.  ( 2 min )
    Information Discrepancy in Strategic Learning. (arXiv:2103.01028v4 [cs.GT] UPDATED)
    We initiate the study of the effects of non-transparency in decision rules on individuals' ability to improve in strategic learning settings. Inspired by real-life settings, such as loan approvals and college admissions, we remove the assumption typically made in the strategic learning literature, that the decision rule is fully known to individuals, and focus instead on settings where it is inaccessible. In their lack of knowledge, individuals try to infer this rule by learning from their peers (e.g., friends and acquaintances who previously applied for a loan), naturally forming groups in the population, each with possibly different type and level of information regarding the decision rule. We show that, in equilibrium, the principal's decision rule optimizing welfare across sub-populations may cause a strong negative externality: the true quality of some of the groups can actually deteriorate. On the positive side, we show that, in many natural cases, optimal improvement can be guaranteed simultaneously for all sub-populations. We further introduce a measure we term information overlap proxy, and demonstrate its usefulness in characterizing the disparity in improvements across sub-populations. Finally, we identify a natural condition under which improvement can be guaranteed for all sub-populations while maintaining high predictive accuracy. We complement our theoretical analysis with experiments on real-world datasets.  ( 2 min )
    Machine Learning and Data Science: Foundations, Concepts, Algorithms, and Tools. (arXiv:2202.05163v1 [cs.LG])
    Today, data is a tool and fuel for businesses to gain important insights and improve their performance. Data science has dominated almost every industry in the world. There is no industry in the world today that does not use data. But who will get this insight? Who processes all the raw data? Everything is done by a data analyst or a data scientist.  ( 2 min )
    Deconstructing The Inductive Biases Of Hamiltonian Neural Networks. (arXiv:2202.04836v1 [cs.LG])
    Physics-inspired neural networks (NNs), such as Hamiltonian or Lagrangian NNs, dramatically outperform other learned dynamics models by leveraging strong inductive biases. These models, however, are challenging to apply to many real world systems, such as those that don't conserve energy or contain contacts, a common setting for robotics and reinforcement learning. In this paper, we examine the inductive biases that make physics-inspired models successful in practice. We show that, contrary to conventional wisdom, the improved generalization of HNNs is the result of modeling acceleration directly and avoiding artificial complexity from the coordinate system, rather than symplectic structure or energy conservation. We show that by relaxing the inductive biases of these models, we can match or exceed performance on energy-conserving systems while dramatically improving performance on practical, non-conservative systems. We extend this approach to constructing transition models for common Mujoco environments, showing that our model can appropriately balance inductive biases with the flexibility required for model-based control.  ( 2 min )
    Active label cleaning for improved dataset quality under resource constraints. (arXiv:2109.00574v2 [cs.CV] UPDATED)
    Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation - which we term "active label cleaning". We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a new medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed active label cleaning enables correcting labels up to 4 times more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.  ( 2 min )
    Application of the Affinity Propagation Clustering Technique to obtain traffic accident clusters at macro, meso, and micro levels. (arXiv:2202.05175v1 [cs.LG])
    Accident grouping is a crucial step in identifying accident-prone locations. Among the different accident grouping modes, clustering methods present excellent performance for discovering different distributions of accidents in space. This work introduces the Affinity Propagation Clustering (APC) approach for grouping traffic accidents based on criteria of similarity and dissimilarity between distributions of data points in space. The APC provides more realistic representations of the distribution of events from similarity matrices between instances. The results showed that when representative data samples obtain, the preference parameter of similarity provides the necessary performance to calibrate the model and generate clusters according to the desired characteristics. In addition, the study demonstrates that the preference parameter as a continuous parameter facilitates the calibration and control of the model's convergence, allowing the discovery of clustering patterns with less effort and greater control of the results  ( 2 min )
    Joint Shapley values: a measure of joint feature importance. (arXiv:2107.11357v2 [stat.ML] UPDATED)
    The Shapley value is one of the most widely used measures of feature importance partly as it measures a feature's average effect on a model's prediction. We introduce joint Shapley values, which directly extend Shapley's axioms and intuitions: joint Shapley values measure a set of features' average contribution to a model's prediction. We prove the uniqueness of joint Shapley values, for any order of explanation. Results for games show that joint Shapley values present different insights from existing interaction indices, which assess the effect of a feature within a set of features. The joint Shapley values provide intuitive results in ML attribution problems. With binary features, we present a presence-adjusted global value that is more consistent with local intuitions than the usual approach.  ( 2 min )
    Optimal potentials for hedging algorithms. (arXiv:2106.10717v5 [cs.LG] UPDATED)
    We study a family of potential functions for online learning. We show that if the potential function has strictly positive derivatives of order 1-4 then the min-max optimal strategy for the adversary is Brownian motion. Using that fact we analyze different potential functions and show that the Normal-Hedge potential provides the tightest upper bounds on the cumulative regret of the top {\epsilon}-percentile.  ( 2 min )
    Heterogeneous Calibration: A post-hoc model-agnostic framework for improved generalization. (arXiv:2202.04837v1 [stat.ML])
    We introduce the notion of heterogeneous calibration that applies a post-hoc model-agnostic transformation to model outputs for improving AUC performance on binary classification tasks. We consider overconfident models, whose performance is significantly better on training vs test data and give intuition onto why they might under-utilize moderately effective simple patterns in the data. We refer to these simple patterns as heterogeneous partitions of the feature space and show theoretically that perfectly calibrating each partition separately optimizes AUC. This gives a general paradigm of heterogeneous calibration as a post-hoc procedure by which heterogeneous partitions of the feature space are identified through tree-based algorithms and post-hoc calibration techniques are applied to each partition to improve AUC. While the theoretical optimality of this framework holds for any model, we focus on deep neural networks (DNNs) and test the simplest instantiation of this paradigm on a variety of open-source datasets. Experiments demonstrate the effectiveness of this framework and the future potential for applying higher-performing partitioning schemes along with more effective calibration techniques.  ( 2 min )
    Learning Intrusion Prevention Policies through Optimal Stopping. (arXiv:2106.07160v8 [cs.AI] UPDATED)
    We study automated intrusion prevention using reinforcement learning. In a novel approach, we formulate the problem of intrusion prevention as an optimal stopping problem. This formulation allows us insight into the structure of the optimal policies, which turn out to be threshold based. Since the computation of the optimal defender policy using dynamic programming is not feasible for practical cases, we approximate the optimal policy through reinforcement learning in a simulation environment. To define the dynamics of the simulation, we emulate the target infrastructure and collect measurements. Our evaluations show that the learned policies are close to optimal and that they indeed can be expressed using thresholds.  ( 2 min )
    Spectral Bias in Practice: The Role of Function Frequency in Generalization. (arXiv:2110.02424v3 [cs.LG] UPDATED)
    Despite their ability to represent highly expressive functions, deep learning models seem to find simple solutions that generalize surprisingly well. Spectral bias -- the tendency of neural networks to prioritize learning low frequency functions -- is one possible explanation for this phenomenon, but so far spectral bias has primarily been observed in theoretical models and simplified experiments. In this work, we propose methodologies for measuring spectral bias in modern image classification networks on CIFAR-10 and ImageNet. We find that these networks indeed exhibit spectral bias, and that interventions that improve test accuracy on CIFAR-10 tend to produce learned functions that have higher frequencies overall but lower frequencies in the vicinity of examples from each class. This trend holds across variation in training time, model architecture, number of training examples, data augmentation, and self-distillation. We also explore the connections between function frequency and image frequency and find that spectral bias is sensitive to the low frequencies prevalent in natural images. On ImageNet, we find that learned function frequency also varies with internal class diversity, with higher frequencies on more diverse classes. Our work enables measuring and ultimately influencing the spectral behavior of neural networks used for image classification, and is a step towards understanding why deep models generalize well.  ( 2 min )
    Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval. (arXiv:2202.03384v2 [cs.IR] UPDATED)
    With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence queries has become an important demand and attracts increasing research attention. Despite the decent performance, existing text-video retrieval models in vision and language communities are impractical for large-scale Web search because they adopt brute-force search based on high-dimensional embeddings. To improve efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS) to post-process the learned embeddings. Unfortunately, separate compression from feature encoding degrades the robustness of representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos and preserve comprehensive semantic information. By performing Asymmetric-Quantized Contrastive Learning (AQ-CL) across views, HCQ aligns texts and videos at coarse-grained and multiple fine-grained levels. This hybrid-grained learning strategy serves as strong supervision on the cross-view video quantization model, where contrastive learning at different levels can be mutually promoted. Extensive experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation. Code and configurations are available at https://github.com/gimpong/WWW22-HCQ.  ( 2 min )
    Nearly Horizon-Free Offline Reinforcement Learning. (arXiv:2103.14077v3 [stat.ML] UPDATED)
    We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDP). For tabular MDP with $S$ states and $A$ actions, or linear MDP with anchor points and feature dimension $d$, given the collected $K$ episodes data with minimum visiting probability of (anchor) state-action pairs $d_m$, we obtain nearly horizon $H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by $1$. Specifically: 1. For offline policy evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} \right)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and does not have additional dependency on $\mathrm{poly}\left(H, S, A, d\right)$ in higher-order term. 2.For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} + \frac{\min(S, d)}{Kd_m}\right)$ sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a high-order term, improving upon the best known result by \cite{cui2020plug} that has additional $\mathrm{poly}\left(H, S, d\right)$ factors in the main term. To the best of our knowledge, these are the \emph{first} set of nearly horizon-free bounds for episodic time-homogeneous offline tabular MDP and linear MDP with anchor points. Central to our analysis is a simple yet effective recursion based method to bound a "total variance" term in the offline scenarios, which could be of individual interest.  ( 2 min )
    Using Color To Identify Insider Threats. (arXiv:2111.13176v3 [cs.CV] UPDATED)
    Insider threats are costly, hard to detect, and unfortunately rising in occurrence. Seeking to improve detection of such threats, we develop novel techniques to enable us to extract powerful features and augment attack vectors for greater classification power. Most importantly, we generate high quality color image encodings of user behavior that do not have the downsides of traditional greyscale image encodings. Combined, they form Computer Vision User and Entity Behavior Analytics, a detection system designed from the ground up to improve upon advancements in academia and mitigate the issues that prevent the usage of advanced models in industry. The proposed system beats state-of-art methods used in academia and as well as in industry on a gold standard benchmarking dataset.  ( 2 min )
    The Common Intuition to Transfer Learning Can Win or Lose: Case Studies for Linear Regression. (arXiv:2103.05621v2 [cs.LG] UPDATED)
    We study a fundamental transfer learning process from source to target linear regression tasks, including overparameterized settings where there are more learned parameters than data samples. The target task learning is addressed by using its training data together with the parameters previously computed for the source task. We define a transfer learning approach to the target task as a linear regression optimization with a regularization on the distance between the to-be-learned target parameters and the already-learned source parameters. We analytically characterize the generalization performance of our transfer learning approach and demonstrate its ability to resolve the peak in generalization errors in double descent phenomena of the minimum $\ell_2$-norm solution to linear regression. Moreover, we show that for sufficiently related tasks, the optimally tuned transfer learning approach can outperform the optimally tuned ridge regression method, even when the true parameter vector conforms to an isotropic Gaussian prior distribution. Namely, we demonstrate that transfer learning can beat the minimum mean square error (MMSE) solution of the independent target task. Our results emphasize the ability of transfer learning to extend the solution space to the target task and, by that, to have an improved MMSE solution. We formulate the linear MMSE solution to our transfer learning setting and point out its key differences from the common design philosophy to transfer learning.  ( 2 min )
    Using reinforcement learning to autonomously identify the source of errors for agents in a group mission. (arXiv:2107.09232v3 [cs.RO] UPDATED)
    When agents are swarmed to execute a mission, there is often a sudden failure of some of the agents observed from the command base. Solely relying on the communication between the command base and the concerning agent, it is generally difficult to determine whether the failure is caused by actuators (hypothesis, $h_a$) or sensors (hypothesis, $h_s$) However, by instigating collusion between the agents, we can pinpoint the cause of the failure, that is, for $h_a$, we expect to detect corresponding displacements while for $h_a$ we do not. We recommend that such swarm strategies applied to grasp the situation be autonomously generated by artificial intelligence (AI). Preferable actions (${e.g.,}$ the collision) for the distinction will be those maximizing the difference between the expected behaviors for each hypothesis, as a value function. Such actions exist, however, only very sparsely in the whole possibilities, for which the conventional search based on gradient methods does not make sense. To mitigate the abovementioned shortcoming, we (successfully) applied the reinforcement learning technique, achieving the maximization of such a sparse value function. Machine learning was concluded autonomously.The colliding action is the basis of distinguishing the hypothesizes. To pinpoint the agent with the actuator error via this action, the agents behave as if they are assisting the malfunctioning one to achieve a given mission.  ( 2 min )
    Topogivity: A Machine-Learned Chemical Rule for Discovering Topological Materials. (arXiv:2202.05255v1 [cond-mat.mtrl-sci])
    Topological materials present unconventional electronic properties that make them attractive for both basic science and next-generation technological applications. The majority of currently-known topological materials have been discovered using methods that involve symmetry-based analysis of the quantum wavefunction. Here we use machine learning to develop a simple-to-use heuristic chemical rule that diagnoses with a high accuracy whether a material is topological using only its chemical formula. This heuristic rule is based on a notion that we term topogivity, a machine-learned numerical value for each element that loosely captures its tendency to form topological materials. We next implement a high-throughput strategy for discovering topological materials based on the heuristic topogivity-rule prediction followed by ab initio validation. This way, we discover new topological materials that are not diagnosable using symmetry indicators, including several that may be promising for experimental observation.  ( 2 min )
    EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction. (arXiv:2202.05146v1 [q-bio.BM])
    Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.  ( 2 min )
    Spectral Propagation Graph Network for Few-shot Time Series Classification. (arXiv:2202.04769v1 [cs.LG])
    Few-shot Time Series Classification (few-shot TSC) is a challenging problem in time series analysis. It is more difficult to classify when time series of the same class are not completely consistent in spectral domain or time series of different classes are partly consistent in spectral domain. To address this problem, we propose a novel method named Spectral Propagation Graph Network (SPGN) to explicitly model and propagate the spectrum-wise relations between different time series with graph network. To the best of our knowledge, SPGN is the first to utilize spectral comparisons in different intervals and involve spectral propagation across all time series with graph networks for few-shot TSC. SPGN first uses bandpass filter to expand time series in spectral domain for calculating spectrum-wise relations between time series. Equipped with graph networks, SPGN then integrates spectral relations with label information to make spectral propagation. The further study conveys the bi-directional effect between spectral relations acquisition and spectral propagation. We conduct extensive experiments on few-shot TSC benchmarks. SPGN outperforms state-of-the-art results by a large margin in $4\% \sim 13\%$. Moreover, SPGN surpasses them by around $12\%$ and $9\%$ under cross-domain and cross-way settings respectively.  ( 2 min )
    Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast Deployment. (arXiv:2202.05048v1 [cs.LG])
    To adopt convolutional neural networks (CNN) for a range of resource-constrained targets, it is necessary to compress the CNN models by performing quantization, whereby precision representation is converted to a lower bit representation. To overcome problems such as sensitivity of the training dataset, high computational requirements, and large time consumption, post-training quantization methods that do not require retraining have been proposed. In addition, to compensate for the accuracy drop without retraining, previous studies on post-training quantization have proposed several complementary methods: calibration, schemes, clipping, granularity, and mixed-precision. To generate a quantized model with minimal error, it is necessary to study all possible combinations of the methods because each of them is complementary and the CNN models have different characteristics. However, an exhaustive or a heuristic search is either too time-consuming or suboptimal. To overcome this challenge, we propose an auto-tuner known as Quantune, which builds a gradient tree boosting model to accelerate the search for the configurations of quantization and reduce the quantization error. We evaluate and compare Quantune with the random, grid, and genetic algorithms. The experimental results show that Quantune reduces the search time for quantization by approximately 36.5x with an accuracy loss of 0.07 ~ 0.65% across six CNN models, including the fragile ones (MobileNet, SqueezeNet, and ShuffleNet). To support multiple targets and adopt continuously evolving quantization works, Quantune is implemented on a full-fledged compiler for deep learning as an open-sourced project.  ( 2 min )
    Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression. (arXiv:2202.05245v1 [econ.EM])
    We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE), with linear regression models. As the development of machine learning for causal inference, a wide range of large-scale models for causality are gaining attention. One problem is that suspicions have been raised that the large-scale models are prone to overfitting to observations with sample selection, hence the large models may not be suitable for causal prediction. In this study, to resolve the suspicious, we investigate on the validity of causal inference methods for overparameterized models, by applying the recent theory of benign overfitting (Bartlett et al., 2020). Specifically, we consider samples whose distribution switches depending on an assignment rule, and study the prediction of CATE with linear models whose dimension diverges to infinity. We focus on two methods: the T-learner, which based on a difference between separately constructed estimators with each treatment group, and the inverse probability weight (IPW)-learner, which solves another regression problem approximated by a propensity score. In both methods, the estimator consists of interpolators that fit the samples perfectly. As a result, we show that the T-learner fails to achieve the consistency except the random assignment, while the IPW-learner converges the risk to zero if the propensity score is known. This difference stems from that the T-learner is unable to preserve eigenspaces of the covariances, which is necessary for benign overfitting in the overparameterized setting. Our result provides new insights into the usage of causal inference methods in the overparameterizated setting, in particular, doubly robust estimators.  ( 2 min )
    An Information-Theoretic Justification for Model Pruning. (arXiv:2102.08329v4 [cs.LG] UPDATED)
    We study the neural network (NN) compression problem, viewing the tension between the compression ratio and NN performance through the lens of rate-distortion theory. We choose a distortion metric that reflects the effect of NN compression on the model output and derive the tradeoff between rate (compression) and distortion. In addition to characterizing theoretical limits of NN compression, this formulation shows that \emph{pruning}, implicitly or explicitly, must be a part of a good compression algorithm. This observation bridges a gap between parts of the literature pertaining to NN and data compression, respectively, providing insight into the empirical success of model pruning. Finally, we propose a novel pruning strategy derived from our information-theoretic formulation and show that it outperforms the relevant baselines on CIFAR-10 and ImageNet datasets.  ( 2 min )
    Differentiable Wavetable Synthesis. (arXiv:2111.10003v3 [cs.SD] UPDATED)
    Differentiable Wavetable Synthesis (DWTS) is a technique for neural audio synthesis which learns a dictionary of one-period waveforms i.e. wavetables, through end-to-end training. We achieve high-fidelity audio synthesis with as little as 10 to 20 wavetables and demonstrate how a data-driven dictionary of waveforms opens up unprecedented one-shot learning paradigms on short audio clips. Notably, we show audio manipulations, such as high quality pitch-shifting, using only a few seconds of input audio. Lastly, we investigate performance gains from using learned wavetables for realtime and interactive audio synthesis.  ( 2 min )
    GEOM: Energy-annotated molecular conformations for property prediction and molecular generation. (arXiv:2006.05531v4 [physics.comp-ph] UPDATED)
    Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.  ( 2 min )
    Sparse-Dyn: Sparse Dynamic Graph Multi-representation Learning via Event-based Sparse Temporal Attention Network. (arXiv:2201.01384v2 [cs.LG] UPDATED)
    Dynamic graph neural networks have been widely used in modeling and representation learning of graph structure data. Current dynamic representation learning focuses on either discrete learning which results in temporal information loss or continuous learning that involves heavy computation. In this work, we proposed a novel dynamic graph neural network, Sparse-Dyn. It adaptively encodes temporal information into a sequence of patches with an equal amount of temporal-topological structure. Therefore, while avoiding the use of snapshots which causes information loss, it also achieves a finer time granularity, which is close to what continuous networks could provide. In addition, we also designed a lightweight module, Sparse Temporal Transformer, to compute node representations through both structural neighborhoods and temporal dynamics. Since the fully-connected attention conjunction is simplified, the computation cost is far lower than the current state-of-the-arts. Link prediction experiments are conducted on both continuous and discrete graph datasets. Through comparing with several state-of-the-art graph embedding baselines, the experimental results demonstrate that Sparse-Dyn has a faster inference speed while having competitive performance.  ( 2 min )
    Transferred Q-learning. (arXiv:2202.04709v1 [cs.LG])
    We consider $Q$-learning with knowledge transfer, using samples from a target reinforcement learning (RL) task as well as source samples from different but related RL tasks. We propose transfer learning algorithms for both batch and online $Q$-learning with offline source studies. The proposed transferred $Q$-learning algorithm contains a novel re-targeting step that enables vertical information-cascading along multiple steps in an RL task, besides the usual horizontal information-gathering as transfer learning (TL) for supervised learning. We establish the first theoretical justifications of TL in RL tasks by showing a faster rate of convergence of the $Q$ function estimation in the offline RL transfer, and a lower regret bound in the offline-to-online RL transfer under certain similarity assumptions. Empirical evidences from both synthetic and real datasets are presented to back up the proposed algorithm and our theoretical results.  ( 2 min )
    Learning Wave Propagation with Attention-Based Convolutional Recurrent Autoencoder Net. (arXiv:2201.06628v2 [physics.flu-dyn] UPDATED)
    In this paper, we present an end-to-end attention-based convolutional recurrent autoencoder network (AB-CRAN) for data-driven modeling of wave propagation phenomena. To construct the low-dimensional learning model, we employ a denoising-based convolutional autoencoder from the full-order snapshots of wave propagation generated by solving hyperbolic partial differential equations. The proposed deep neural network architecture relies on the attention-based recurrent neural network (RNN) with long short-term memory (LSTM) cells for constructing the trajectory in the latent space. We assess the proposed AB-CRAN framework against the standard RNN-LSTM for the low-dimensional learning of wave propagation. To demonstrate the effectiveness of the AB-CRAN model, we consider three benchmark problems namely one-dimensional linear convection, nonlinear viscous Burgers, and two-dimensional Saint-Venant shallow water system. Using the time-series datasets from the benchmark problems, our novel AB-CRAN architecture accurately captures the wave amplitude and preserves the wave characteristics of the solution for long time horizons. The attention-based sequence-to-sequence network increases the time-horizon of prediction by five times compared to the standard RNN-LSTM. Denoising autoencoder further reduces the mean squared error of prediction and improves the generalization capability in the parameter space.  ( 2 min )
    Understanding Riemannian Acceleration via a Proximal Extragradient Framework. (arXiv:2111.02763v2 [math.OC] UPDATED)
    We contribute to advancing the understanding of Riemannian accelerated gradient methods. In particular, we revisit Accelerated Hybrid Proximal Extragradient(A-HPE), a powerful framework for obtaining Euclidean accelerated methods \citep{monteiro2013accelerated}. Building on A-HPE, we then propose and analyze Riemannian A-HPE. The core of our analysis consists of two key components: (i) a set of new insights into Euclidean A-HPE itself; and (ii) a careful control of metric distortion caused by Riemannian geometry. We illustrate our framework by obtaining a few existing and new Riemannian accelerated gradient methods as special cases, while characterizing their acceleration as corollaries of our main results.  ( 2 min )
    Universal Paralinguistic Speech Representations Using Self-Supervised Conformers. (arXiv:2110.04621v3 [cs.SD] UPDATED)
    Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2 second context-windows achieve 96\% the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near optimal performance on all tasks.  ( 2 min )
    ABG: A Multi-Party Mixed Protocol Framework for Privacy-Preserving Cooperative Learning. (arXiv:2202.02928v2 [cs.CR] UPDATED)
    Cooperative learning, that enables two or more data owners to jointly train a model, has been widely adopted to solve the problem of insufficient training data in machine learning. Nowadays, there is an urgent need for institutions and organizations to train a model cooperatively while keeping each other's data privately. To address the issue of privacy-preserving in collaborative learning, secure outsourced computation and federated learning are two typical methods. Nevertheless, there are many drawbacks for these two methods when they are leveraged in cooperative learning. For secure outsourced computation, semi-honest servers need to be introduced. Once the outsourced servers collude or perform other active attacks, the privacy of data will be disclosed. For federated learning, it is difficult to apply to the scenarios where vertically partitioned data are distributed over multiple parties. In this work, we propose a multi-party mixed protocol framework, ABG$^n$, which effectively implements arbitrary conversion between Arithmetic sharing (A), Boolean sharing (B) and Garbled-Circuits sharing (G) for $n$-party scenarios. Based on ABG$^n$, we design a privacy-preserving multi-party cooperative learning system, which allows different data owners to cooperate in machine learning in terms of data security and privacy-preserving. Additionally, we design specific privacy-preserving computation protocols for some typical machine learning methods such as logistic regression and neural networks. Compared with previous work, the proposed method has a wider scope of application and does not need to rely on additional servers. Finally, we evaluate the performance of ABG$^n$ on the local setting and on the public cloud setting. The experiments indicate that ABG$^n$ has excellent performance, especially in the network environment with low latency.  ( 3 min )
    Forecasting large-scale circulation regimes using deformable convolutional neural networks and global spatiotemporal climate data. (arXiv:2202.04964v1 [cs.LG])
    Classifying the state of the atmosphere into a finite number of large-scale circulation regimes is a popular way of investigating teleconnections, the predictability of severe weather events, and climate change. Here, we investigate a supervised machine learning approach based on deformable convolutional neural networks (deCNNs) and transfer learning to forecast the North Atlantic-European weather regimes during extended boreal winter for 1 to 15 days into the future. We apply state-of-the-art interpretation techniques from the machine learning literature to attribute particular regions of interest or potential teleconnections relevant for any given weather cluster prediction or regime transition. We demonstrate superior forecasting performance relative to several classical meteorological benchmarks, as well as logistic regression and random forests. Due to its wider field of view, we also observe deCNN achieving considerably better performance than regular convolutional neural networks at lead times beyond 5-6 days. Finally, we find transfer learning to be of paramount importance, similar to previous data-driven atmospheric forecasting studies.  ( 2 min )
    Dimension Reduction with Prior Information for Knowledge Discovery. (arXiv:2111.13646v2 [stat.ML] UPDATED)
    This paper addresses the problem of mapping high-dimensional data to a low-dimensional space, in the presence of other known features. This problem is ubiquitous in science and engineering as there are often controllable/measurable features in most applications. Furthermore, the discovered features in previous analyses can become the known features in subsequent analyses, repeatedly. To solve this problem, this paper proposes a broad class of methods, which is referred to as conditional multidimensional scaling. An algorithm for optimizing the objective function of conditional multidimensional scaling is also developed. The proposed framework is illustrated with kinship terms, facial expressions, and simulated car-brand perception examples. These examples demonstrate the benefits of the framework for being able to marginalize out the known features to uncover unknown, unanticipated features in the reduced-dimension space and for enabling a repeated, more straightforward knowledge discovery process. Computer codes for this work are available in the open-source cml R package.  ( 2 min )
    Privacy-Preserving Multi-Target Multi-Domain Recommender Systems with Assisted AutoEncoders. (arXiv:2110.13340v2 [cs.IR] UPDATED)
    A long-standing challenge in Recommender Systems (RCs) is the data sparsity problem that often arises when users rate very few items. Multi-Target Multi-Domain Recommender Systems (MTMDR) aim to improve the recommendation performance in multiple domains simultaneously. The existing works assume that the data of different domains can be fully shared, and the computation can be performed in a centralized manner. However, in many realistic scenarios, separate recommender systems are operated by different organizations, which do not allow the sharing of private data, models, and recommendation tasks. This work proposes Multi-Target Assisted Learning (MTAL) with Assisted AutoEncoders (AAE) to help organizational learners improve their recommendation performance simultaneously without sharing sensitive assets. Moreover, AAE has a broad application scope since it allows explicit or implicit feedback, user- or item-based alignment, and with or without side information. Extensive experiments demonstrate that our method significantly outperforms the case where each domain is locally trained, and it performs competitively with the centralized training where all data are shared. As a result, AAE can effectively integrate organizations from various domains to form a community of shared interest.  ( 2 min )
    Graph Neural Networks with Learnable Structural and Positional Representations. (arXiv:2110.07875v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become the standard learning architectures for graphs. GNNs have been applied to numerous domains ranging from quantum chemistry, recommender systems to knowledge graphs and natural language processing. A major issue with arbitrary graphs is the absence of canonical positional information of nodes, which decreases the representation power of GNNs to distinguish e.g. isomorphic nodes and other graph symmetries. An approach to tackle this issue is to introduce Positional Encoding (PE) of nodes, and inject it into the input layer, like in Transformers. Possible graph PE are Laplacian eigenvectors. In this work, we propose to decouple structural and positional representations to make easy for the network to learn these two essential properties. We introduce a novel generic architecture which we call LSPE (Learnable Structural and Positional Encodings). We investigate several sparse and fully-connected (Transformer-like) GNNs, and observe a performance increase for molecular datasets, from 1.79% up to 64.14% when considering learnable PE for both GNN classes.  ( 2 min )
    Dimensionality reduction for prediction: Application to Bitcoin and Ethereum. (arXiv:2112.15036v2 [q-fin.ST] UPDATED)
    The objective of this paper is to assess the performances of dimensionality reduction techniques to establish a link between cryptocurrencies. We have focused our analysis on the two most traded cryptocurrencies: Bitcoin and Ethereum. To perform our analysis, we took log returns and added some covariates to build our data set. We first introduced the pearson correlation coefficient in order to have a preliminary assessment of the link between Bitcoin and Ethereum. We then reduced the dimension of our data set using canonical correlation analysis and principal component analysis. After performing an analysis of the links between Bitcoin and Ethereum with both statistical techniques, we measured their performance on forecasting Ethereum returns with Bitcoin s features.  ( 2 min )
    A Survey on Artificial Intelligence for Source Code: A Dialogue Systems Perspective. (arXiv:2202.04847v1 [cs.CL])
    In this survey paper, we overview major deep learning methods used in Natural Language Processing (NLP) and source code over the last 35 years. Next, we present a survey of the applications of Artificial Intelligence (AI) for source code, also known as Code Intelligence (CI) and Programming Language Processing (PLP). We survey over 287 publications and present a software-engineering centered taxonomy for CI placing each of the works into one category describing how it best assists the software development cycle. Then, we overview the field of conversational assistants and their applications in software engineering and education. Lastly, we highlight research opportunities at the intersection of AI for code and conversational assistants and provide future directions for researching conversational assistants with CI capabilities.  ( 2 min )
    Bayesian Optimisation for Mixed-Variable Inputs using Value Proposals. (arXiv:2202.04832v1 [stat.ML])
    Many real-world optimisation problems are defined over both categorical and continuous variables, yet efficient optimisation methods such asBayesian Optimisation (BO) are not designed tohandle such mixed-variable search spaces. Re-cent approaches to this problem cast the selection of the categorical variables as a bandit problem, operating independently alongside a BO component which optimises the continuous variables. In this paper, we adopt a holistic view and aim to consolidate optimisation of the categorical and continuous sub-spaces under a single acquisition metric. We derive candidates from the ExpectedImprovement criterion, which we call value proposals, and use these proposals to make selections on both the categorical and continuous components of the input. We show that this unified approach significantly outperforms existing mixed-variable optimisation approaches across several mixed-variable black-box optimisation tasks.  ( 2 min )
    Boosting Graph Neural Networks by Injecting Pooling in Message Passing. (arXiv:2202.04768v1 [cs.LG])
    There has been tremendous success in the field of graph neural networks (GNNs) as a result of the development of the message-passing (MP) layer, which updates the representation of a node by combining it with its neighbors to address variable-size and unordered graphs. Despite the fruitful progress of MP GNNs, their performance can suffer from over-smoothing, when node representations become too similar and even indistinguishable from one another. Furthermore, it has been reported that intrinsic graph structures are smoothed out as the GNN layer increases. Inspired by the edge-preserving bilateral filters used in image processing, we propose a new, adaptable, and powerful MP framework to prevent over-smoothing. Our bilateral-MP estimates a pairwise modular gradient by utilizing the class information of nodes, and further preserves the global graph structure by using the gradient when the aggregating function is applied. Our proposed scheme can be generalized to all ordinary MP GNNs. Experiments on five medium-size benchmark datasets using four state-of-the-art MP GNNs indicate that the bilateral-MP improves performance by alleviating over-smoothing. By inspecting quantitative measurements, we additionally validate the effectiveness of the proposed mechanism in preventing the over-smoothing issue.  ( 2 min )
    Gaussian Process Bandit Optimization with Few Batches. (arXiv:2110.07788v3 [stat.ML] UPDATED)
    In this paper, we consider the problem of black-box optimization using Gaussian Process (GP) bandit optimization with a small number of batches. Assuming the unknown function has a low norm in the Reproducing Kernel Hilbert Space (RKHS), we introduce a batch algorithm inspired by batched finite-arm bandit algorithms, and show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$, where the $O^\ast(\cdot)$ notation hides dimension-independent logarithmic factors and $\gamma_T$ is the maximum information gain associated with the kernel. This bound is near-optimal for several kernels of interest and improves on the typical $O^\ast(\sqrt{T}\gamma_T)$ bound, and our approach is arguably the simplest among algorithms attaining this improvement. In addition, in the case of a constant number of batches (not depending on $T$), we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches, focusing on the squared exponential and Mat\'ern kernels. The algorithmic upper bounds are shown to be nearly minimax optimal via analogous algorithm-independent lower bounds.  ( 2 min )
    The effective noise of Stochastic Gradient Descent. (arXiv:2112.10852v2 [cond-mat.dis-nn] UPDATED)
    Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently-introduced variant, \emph{persistent} SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of SGD noise. We find that the two noise measures behave similarly as a function of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.  ( 2 min )
    Online Learning to Transport via the Minimal Selection Principle. (arXiv:2202.04732v1 [cs.LG])
    Motivated by robust dynamic resource allocation in operations research, we study the Online Learning to Transport (OLT) problem where the decision variable is a probability measure, an infinite-dimensional object. We draw connections between online learning, optimal transport, and partial differential equations through an insight called the minimal selection principle, originally studied in the Wasserstein gradient flow setting by Ambrosio et al. (2005). This allows us to extend the standard online learning framework to the infinite-dimensional setting seamlessly. Based on our framework, we derive a novel method called the minimal selection or exploration (MSoE) algorithm to solve OLT problems using mean-field approximation and discretization techniques. In the displacement convex setting, the main theoretical message underpinning our approach is that minimizing transport cost over time (via the minimal selection principle) ensures optimal cumulative regret upper bounds. On the algorithmic side, our MSoE algorithm applies beyond the displacement convex setting, making the mathematical theory of optimal transport practically relevant to non-convex settings common in dynamic resource allocation.  ( 2 min )
    Distilling Image Classifiers in Object Detectors. (arXiv:2106.05209v2 [cs.CV] UPDATED)
    Knowledge distillation constitutes a simple yet effective way to improve the performance of a compact student network by exploiting the knowledge of a more powerful teacher. Nevertheless, the knowledge distillation literature remains limited to the scenario where the student and the teacher tackle the same task. Here, we investigate the problem of transferring knowledge not only across architectures but also across tasks. To this end, we study the case of object detection and, instead of following the standard detector-to-detector distillation approach, introduce a classifier-to-detector knowledge transfer framework. In particular, we propose strategies to exploit the classification teacher to improve both the detector's recognition accuracy and localization performance. Our experiments on several detectors with different backbones demonstrate the effectiveness of our approach, allowing us to outperform the state-of-the-art detector-to-detector distillation methods.  ( 2 min )
    Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space Perspective. (arXiv:2110.03095v2 [cs.LG] UPDATED)
    Deep neural networks (DNNs) often rely on easy-to-learn discriminatory features, or cues, that are not necessarily essential to the problem at hand. For example, ducks in an image may be recognized based on their typical background scenery, such as lakes or streams. This phenomenon, also known as shortcut learning, is emerging as a key limitation of the current generation of machine learning models. In this work, we introduce a set of experiments to deepen our understanding of shortcut learning and its implications. We design a training setup with several shortcut cues, named WCST-ML, where each cue is equally conducive to the visual recognition problem at hand. Even under equal opportunities, we observe that (1) certain cues are preferred to others, (2) solutions biased to the easy-to-learn cues tend to converge to relatively flat minima on the loss surface, and (3) the solutions focusing on those preferred cues are far more abundant in the parameter space. We explain the abundance of certain cues via their Kolmogorov (descriptional) complexity: solutions corresponding to Kolmogorov-simple cues are abundant in the parameter space and are thus preferred by DNNs. Our studies are based on the synthetic dataset DSprites and the face dataset UTKFace. In our WCST-ML, we observe that the inborn bias of models leans toward simple cues, such as color and ethnicity. Our findings emphasize the importance of active human intervention to remove the inborn model biases that may cause negative societal impacts.  ( 2 min )
    Label Ranking through Nonparametric Regression. (arXiv:2111.02749v2 [cs.LG] UPDATED)
    Label Ranking (LR) corresponds to the problem of learning a hypothesis that maps features to rankings over a finite set of labels. We adopt a nonparametric regression approach to LR and obtain theoretical performance guarantees for this fundamental practical problem. We introduce a generative model for Label Ranking, in noiseless and noisy nonparametric regression settings, and provide sample complexity bounds for learning algorithms in both cases. In the noiseless setting, we study the LR problem with full rankings and provide computationally efficient algorithms using decision trees and random forests in the high-dimensional regime. In the noisy setting, we consider the more general cases of LR with incomplete and partial rankings from a statistical viewpoint and obtain sample complexity bounds using the One-Versus-One approach of multiclass classification. Finally, we complement our theoretical contributions with experiments, aiming to understand how the input regression noise affects the observed output.  ( 2 min )
    Quantum machine learning beyond kernel methods. (arXiv:2110.13162v2 [quant-ph] UPDATED)
    Machine learning algorithms based on parametrized quantum circuits are a prime candidate for near-term applications on noisy quantum computers. Yet, our understanding of how these quantum machine learning models compare, both mutually and to classical models, remains limited. Previous works achieved important steps in this direction by showing a close connection between some of these quantum models and kernel methods, well-studied in classical machine learning. In this work, we identify the first unifying framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show how data re-uploading circuits, a generalization of linear models, can be efficiently mapped into equivalent linear quantum models. Going further, we also consider the experimentally-relevant resource requirements of these models in terms of qubit number and data-sample efficiency, i.e., amount of data needed to learn. We establish learning separations demonstrating that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially many more data points. Our results constitute significant strides towards a more comprehensive theory of quantum machine learning models as well as provide guidelines on which models may be better suited from experimental perspectives.  ( 2 min )
    Training Strategies for Deep Learning Gravitational-Wave Searches. (arXiv:2106.03741v2 [astro-ph.IM] UPDATED)
    Compact binary systems emit gravitational radiation which is potentially detectable by current Earth bound detectors. Extracting these signals from the instruments' background noise is a complex problem and the computational cost of most current searches depends on the complexity of the source model. Deep learning may be capable of finding signals where current algorithms hit computational limits. Here we restrict our analysis to signals from non-spinning binary black holes and systematically test different strategies by which training data is presented to the networks. To assess the impact of the training strategies, we re-analyze the first published networks and directly compare them to an equivalent matched-filter search. We find that the deep learning algorithms can generalize low signal-to-noise ratio (SNR) signals to high SNR ones but not vice versa. As such, it is not beneficial to provide high SNR signals during training, and fastest convergence is achieved when low SNR samples are provided early on. During testing we found that the networks are sometimes unable to recover any signals when a false alarm probability $<10^{-3}$ is required. We resolve this restriction by applying a modification we call unbounded Softmax replacement (USR) after training. With this alteration we find that the machine learning search retains $\geq 91.5\%$ of the sensitivity of the matched-filter search down to a false-alarm rate of 1 per month.  ( 2 min )
    FedDQ: Communication-Efficient Federated Learning with Descending Quantization. (arXiv:2110.02291v4 [cs.LG] UPDATED)
    Federated learning (FL) is an emerging privacy-preserving distributed learning scheme. The large model size and frequent model aggregation cause serious communication issue for FL. Many techniques, such as model compression and quantization, have been proposed to reduce the communication volume. Existing adaptive quantization schemes use ascending-trend quantization where the quantization level increases with the training stages. In this paper, we investigate the adaptive quantization scheme by maximizing the convergence rate. It is shown that the optimal quantization level is directly related to the range of the model updates. Given the model is supposed to converge with training, the range of model updates will get smaller, indicating that the quantization level should decrease with the training stages. Based on the theoretical analysis, a descending quantization scheme is thus proposed. Experimental results show that the proposed descending quantization scheme not only reduces more communication volume but also helps FL converge faster, when compared with current ascending quantization scheme.  ( 2 min )
    Filling the G_ap_s: Multivariate Time Series Imputation by Graph Neural Networks. (arXiv:2108.00298v3 [cs.LG] UPDATED)
    Dealing with missing values and incomplete time series is a labor-intensive, tedious, inevitable task when handling data coming from real-world applications. Effective spatio-temporal representations would allow imputation methods to reconstruct missing temporal data by exploiting information coming from sensors at different locations. However, standard methods fall short in capturing the nonlinear time and space dependencies existing within networks of interconnected sensors and do not take full advantage of the available - and often strong - relational information. Notably, most state-of-the-art imputation methods based on deep learning do not explicitly model relational aspects and, in any case, do not exploit processing frameworks able to adequately represent structured spatio-temporal data. Conversely, graph neural networks have recently surged in popularity as both expressive and scalable tools for processing sequential data with relational inductive biases. In this work, we present the first assessment of graph neural networks in the context of multivariate time series imputation. In particular, we introduce a novel graph neural network architecture, named GRIN, which aims at reconstructing missing data in the different channels of a multivariate time series by learning spatio-temporal representations through message passing. Empirical results show that our model outperforms state-of-the-art methods in the imputation task on relevant real-world benchmarks with mean absolute error improvements often higher than 20%.  ( 2 min )
    Hardness of Noise-Free Learning for Two-Hidden-Layer Neural Networks. (arXiv:2202.05258v1 [cs.LG])
    We give exponential statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from membership queries.  ( 2 min )
    Diverse Image Generation via Self-Conditioned GANs. (arXiv:2006.10728v2 [cs.CV] UPDATED)
    We introduce a simple but effective unsupervised method for generating realistic and diverse images. We train a class-conditional GAN model without using manually annotated class labels. Instead, our model is conditional on labels automatically derived from clustering in the discriminator's feature space. Our clustering step automatically discovers diverse modes, and explicitly requires the generator to cover them. Experiments on standard mode collapse benchmarks show that our method outperforms several competing methods when addressing mode collapse. Our method also performs well on large-scale datasets such as ImageNet and Places365, improving both image diversity and standard quality metrics, compared to previous methods.  ( 2 min )
    Characterizing, Detecting, and Predicting Online Ban Evasion. (arXiv:2202.05257v1 [cs.SI])
    Moderators and automated methods enforce bans on malicious users who engage in disruptive behavior. However, malicious users can easily create a new account to evade such bans. Previous research has focused on other forms of online deception, like the simultaneous operation of multiple accounts by the same entities (sockpuppetry), impersonation of other individuals, and studying the effects of de-platforming individuals and communities. Here we conduct the first data-driven study of ban evasion, i.e., the act of circumventing bans on an online platform, leading to temporally disjoint operation of accounts by the same user. We curate a novel dataset of 8,551 ban evasion pairs (parent, child) identified on Wikipedia and contrast their behavior with benign users and non-evading malicious users. We find that evasion child accounts demonstrate similarities with respect to their banned parent accounts on several behavioral axes - from similarity in usernames and edited pages to similarity in content added to the platform and its psycholinguistic attributes. We reveal key behavioral attributes of accounts that are likely to evade bans. Based on the insights from the analyses, we train logistic regression classifiers to detect and predict ban evasion at three different points in the ban evasion lifecycle. Results demonstrate the effectiveness of our methods in predicting future evaders (AUC = 0.78), early detection of ban evasion (AUC = 0.85), and matching child accounts with parent accounts (MRR = 0.97). Our work can aid moderators by reducing their workload and identifying evasion pairs faster and more efficiently than current manual and heuristic-based approaches. Dataset is available $\href{https://github.com/srijankr/ban_evasion}{\text{here}}$.  ( 2 min )
    Latent Imagination Facilitates Zero-Shot Transfer in Autonomous Racing. (arXiv:2103.04909v2 [cs.LG] UPDATED)
    World models learn behaviors in a latent imagination space to enhance the sample-efficiency of deep reinforcement learning (RL) algorithms. While learning world models for high-dimensional observations (e.g., pixel inputs) has become practicable on standard RL benchmarks and some games, their effectiveness in real-world robotics applications has not been explored. In this paper, we investigate how such agents generalize to real-world autonomous vehicle control tasks, where advanced model-free deep RL algorithms fail. In particular, we set up a series of time-lap tasks for an F1TENTH racing robot, equipped with a high-dimensional LiDAR sensor, on a set of test tracks with a gradual increase in their complexity. In this continuous-control setting, we show that model-based agents capable of learning in imagination substantially outperform model-free agents with respect to performance, sample efficiency, successful task completion, and generalization. Moreover, we show that the generalization ability of model-based agents strongly depends on the choice of their observation model. We provide extensive empirical evidence for the effectiveness of world models provided with long enough memory horizons in sim2real tasks.  ( 2 min )
    Hard to Forget: Poisoning Attacks on Certified Machine Unlearning. (arXiv:2109.08266v2 [cs.LG] UPDATED)
    The right to erasure requires removal of a user's information from data held by organizations, with rigorous interpretations extending to downstream products such as learned models. Retraining from scratch with the particular user's data omitted fully removes its influence on the resulting model, but comes with a high computational cost. Machine "unlearning" mitigates the cost incurred by full retraining: instead, models are updated incrementally, possibly only requiring retraining when approximation errors accumulate. Rapid progress has been made towards privacy guarantees on the indistinguishability of unlearned and retrained models, but current formalisms do not place practical bounds on computation. In this paper we demonstrate how an attacker can exploit this oversight, highlighting a novel attack surface introduced by machine unlearning. We consider an attacker aiming to increase the computational cost of data removal. We derive and empirically investigate a poisoning attack on certified machine unlearning where strategically designed training data triggers complete retraining when removed.  ( 2 min )
    Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging. (arXiv:2202.05265v1 [cs.LG])
    Image-to-image regression is an important learning task, used frequently in biological imaging. Current algorithms, however, do not generally offer statistical guarantees that protect against a model's mistakes and hallucinations. To address this, we develop uncertainty quantification techniques with rigorous statistical guarantees for image-to-image regression problems. In particular, we show how to derive uncertainty intervals around each pixel that are guaranteed to contain the true value with a user-specified confidence probability. Our methods work in conjunction with any base machine learning model, such as a neural network, and endow it with formal mathematical guarantees -- regardless of the true unknown data distribution or choice of model. Furthermore, they are simple to implement and computationally inexpensive. We evaluate our procedure on three image-to-image regression tasks: quantitative phase microscopy, accelerated magnetic resonance imaging, and super-resolution transmission electron microscopy of a Drosophila melanogaster brain.  ( 2 min )
    ViViT: Curvature access through the generalized Gauss-Newton's low-rank structure. (arXiv:2106.02624v2 [cs.LG] UPDATED)
    Curvature in form of the Hessian or its generalized Gauss-Newton (GGN) approximation is valuable for algorithms that rely on a local model for the loss to train, compress, or explain deep networks. Existing methods based on implicit multiplication via automatic differentiation or Kronecker-factored block diagonal approximations do not consider noise in the mini-batch. We present ViViT, a curvature model that leverages the GGN's low-rank structure without further approximations. It allows for efficient computation of eigenvalues, eigenvectors, as well as per-sample first- and second-order directional derivatives. The representation is computed in parallel with gradients in one backward pass and offers a fine-grained cost-accuracy trade-off, which allows it to scale. We demonstrate this by conducting performance benchmarks and substantiate ViViT's usefulness by studying the impact of noise on the GGN's structural properties during neural network training.  ( 2 min )
    Convolutional Neural Bandit for Visual-aware Recommendation. (arXiv:2107.07438v2 [cs.LG] UPDATED)
    Online recommendation/advertising is ubiquitous in web business. Image displaying is considered as one of the most commonly used formats to interact with customers. Contextual multi-armed bandit has shown success in the application of advertising to solve the exploration-exploitation dilemma existing in the recommendation procedure. Inspired by the visual-aware recommendation, in this paper, we propose a contextual bandit algorithm, where the convolutional neural network (CNN) is utilized to learn the reward function along with an upper confidence bound (UCB) for exploration. We also prove a near-optimal regret bound $\tilde{\mathcal{O}}(\sqrt{T})$ when the network is over-parameterized, and establish strong connections with convolutional neural tangent kernel (CNTK). Finally, we evaluate the empirical performance of the proposed algorithm and show that it outperforms other state-of-the-art UCB-based bandit algorithms on real-world image data sets.  ( 2 min )
    A Proximal Algorithm for Sampling from Non-smooth Potentials. (arXiv:2110.04597v2 [cs.LG] UPDATED)
    In this work, we examine sampling problems with non-smooth potentials. We propose a novel Markov chain Monte Carlo algorithm for sampling from non-smooth potentials. We provide a non-asymptotical analysis of our algorithm and establish a polynomial-time complexity $\tilde {\cal O}(d\varepsilon^{-1})$ to obtain $\varepsilon$ total variation distance to the target density, better than most existing results under the same assumptions. Our method is based on the proximal bundle method and an alternating sampling framework. This framework requires the so-called restricted Gaussian oracle, which can be viewed as a sampling counterpart of the proximal mapping in convex optimization. One key contribution of this work is a fast algorithm that realizes the restricted Gaussian oracle for any convex non-smooth potential with bounded Lipschitz constant.  ( 2 min )
    Analyzing and Improving Adversarial Training for Generative Modeling. (arXiv:2012.06568v2 [cs.LG] UPDATED)
    We study a new generative modeling technique based on adversarial training (AT). We show that in a setting where the model is trained to discriminate in-distribution data from adversarial examples perturbed from out-distribution samples, the model learns the support of the in-distribution data. The learning process is also closely related to MCMC-based maximum likelihood learning of energy-based models (EBMs), and can be considered as an approximate maximum likelihood learning method. We show that this AT generative model achieves competitive image generation performance to state-of-the-art EBMs, and at the same time is stable to train and has better sampling efficiency. We demonstrate that the AT generative model is well-suited for the task of image translation and worst-case out-of-distribution detection.  ( 2 min )
    Deep Learning in Random Neural Fields: Numerical Experiments via Neural Tangent Kernel. (arXiv:2202.05254v1 [cs.LG])
    A biological neural network in the cortex forms a neural field. Neurons in the field have their own receptive fields, and connection weights between two neurons are random but highly correlated when they are in close proximity in receptive fields. In this paper, we investigate such neural fields in a multilayer architecture to investigate the supervised learning of the fields. We empirically compare the performances of our field model with those of randomly connected deep networks. The behavior of a randomly connected network is investigated on the basis of the key idea of the neural tangent kernel regime, a recent development in the machine learning theory of over-parameterized networks; for most randomly connected neural networks, it is shown that global minima always exist in their small neighborhoods. We numerically show that this claim also holds for our neural fields. In more detail, our model has two structures: i) each neuron in a field has a continuously distributed receptive field, and ii) the initial connection weights are random but not independent, having correlations when the positions of neurons are close in each layer. We show that such a multilayer neural field is more robust than conventional models when input patterns are deformed by noise disturbances. Moreover, its generalization ability can be slightly superior to that of conventional models.  ( 2 min )
    FedFog: Network-Aware Optimization of Federated Learning over Wireless Fog-Cloud Systems. (arXiv:2107.02755v3 [cs.LG] UPDATED)
    Federated learning (FL) is capable of performing large distributed machine learning tasks across multiple edge users by periodically aggregating trained local parameters. To address key challenges of enabling FL over a wireless fog-cloud system (e.g., non-i.i.d. data, users' heterogeneity), we first propose an efficient FL algorithm based on Federated Averaging (called FedFog) to perform the local aggregation of gradient parameters at fog servers and global training update at the cloud. Next, we employ FedFog in wireless fog-cloud systems by investigating a novel network-aware FL optimization problem that strikes the balance between the global loss and completion time. An iterative algorithm is then developed to obtain a precise measurement of the system performance, which helps design an efficient stopping criteria to output an appropriate number of global rounds. To mitigate the straggler effect, we propose a flexible user aggregation strategy that trains fast users first to obtain a certain level of accuracy before allowing slow users to join the global training updates. Extensive numerical results using several real-world FL tasks are provided to verify the theoretical convergence of FedFog. We also show that the proposed co-design of FL and communication is essential to substantially improve resource utilization while achieving comparable accuracy of the learning model.  ( 2 min )
    ML-Quadrat & DriotData: A Model-Driven Engineering Tool and a Low-Code Platform for Smart IoT Services. (arXiv:2107.02692v3 [cs.SE] UPDATED)
    In this paper, we present ML-Quadrat, an open-source research prototype that is based on the Eclipse Modeling Framework (EMF) and the state of the art in the literature of Model-Driven Software Engineering (MDSE) for smart Cyber-Physical Systems (CPS) and the Internet of Things (IoT). Its envisioned users are mostly software developers who might not have deep knowledge and skills in the heterogeneous IoT platforms and the diverse Artificial Intelligence (AI) technologies, specifically regarding Machine Learning (ML). ML-Quadrat is released under the terms of the Apache 2.0 license on Github. Additionally, we demonstrate an early tool prototype of DriotData, a web-based Low-Code platform targeting citizen data scientists and citizen/end-user software developers. DriotData exploits and adopts ML-Quadrat in the industry by offering an extended version of it as a subscription-based service to companies, mainly Small- and Medium-Sized Enterprises (SME). The current preliminary version of DriotData has three web-based model editors: text-based, tree-/form-based and diagram-based. The latter is designed for domain experts in the problem or use case domains (namely the IoT vertical domains) who might not have knowledge and skills in the field of IT. Finally, a short video demonstrating the tools is available on YouTube: https://youtu.be/VAuz25w0a5k  ( 2 min )
    Dealing with Non-Stationarity in MARL via Trust-Region Decomposition. (arXiv:2102.10616v2 [cs.LG] UPDATED)
    Non-stationarity is one thorny issue in cooperative multi-agent reinforcement learning (MARL). One of the reasons is the policy changes of agents during the learning process. Some existing works have discussed various consequences caused by non-stationarity with several kinds of measurement indicators. This makes the objectives or goals of existing algorithms are inevitably inconsistent and disparate. In this paper, we introduce a novel notion, the $\delta$-measurement, to explicitly measure the non-stationarity of a policy sequence, which can be further proved to be bounded by the KL-divergence of consecutive joint policies. A straightforward but highly non-trivial way is to control the joint policies' divergence, which is difficult to estimate accurately by imposing the trust-region constraint on the joint policy. Although it has lower computational complexity to decompose the joint policy and impose trust-region constraints on the factorized policies, simple policy factorization like mean-field approximation will lead to more considerable policy divergence, which can be considered as the trust-region decomposition dilemma. We model the joint policy as a pairwise Markov random field and propose a trust-region decomposition network (TRD-Net) based on message passing to estimate the joint policy divergence more accurately. The Multi-Agent Mirror descent policy algorithm with Trust region decomposition, called MAMT, is established by adjusting the trust-region of the local policies adaptively in an end-to-end manner. MAMT can approximately constrain the consecutive joint policies' divergence to satisfy $\delta$-stationarity and alleviate the non-stationarity problem. Our method can bring noticeable and stable performance improvement compared with baselines in cooperative tasks of different complexity.  ( 2 min )
    Remote Contextual Bandits. (arXiv:2202.05182v1 [cs.IT])
    We consider a remote contextual multi-armed bandit (CMAB) problem, in which the decision-maker observes the context and the reward, but must communicate the actions to be taken by the agents over a rate-limited communication channel. This can model, for example, a personalized ad placement application, where the content owner observes the individual visitors to its website, and hence has the context information, but must convey the ads that must be shown to each visitor to a separate entity that manages the marketing content. In this remote CMAB (R-CMAB) problem, the constraint on the communication rate between the decision-maker and the agents imposes a trade-off between the number of bits sent per agent and the acquired average reward. We are particularly interested in characterizing the rate required to achieve sub-linear regret. Consequently, this can be considered as a policy compression problem, where the distortion metric is induced by the learning objectives. We first study the fundamental information theoretic limits of this problem by letting the number of agents go to infinity, and study the regret achieved when Thompson sampling strategy is adopted. In particular, we identify two distinct rate regions resulting in linear and sub-linear regret behavior, respectively. Then, we provide upper bounds on the achievable regret when the decision-maker can reliably transmit the policy without distortion.  ( 2 min )
    Understanding Rare Spurious Correlations in Neural Networks. (arXiv:2202.05189v1 [cs.LG])
    Neural networks are known to use spurious correlations for classification; for example, they commonly use background information to classify objects. But how many examples does it take for a network to pick up these correlations? This is the question that we empirically investigate in this work. We introduce spurious patterns correlated with a specific class to a few examples and find that it takes only a handful of such examples for the network to pick up on the spurious correlation. Through extensive experiments, we show that (1) spurious patterns with a larger $\ell_2$ norm are learnt to correlate with the specified class more easily; (2) network architectures that are more sensitive to the input are more susceptible to learning these rare spurious correlations; (3) standard data deletion methods, including incremental retraining and influence functions, are unable to forget these rare spurious correlations through deleting the examples that cause these spurious correlations to be learnt. Code available at https://github.com/yangarbiter/rare-spurious-correlation.  ( 2 min )
    A Theory of Neural Tangent Kernel Alignment and Its Influence on Training. (arXiv:2105.14301v2 [stat.ML] UPDATED)
    The training dynamics and generalization properties of neural networks (NN) can be precisely characterized in function space via the neural tangent kernel (NTK). Structural changes to the NTK during training reflect feature learning and underlie the superior performance of networks outside of the static kernel regime. In this work, we seek to theoretically understand kernel alignment, a prominent and ubiquitous structural change that aligns the NTK with the target function. We first study a toy model of kernel evolution in which the NTK evolves to accelerate training and show that alignment naturally emerges from this demand. We then study alignment mechanism in deep linear networks and two layer ReLU networks. These theories provide good qualitative descriptions of kernel alignment and specialization in practical networks and identify factors in network architecture and data structure that drive kernel alignment. In nonlinear networks with multiple outputs, we identify the phenomenon of kernel specialization, where the kernel function for each output head preferentially aligns to its own target function. Together, our results provide a mechanistic explanation of how kernel alignment emerges during NN training and a normative explanation of how it benefits training.  ( 2 min )
    From One to Many: A Deep Learning Coincident Gravitational-Wave Search. (arXiv:2108.10715v2 [astro-ph.IM] UPDATED)
    Gravitational waves from the coalescence of compact-binary sources are now routinely observed by Earth bound detectors. The most sensitive search algorithms convolve many different pre-calculated gravitational waveforms with the detector data and look for coincident matches between different detectors. Machine learning is being explored as an alternative approach to building a search algorithm that has the prospect to reduce computational costs and target more complex signals. In this work we construct a two-detector search for gravitational waves from binary black hole mergers using neural networks trained on non-spinning binary black hole data from a single detector. The network is applied to the data from both observatories independently and we check for events coincident in time between the two. This enables the efficient analysis of large quantities of background data by time-shifting the independent detector data. We find that while for a single detector the network retains $91.5\%$ of the sensitivity matched filtering can achieve, this number drops to $83.9\%$ for two observatories. To enable the network to check for signal consistency in the detectors, we then construct a set of simple networks that operate directly on data from both detectors. We find that none of these simple two-detector networks are capable of improving the sensitivity over applying networks individually to the data from the detectors and searching for time coincidences.  ( 2 min )
    P-split formulations: A class of intermediate formulations between big-M and convex hull for disjunctive constraints. (arXiv:2202.05198v1 [math.OC])
    We develop a class of mixed-integer formulations for disjunctive constraints intermediate to the big-M and convex hull formulations in terms of relaxation strength. The main idea is to capture the best of both the big-M and convex hull formulations: a computationally light formulation with a tight relaxation. The "$P$-split" formulations are based on a lifted transformation that splits convex additively separable constraints into $P$ partitions and forms the convex hull of the linearized and partitioned disjunction. We analyze the continuous relaxation of the $P$-split formulations and show that, under certain assumptions, the formulations form a hierarchy starting from a big-M equivalent and converging to the convex hull. The goal of the $P$-split formulations is to form a strong approximation of the convex hull through a computationally simpler formulation. We computationally compare the $P$-split formulations against big-M and convex hull formulations on 320 test instances. The test problems include K-means clustering, P_ball problems, and optimization over trained ReLU neural networks. The computational results show promising potential of the $P$-split formulations. For many of the test problems, $P$-split formulations are solved with a similar number of explored nodes as the convex hull formulation, while reducing the solution time by an order of magnitude and outperforming big-M both in time and number of explored nodes.  ( 2 min )
    Online Body Schema Adaptation through Cost-Sensitive Active Learning. (arXiv:2101.10892v2 [cs.RO] UPDATED)
    Humanoid robots have complex bodies and kinematic chains with several Degrees-of-Freedom (DoF) which are difficult to model. Learning the parameters of a kinematic model can be achieved by observing the position of the robot links during prospective motions and minimising the prediction errors. This work proposes a movement efficient approach for estimating online the body-schema of a humanoid robot arm in the form of Denavit-Hartenberg (DH) parameters. A cost-sensitive active learning approach based on the A-Optimality criterion is used to select optimal joint configurations. The chosen joint configurations simultaneously minimise the error in the estimation of the body schema and minimise the movement between samples. This reduces energy consumption, along with mechanical fatigue and wear, while not compromising the learning accuracy. The work was implemented in a simulation environment, using the 7DoF arm of the iCub robot simulator. The hand pose is measured with a single camera via markers placed in the palm and back of the robot's hand. A non-parametric occlusion model is proposed to avoid choosing joint configurations where the markers are not visible, thus preventing worthless attempts. The results show cost-sensitive active learning has similar accuracy to the standard active learning approach, while reducing in about half the executed movement.  ( 2 min )
    Cooperative Online Learning with Feedback Graphs. (arXiv:2106.04982v3 [cs.LG] UPDATED)
    We study the interplay between feedback and communication in a cooperative online learning setting where a network of agents solves a task in which the learners' feedback is determined by an arbitrary graph. We characterize regret in terms of the independence number of the strong product between the feedback graph and the communication network. Our analysis recovers as special cases many previously known bounds for distributed online learning with either expert or bandit feedback. A more detailed version of our results also captures the dependence of the regret on the delay caused by the time the information takes to traverse each graph. Experiments run on synthetic data show that the empirical behavior of our algorithm is consistent with the theoretical results.  ( 2 min )
    Automated Atrial Fibrillation Classification Based on Denoising Stacked Autoencoder and Optimized Deep Network. (arXiv:2202.05177v1 [eess.SP])
    The incidences of atrial fibrillation (AFib) are increasing at a daunting rate worldwide. For the early detection of the risk of AFib, we have developed an automatic detection system based on deep neural networks. For achieving better classification, it is mandatory to have good pre-processing of physiological signals. Keeping this in mind, we have proposed a two-fold study. First, an end-to-end model is proposed to denoise the electrocardiogram signals using denoising autoencoders (DAE). To achieve denoising, we have used three networks including, convolutional neural network (CNN), dense neural network (DNN), and recurrent neural networks (RNN). Compared the three models and CNN based DAE performance is found to be better than the other two. Therefore, the signals denoised by the CNN based DAE were used to train the deep neural networks for classification. Three neural networks' performance has been evaluated using accuracy, specificity, sensitivity, and signal to noise ratio (SNR) as the evaluation criteria. The proposed end-to-end deep learning model for detecting atrial fibrillation in this study has achieved an accuracy rate of 99.20%, a specificity of 99.50%, a sensitivity of 99.50%, and a true positive rate of 99.00%. The average accuracy of the algorithms we compared is 96.26%, and our algorithm's accuracy is 3.2% higher than this average of the other algorithms. The CNN classification network performed better as compared to the other two. Additionally, the model is computationally efficient for real-time applications, and it takes approx 1.3 seconds to process 24 hours ECG signal. The proposed model was also tested on unseen dataset with different proportions of arrhythmias to examine the model's robustness, which resulted in 99.10% of recall and 98.50% of precision.  ( 2 min )
    Efficacy of Transformer Networks for Classification of Raw EEG Data. (arXiv:2202.05170v1 [eess.SP])
    With the unprecedented success of transformer networks in natural language processing (NLP), recently, they have been successfully adapted to areas like computer vision, generative adversarial networks (GAN), and reinforcement learning. Classifying electroencephalogram (EEG) data has been challenging and researchers have been overly dependent on pre-processing and hand-crafted feature extraction. Despite having achieved automated feature extraction in several other domains, deep learning has not yet been accomplished for EEG. In this paper, the efficacy of the transformer network for the classification of raw EEG data (cleaned and pre-processed) is explored. The performance of transformer networks was evaluated on a local (age and gender data) and a public dataset (STEW). First, a classifier using a transformer network is built to classify the age and gender of a person with raw resting-state EEG data. Second, the classifier is tuned for mental workload classification with open access raw multi-tasking mental workload EEG data (STEW). The network achieves an accuracy comparable to state-of-the-art accuracy on both the local (Age and Gender dataset; 94.53% (gender) and 87.79% (age)) and the public (STEW dataset; 95.28% (two workload levels) and 88.72% (three workload levels)) dataset. The accuracy values have been achieved using raw EEG data without feature extraction. Results indicate that the transformer-based deep learning models can successfully abate the need for heavy feature-extraction of EEG data for successful classification.  ( 2 min )
    F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization. (arXiv:2202.05239v1 [cs.CV])
    Neural network quantization is a promising compression technique to reduce memory footprint and save energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce it, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization. This introduces a noticeable cost in terms of memory, speed, and required energy. To tackle these issues, we present F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication. To derive our method, we first discuss the advantages of fixed-point multiplication with different formats of fixed-point numbers and study the statistical behavior of the associated fixed-point numbers. Second, based on the statistical and algorithmic analysis, we apply different fixed-point formats for weights and activations of different layers. We introduce a novel algorithm to automatically determine the right format for each layer during training. Third, we analyze a previous quantization algorithm -- parameterized clipping activation (PACT) -- and reformulate it using fixed-point arithmetic. Finally, we unify the recently proposed method for quantization fine-tuning and our fixed-point approach to show the potential of our method. We verify F8Net on ImageNet for MobileNet V1/V2 and ResNet18/50. Our approach achieves comparable and better performance, when compared not only to existing quantization techniques with INT32 multiplication or floating-point arithmetic, but also to the full-precision counterparts, achieving state-of-the-art performance.  ( 2 min )
    Design of Flexible Meander Line Antenna for Healthcare for Wireless Medical Body Area Networks. (arXiv:2202.05166v1 [eess.SP])
    A flexible meander line monopole antenna (MMA) is presented in this paper. The antenna can be worn for on-and off-body applications. The overall dimension of the MMA is 37 mm x 50 mm x2.37 mm3. The MMA was manufactured and measured, and the results matched with simulation results. The MMA design shows a bandwidth of up to 1282.4 (450.5) MHz and provides gains of 3.03 (4.85) dBi in the lower and upper operating bands, respectively, showing omnidirectional radiation patterns in free space. While worn on the chest or arm, bandwidths as high as 688.9 (500.9) MHz and 1261.7 (524.2) MHz, and the gains of 3.80 (4.67) dBi and 3.00 (4.55) dBi were observed. The experimental measurements of the read range confirmed the results of the coverage range of up to 11 meters.  ( 2 min )
    REvolveR: Continuous Evolutionary Models for Robot-to-robot Policy Transfer. (arXiv:2202.05244v1 [cs.LG])
    A popular paradigm in robotic learning is to train a policy from scratch for every new robot. This is not only inefficient but also often impractical for complex robots. In this work, we consider the problem of transferring a policy across two different robots with significantly different parameters such as kinematics and morphology. Existing approaches that train a new policy by matching the action or state transition distribution, including imitation learning methods, fail due to optimal action and/or state distribution being mismatched in different robots. In this paper, we propose a novel method named $REvolveR$ of using continuous evolutionary models for robotic policy transfer implemented in a physics simulator. We interpolate between the source robot and the target robot by finding a continuous evolutionary change of robot parameters. An expert policy on the source robot is transferred through training on a sequence of intermediate robots that gradually evolve into the target robot. Experiments show that the proposed continuous evolutionary model can effectively transfer the policy across robots and achieve superior sample efficiency on new robots using a physics simulator. The proposed method is especially advantageous in sparse reward settings where exploration can be significantly reduced.  ( 2 min )
    Locating and Editing Factual Knowledge in GPT. (arXiv:2202.05262v1 [cs.CL])
    We investigate the mechanisms underlying factual knowledge recall in autoregressive transformer language models. First, we develop a causal intervention for identifying neuron activations capable of altering a model's factual predictions. Within large GPT-style models, this reveals two distinct sets of neurons that we hypothesize correspond to knowing an abstract fact and saying a concrete word, respectively. This insight inspires the development of ROME, a novel method for editing facts stored in model weights. For evaluation, we assemble CounterFact, a dataset of over twenty thousand counterfactuals and tools to facilitate sensitive measurements of knowledge editing. Using CounterFact, we confirm the distinction between saying and knowing neurons, and we find that ROME achieves state-of-the-art performance in knowledge editing compared to other methods. An interactive demo notebook, full code implementation, and the dataset are available at https://rome.baulab.info/.  ( 2 min )
    Zero Shot Learning for Predicting Energy Usage of Buildings in Sustainable Design. (arXiv:2202.05206v1 [cs.LG])
    The 2030 Challenge is aimed at making all new buildings and major renovations carbon neutral by 2030. One of the potential solutions to meet this challenge is through innovative sustainable design strategies. For developing such strategies it is important to understand how the various building factors contribute to energy usage of a building, right at design time. The growth of artificial intelligence (AI) in recent years provides an unprecedented opportunity to advance sustainable design by learning complex relationships between building factors from available data. However, rich training datasets are needed for AI-based solutions to achieve good prediction accuracy. Unfortunately, obtaining training datasets are time consuming and expensive in many real-world applications. Motivated by these reasons, we address the problem of accurately predicting the energy usage of new or unknown building types, i.e., those building types that do not have any training data. We propose a novel approach based on zero-shot learning (ZSL) to solve this problem. Our approach uses side information from building energy modeling experts to predict the closest building types for a given new/unknown building type. We then obtain the predicted energy usage for the k-closest building types using the models learned during training and combine the predicted values using a weighted averaging function. We evaluated our approach on a dataset containing five building types generated using BuildSimHub, a popular platform for building energy modeling. Our approach achieved better average accuracy than a regression model (based on XGBoost) trained on the entire dataset of known building types.  ( 2 min )
    Recursive Tree Grammar Autoencoders. (arXiv:2012.02097v3 [cs.LG] UPDATED)
    Machine learning on trees has been mostly focused on trees as input to algorithms. Much less research has investigated trees as output, which has many applications, such as molecule optimization for drug discovery, or hint generation for intelligent tutoring systems. In this work, we propose a novel autoencoder approach, called recursive tree grammar autoencoder (RTG-AE), which encodes trees via a bottom-up parser and decodes trees via a tree grammar, both learned via recursive neural networks that minimize the variational autoencoder loss. The resulting encoder and decoder can then be utilized in subsequent tasks, such as optimization and time series prediction. RTG-AEs are the first model to combine variational autoencoders, grammatical knowledge, and recursive processing. Our key message is that this unique combination of all three elements outperforms models which combine any two of the three. In particular, we perform an ablation study to show that our proposed method improves the autoencoding error, training time, and optimization score on synthetic as well as real datasets compared to four baselines. We further prove that RTG-AEs parse and generate trees in linear time and are expressive enough to handle all regular tree grammars.  ( 2 min )
    Visual Servoing for Pose Control of Soft Continuum Arm in a Structured Environment. (arXiv:2202.05200v1 [cs.RO])
    For soft continuum arms, visual servoing is a popular control strategy that relies on visual feedback to close the control loop. However, robust visual servoing is challenging as it requires reliable feature extraction from the image, accurate control models and sensors to perceive the shape of the arm, both of which can be hard to implement in a soft robot. This letter circumvents these challenges by presenting a deep neural network-based method to perform smooth and robust 3D positioning tasks on a soft arm by visual servoing using a camera mounted at the distal end of the arm. A convolutional neural network is trained to predict the actuations required to achieve the desired pose in a structured environment. Integrated and modular approaches for estimating the actuations from the image are proposed and are experimentally compared. A proportional control law is implemented to reduce the error between the desired and current image as seen by the camera. The model together with the proportional feedback control makes the described approach robust to several variations such as new targets, lighting, loads, and diminution of the soft arm. Furthermore, the model lends itself to be transferred to a new environment with minimal effort.  ( 2 min )
    Monotone Learning. (arXiv:2202.05246v1 [cs.LG])
    The amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi, and Lugosi (1996) ask whether there exists a {monotone} Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye et al (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021). Our transformation readily implies monotone learners in a variety of contexts: for example it extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).  ( 2 min )
    Conditional Diffusion Probabilistic Model for Speech Enhancement. (arXiv:2202.05256v1 [eess.AS])
    Speech enhancement is a critical component of many user-oriented audio applications, yet current systems still suffer from distorted and unnatural outputs. While generative models have shown strong potential in speech synthesis, they are still lagging behind in speech enhancement. This work leverages recent advances in diffusion probabilistic models, and proposes a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes. More specifically, we propose a generalized formulation of the diffusion probabilistic model named conditional diffusion probabilistic model that, in its reverse process, can adapt to non-Gaussian real noises in the estimated speech signal. In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models, and investigate the generalization capability of our models to other datasets with noise characteristics unseen during training.  ( 2 min )
    Adaptive and Robust Multi-task Learning. (arXiv:2202.05250v1 [stat.ML])
    We study the multi-task learning problem that aims to simultaneously analyze multiple datasets collected from different sources and learn one model for each of them. We propose a family of adaptive methods that automatically utilize possible similarities among those tasks while carefully handling their differences. We derive sharp statistical guarantees for the methods and prove their robustness against outlier tasks. Numerical experiments on synthetic and real datasets demonstrate the efficacy of our new methods.  ( 2 min )
    Radar-based Materials Classification Using Deep Wavelet Scattering Transform: A Comparison of Centimeter vs. Millimeter Wave Units. (arXiv:2202.05169v1 [eess.SP])
    Radar-based materials detection received significant attention in recent years for its potential inclusion in consumer and industrial applications like object recognition for grasping and manufacturing quality assurance and control. Several radar publications were developed for material classification under controlled settings with specific materials' properties and shapes. Recent literature has challenged the earlier findings on radars-based materials classification claiming that earlier solutions are not easily scaled to industrial applications due to a variety of real-world issues. Published experiments on the impact of these factors on the robustness of the extracted radar-based traditional features have already demonstrated that the application of deep neural networks can mitigate, to some extent, the impact to produce a viable solution. However, previous studies lacked an investigation of the usefulness of lower frequency radar units, specifically <10GHz, against the higher range units around and above 60GHz. This research considers two radar units with different frequency ranges: Walabot-3D (6.3-8 GHz) cm-wave and IMAGEVK-74 (62-69 GHz) mm-wave imaging units by Vayyar Imaging. A comparison is presented on the applicability of each unit for material classification. This work extends upon previous efforts, by applying deep wavelet scattering transform for the identification of different materials based on the reflected signals. In the wavelet scattering feature extractor, data is propagated through a series of wavelet transforms, nonlinearities, and averaging to produce low-variance representations of the reflected radar signals. This work is unique in comparison of the radar units and algorithms in material classification and includes real-time demonstrations that show strong performance by both units, with increased robustness offered by the cm-wave radar unit.  ( 2 min )
    ChemicalX: A Deep Learning Library for Drug Pair Scoring. (arXiv:2202.05240v1 [cs.LG])
    In this paper, we introduce ChemicalX, a PyTorch-based deep learning library designed for providing a range of state of the art models to solve the drug pair scoring task. The primary objective of the library is to make deep drug pair scoring models accessible to machine learning researchers and practitioners in a streamlined framework.The design of ChemicalX reuses existing high level model training utilities, geometric deep learning, and deep chemistry layers from the PyTorch ecosystem. Our system provides neural network layers, custom pair scoring architectures, data loaders, and batch iterators for end users. We showcase these features with example code snippets and case studies to highlight the characteristics of ChemicalX. A range of experiments on real world drug-drug interaction, polypharmacy side effect, and combination synergy prediction tasks demonstrate that the models available in ChemicalX are effective at solving the pair scoring task. Finally, we show that ChemicalX could be used to train and score machine learning models on large drug pair datasets with hundreds of thousands of compounds on commodity hardware.  ( 2 min )
    Deadwooding: Robust Global Pruning for Deep Neural Networks. (arXiv:2202.05226v1 [cs.LG])
    The ability of Deep Neural Networks to approximate highly complex functions is the key to their success. This benefit, however, often comes at the cost of a large model size, which challenges their deployment in resource-constrained environments. To limit this issue, pruning techniques can introduce sparsity in the models, but at the cost of accuracy and adversarial robustness. This paper addresses these critical issues and introduces Deadwooding, a novel pruning technique that exploits a Lagrangian Dual method to encourage model sparsity while retaining accuracy and ensuring robustness. The resulting model is shown to significantly outperform the state-of-the-art studies in measures of robustness and accuracy.  ( 2 min )
    Vehicle: Interfacing Neural Network Verifiers with Interactive Theorem Provers. (arXiv:2202.05207v1 [cs.LG])
    Verification of neural networks is currently a hot topic in automated theorem proving. Progress has been rapid and there are now a wide range of tools available that can verify properties of networks with hundreds of thousands of nodes. In theory this opens the door to the verification of larger control systems that make use of neural network components. However, although work has managed to incorporate the results of these verifiers to prove larger properties of individual systems, there is currently no general methodology for bridging the gap between verifiers and interactive theorem provers (ITPs). In this paper we present Vehicle, our solution to this problem. Vehicle is equipped with an expressive domain specific language for stating neural network specifications which can be compiled to both verifiers and ITPs. It overcomes previous issues with maintainability and scalability in similar ITP formalisations by using a standard ONNX file as the single canonical representation of the network. We demonstrate its utility by using it to connect the neural network verifier Marabou to Agda and then formally verifying that a car steered by a neural network never leaves the road, even in the face of an unpredictable cross wind and imperfect sensors. The network has over 20,000 nodes, and therefore this proof represents an improvement of 3 orders of magnitude over prior proofs about neural network enhanced systems in ITPs.  ( 2 min )
    Bayes Optimal Algorithm is Suboptimal in Frequentist Best Arm Identification. (arXiv:2202.05193v1 [stat.ML])
    We consider the fixed-budget best arm identification problem with Normal rewards. In this problem, the forecaster is given $K$ arms (treatments) and $T$ time steps. The forecaster attempts to find the best arm in terms of the largest mean via an adaptive experiment conducted with an algorithm. The performance of the algorithm is measured by the simple regret, or the quality of the estimated best arm. It is known that the frequentist simple regret can be exponentially small to $T$ for any fixed parameters, whereas the Bayesian simple regret is $\Theta(T^{-1})$ over a continuous prior distribution. This paper shows that Bayes optimal algorithm, which minimizes the Bayesian simple regret, does not have an exponential simple regret for some parameters. This finding contrasts with the many results indicating the asymptotic equivalence of Bayesian and frequentist algorithms in fixed sampling regimes. While the Bayes optimal algorithm is described in terms of a recursive equation that is virtually impossible to compute exactly, we pave the way to an analysis by introducing a key quantity that we call the expected Bellman improvement.  ( 2 min )
    Quantization in Layer's Input is Matter. (arXiv:2202.05137v1 [cs.LG])
    In this paper, we will show that the quantization in layer's input is more important than parameters' quantization for loss function. And the algorithm which is based on the layer's input quantization error is better than hessian-based mixed precision layout algorithm.  ( 2 min )
    Effective classification of ecg signals using enhanced convolutional neural network in iot. (arXiv:2202.05154v1 [eess.SP])
    In this paper, a novel ECG monitoring approach based on IoT technology is suggested. This paper proposes a routing system for IoT healthcare platforms based on Dynamic Source Routing (DSR) and Routing by Energy and Link Quality (REL). In addition, the Artificial Neural Network (ANN), Support Vector Machine (SVM), and Convolution Neural Networks (CNNs)-based approaches for ECG signal categorization were tested in this study. Deep-ECG will employ a deep CNN to extract important characteristics, which will then be compared using simple and fast distance functions in order to classify cardiac problems efficiently. This work has suggested algorithms for the categorization of ECG data acquired from mobile watch users in order to identify aberrant data. The Massachusetts Institute of Technology (MIT) and Beth Israel Hospital (MIT/BIH) Arrhythmia Database have been used for experimental verification of the suggested approaches. The results show that the proposed strategy outperforms others in terms of classification accuracy.  ( 2 min )
    Optimal reservoir computers for forecasting systems of nonlinear dynamics. (arXiv:2202.05159v1 [cs.LG])
    Prediction and analysis of systems of nonlinear dynamics is crucial in many applications. Here, we study characteristics and optimization of reservoir computing, a machine learning technique that has gained attention as a suitable method for this task. By systematically applying Bayesian optimization on reservoirs we show that reservoirs of low connectivity perform better than or as well as those of high connectivity in forecasting noiseless Lorenz and coupled Wilson-Cowan systems. We also show that, unexpectedly, computationally effective reservoirs of unconnected nodes (RUN) outperform reservoirs of linked network topologies in predicting these systems. In the presence of noise, reservoirs of linked nodes perform only slightly better than RUNs. In contrast to previously reported results, we find that the topology of linked reservoirs has no significance in the performance of system prediction. Based on our findings, we give a procedure for designing optimal reservoir computers (RC) for forecasting dynamical systems. This work paves way for computationally effective RCs applicable to real-time prediction of signals measured on systems of nonlinear dynamics such as EEG or MEG signals measured on a brain.  ( 2 min )
    Probabilistic learning inference of boundary value problem with uncertainties based on Kullback-Leibler divergence under implicit constraints. (arXiv:2202.05112v1 [stat.ML])
    In a first part, we present a mathematical analysis of a general methodology of a probabilistic learning inference that allows for estimating a posterior probability model for a stochastic boundary value problem from a prior probability model. The given targets are statistical moments for which the underlying realizations are not available. Under these conditions, the Kullback-Leibler divergence minimum principle is used for estimating the posterior probability measure. A statistical surrogate model of the implicit mapping, which represents the constraints, is introduced. The MCMC generator and the necessary numerical elements are given to facilitate the implementation of the methodology in a parallel computing framework. In a second part, an application is presented to illustrate the proposed theory and is also, as such, a contribution to the three-dimensional stochastic homogenization of heterogeneous linear elastic media in the case of a non-separation of the microscale and macroscale. For the construction of the posterior probability measure by using the probabilistic learning inference, in addition to the constraints defined by given statistical moments of the random effective elasticity tensor, the second-order moment of the random normalized residue of the stochastic partial differential equation has been added as a constraint. This constraint guarantees that the algorithm seeks to bring the statistical moments closer to their targets while preserving a small residue.  ( 2 min )
    SUMO: Advanced sleep spindle identification with neural networks. (arXiv:2202.05158v1 [eess.SP])
    Sleep spindles are neurophysiological phenomena that appear to be linked to memory formation and other functions of the central nervous system, and that can be observed in electroencephalographic recordings (EEG) during sleep. Manually identified spindle annotations in EEG recordings suffer from substantial intra- and inter-rater variability, even if raters have been highly trained, which reduces the reliability of spindle measures as a research and diagnostic tool. The Massive Online Data Annotation (MODA) project has recently addressed this problem by forming a consensus from multiple such rating experts, thus providing a corpus of spindle annotations of enhanced quality. Based on this dataset, we present a U-Net-type deep neural network model to automatically detect sleep spindles. Our model's performance exceeds that of the state-of-the-art detector and of most experts in the MODA dataset. We observed improved detection accuracy in subjects of all ages, including older individuals whose spindles are particularly challenging to detect reliably. Our results underline the potential of automated methods to do repetitive cumbersome tasks with super-human performance.  ( 2 min )
    Feature-level augmentation to improve robustness of deep neural networks to affine transformations. (arXiv:2202.05152v1 [cs.CV])
    Recent studies revealed that convolutional neural networks do not generalize well to small image transformations, e.g. rotations by a few degrees or translations of a few pixels. To improve the robustness to such transformations, we propose to introduce data augmentation at intermediate layers of the neural architecture, in addition to the common data augmentation applied on the input images. By introducing small perturbations to activation maps (features) at various levels, we develop the capacity of the neural network to cope with such transformations. We conduct experiments on three image classification benchmarks (Tiny ImageNet, Caltech-256 and Food-101), considering two different convolutional architectures (ResNet-18 and DenseNet-121). When compared with two state-of-the-art methods, the empirical results show that our approach consistently attains the best trade-off between accuracy and mean flip rate.  ( 2 min )
    Two-Stage Deep Anomaly Detection with Heterogeneous Time Series Data. (arXiv:2202.05093v1 [cs.AI])
    We introduce a data-driven anomaly detection framework using a manufacturing dataset collected from a factory assembly line. Given heterogeneous time series data consisting of operation cycle signals and sensor signals, we aim at discovering abnormal events. Motivated by our empirical findings that conventional single-stage benchmark approaches may not exhibit satisfactory performance under our challenging circumstances, we propose a two-stage deep anomaly detection (TDAD) framework in which two different unsupervised learning models are adopted depending on types of signals. In Stage I, we select anomaly candidates by using a model trained by operation cycle signals; in Stage II, we finally detect abnormal events out of the candidates by using another model, which is suitable for taking advantage of temporal continuity, trained by sensor signals. A distinguishable feature of our framework is that operation cycle signals are exploited first to find likely anomalous points, whereas sensor signals are leveraged to filter out unlikely anomalous points afterward. Our experiments comprehensively demonstrate the superiority over single-stage benchmark approaches, the model-agnostic property, and the robustness to difficult situations.  ( 2 min )
    Transfer-Learning Across Datasets with Different Input Dimensions: An Algorithm and Analysis for the Linear Regression Case. (arXiv:2202.05069v1 [stat.ML])
    With the development of new sensors and monitoring devices, more sources of data become available to be used as inputs for machine learning models. These can on the one hand help to improve the accuracy of a model. On the other hand however, combining these new inputs with historical data remains a challenge that has not yet been studied in enough detail. In this work, we propose a transfer-learning algorithm that combines the new and the historical data, that is especially beneficial when the new data is scarce. We focus the approach on the linear regression case, which allows us to conduct a rigorous theoretical study on the benefits of the approach. We show that our approach is robust against negative transfer-learning, and we confirm this result empirically with real and simulated data.  ( 2 min )
    AD-NEGF: An End-to-End Differentiable Quantum Transport Simulator for Sensitivity Analysis and Inverse Problems. (arXiv:2202.05098v1 [cond-mat.mes-hall])
    Since proposed in the 70s, the Non-Equilibrium Green Function (NEGF) method has been recognized as a standard approach to quantum transport simulations. Although it achieves superiority in simulation accuracy, the tremendous computational cost makes it unbearable for high-throughput simulation tasks such as sensitivity analysis, inverse design, etc. In this work, we propose AD-NEGF, to our best knowledge the first end-to-end differentiable NEGF model for quantum transport simulations. We implement the entire numerical process in PyTorch, and design customized backward pass with implicit layer techniques, which provides gradient information at an affordable cost while guaranteeing the correctness of the forward simulation. The proposed model is validated with applications in calculating differential physical quantities, empirical parameter fitting, and doping optimization, which demonstrates its capacity to accelerate the material design process by conducting gradient-based parameter optimization.  ( 2 min )
    Deep learning for drug repurposing: methods, databases, and applications. (arXiv:2202.05145v1 [q-bio.BM])
    Drug development is time-consuming and expensive. Repurposing existing drugs for new therapies is an attractive solution that accelerates drug development at reduced experimental costs, specifically for Coronavirus Disease 2019 (COVID-19), an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, comprehensively obtaining and productively integrating available knowledge and big biomedical data to effectively advance deep learning models is still challenging for drug repurposing in other complex diseases. In this review, we introduce guidelines on how to utilize deep learning methodologies and tools for drug repurposing. We first summarized the commonly used bioinformatics and pharmacogenomics databases for drug repurposing. Next, we discuss recently developed sequence-based and graph-based representation approaches as well as state-of-the-art deep learning-based methods. Finally, we present applications of drug repurposing to fight the COVID-19 pandemic, and outline its future challenges.  ( 2 min )
    Game of Privacy: Towards Better Federated Platform Collaboration under Privacy Restriction. (arXiv:2202.05139v1 [cs.LG])
    Vertical federated learning (VFL) aims to train models from cross-silo data with different feature spaces stored on different platforms. Existing VFL methods usually assume all data on each platform can be used for model training. However, due to the intrinsic privacy risks of federated learning, the total amount of involved data may be constrained. In addition, existing VFL studies usually assume only one platform has task labels and can benefit from the collaboration, making it difficult to attract other platforms to join in the collaborative learning. In this paper, we study the platform collaboration problem in VFL under privacy constraint. We propose to incent different platforms through a reciprocal collaboration, where all platforms can exploit multi-platform information in the VFL framework to benefit their own tasks. With limited privacy budgets, each platform needs to wisely allocate its data quotas for collaboration with other platforms. Thereby, they naturally form a multi-party game. There are two core problems in this game, i.e., how to appraise other platforms' data value to compute game rewards and how to optimize policies to solve the game. To evaluate the contributions of other platforms' data, each platform offers a small amount of "deposit" data to participate in the VFL. We propose a performance estimation method to predict the expected model performance when involving different amount combinations of inter-platform data. To solve the game, we propose a platform negotiation method that simulates the bargaining among platforms and locally optimizes their policies via gradient descent. Extensive experiments on two real-world datasets show that our approach can effectively facilitate the collaborative exploitation of multi-platform data in VFL under privacy restrictions.  ( 2 min )
    Deep Learning for Computational Cytology: A Survey. (arXiv:2202.05126v1 [eess.IV])
    Computational cytology is a critical, rapid-developing, yet challenging topic in the field of medical image computing which analyzes the digitized cytology image by computer-aided technologies for cancer screening. Recently, an increasing number of deep learning (DL) algorithms have made significant progress in medical image analysis, leading to the boosting publications of cytological studies. To investigate the advanced methods and comprehensive applications, we survey more than 120 publications of DL-based cytology image analysis in this article. We first introduce various deep learning methods, including fully supervised, weakly supervised, unsupervised, and transfer learning. Then, we systematically summarize the public datasets, evaluation metrics, versatile cytology image analysis applications including classification, detection, segmentation, and other related tasks. Finally, we discuss current challenges and potential research directions of computational cytology.  ( 2 min )
    Reinforcement Learning in the Wild: Scalable RL Dispatching Algorithm Deployed in Ridehailing Marketplace. (arXiv:2202.05118v1 [cs.LG])
    In this study, a real-time dispatching algorithm based on reinforcement learning is proposed and for the first time, is deployed in large scale. Current dispatching methods in ridehailing platforms are dominantly based on myopic or rule-based non-myopic approaches. Reinforcement learning enables dispatching policies that are informed of historical data and able to employ the learned information to optimize returns of expected future trajectories. Previous studies in this field yielded promising results, yet have left room for further improvements in terms of performance gain, self-dependency, transferability, and scalable deployment mechanisms. The present study proposes a standalone RL-based dispatching solution that is equipped with multiple mechanisms to ensure robust and efficient on-policy learning and inference while being adaptable for full-scale deployment. A new form of value updating based on temporal difference is proposed that is more adapted to the inherent uncertainty of the problem. For the driver-order assignment, a customized utility function is proposed that when tuned based on the statistics of the market, results in remarkable performance improvement and interpretability. In addition, for reducing the risk of cancellation after drivers' assignment, an adaptive graph pruning strategy based on the multi-arm bandit problem is introduced. The method is evaluated using offline simulation with real data and yields notable performance improvement. In addition, the algorithm is deployed online in multiple cities under DiDi's operation for A/B testing and is launched in one of the major international markets as the primary mode of dispatch. The deployed algorithm shows over 1.3% improvement in total driver income from A/B testing. In addition, by causal inference analysis, as much as 5.3% improvement in major performance metrics is detected after full-scale deployment.  ( 2 min )
    Mixture-of-Rookies: Saving DNN Computations by Predicting ReLU Outputs. (arXiv:2202.04990v1 [cs.AR])
    Deep Neural Networks (DNNs) are widely used in many applications domains. However, they require a vast amount of computations and memory accesses to deliver outstanding accuracy. In this paper, we propose a scheme to predict whether the output of each ReLu activated neuron will be a zero or a positive number in order to skip the computation of those neurons that will likely output a zero. Our predictor, named Mixture-of-Rookies, combines two inexpensive components. The first one exploits the high linear correlation between binarized (1-bit) and full-precision (8-bit) dot products, whereas the second component clusters together neurons that tend to output zero at the same time. We propose a novel clustering scheme based on the analysis of angles, as the sign of the dot product of two vectors depends on the cosine of the angle between them. We implement our hybrid zero output predictor on top of a state-of-the-art DNN accelerator. Experimental results show that our scheme introduces a small area overhead of 5.3% while achieving a speedup of 1.2x and reducing energy consumption by 16.5% on average for a set of diverse DNNs.  ( 2 min )
    Equivariance Regularization for Image Reconstruction. (arXiv:2202.05062v1 [math.OC])
    In this work, we propose Regularization-by-Equivariance (REV), a novel structure-adaptive regularization scheme for solving imaging inverse problems under incomplete measurements. Our regularization scheme utilizes the equivariant structure in the physics of the measurements -- which is prevalent in many inverse problems such as tomographic image reconstruction -- to mitigate the ill-poseness of the inverse problem. Our proposed scheme can be applied in a plug-and-play manner alongside with any classic first-order optimization algorithm such as the accelerated gradient descent/FISTA for simplicity and fast convergence. Our numerical experiments in sparse-view X-ray CT image reconstruction tasks demonstrate the effectiveness of our approach.  ( 2 min )
    Fair When Trained, Unfair When Deployed: Observable Fairness Measures are Unstable in Performative Prediction Settings. (arXiv:2202.05049v1 [stat.ML])
    Many popular algorithmic fairness measures depend on the joint distribution of predictions, outcomes, and a sensitive feature like race or gender. These measures are sensitive to distribution shift: a predictor which is trained to satisfy one of these fairness definitions may become unfair if the distribution changes. In performative prediction settings, however, predictors are precisely intended to induce distribution shift. For example, in many applications in criminal justice, healthcare, and consumer finance, the purpose of building a predictor is to reduce the rate of adverse outcomes such as recidivism, hospitalization, or default on a loan. We formalize the effect of such predictors as a type of concept shift-a particular variety of distribution shift-and show both theoretically and via simulated examples how this causes predictors which are fair when they are trained to become unfair when they are deployed. We further show how many of these issues can be avoided by using fairness definitions that depend on counterfactual rather than observable outcomes.  ( 2 min )
    Backpropagation Clipping for Deep Learning with Differential Privacy. (arXiv:2202.05089v1 [cs.LG])
    We present backpropagation clipping, a novel variant of differentially private stochastic gradient descent (DP-SGD) for privacy-preserving deep learning. Our approach clips each trainable layer's inputs (during the forward pass) and its upstream gradients (during the backward pass) to ensure bounded global sensitivity for the layer's gradient; this combination replaces the gradient clipping step in existing DP-SGD variants. Our approach is simple to implement in existing deep learning frameworks. The results of our empirical evaluation demonstrate that backpropagation clipping provides higher accuracy at lower values for the privacy parameter $\epsilon$ compared to previous work. We achieve 98.7% accuracy for MNIST with $\epsilon = 0.07$ and 74% accuracy for CIFAR-10 with $\epsilon = 3.64$.  ( 2 min )
    Unaligned but Safe -- Formally Compensating Performance Limitations for Imprecise 2D Object Detection. (arXiv:2202.05123v1 [cs.LG])
    In this paper, we consider the imperfection within machine learning-based 2D object detection and its impact on safety. We address a special sub-type of performance limitations: the prediction bounding box cannot be perfectly aligned with the ground truth, but the computed Intersection-over-Union metric is always larger than a given threshold. Under such type of performance limitation, we formally prove the minimum required bounding box enlargement factor to cover the ground truth. We then demonstrate that the factor can be mathematically adjusted to a smaller value, provided that the motion planner takes a fixed-length buffer in making its decisions. Finally, observing the difference between an empirically measured enlargement factor and our formally derived worst-case enlargement factor offers an interesting connection between the quantitative evidence (demonstrated by statistics) and the qualitative evidence (demonstrated by worst-case analysis).  ( 2 min )
    Graph Neural Network for Local Corruption Recovery. (arXiv:2202.04936v1 [cs.LG])
    Graph neural networks (GNNs) have seen a surge of development for exploiting the relational information of input graphs. Nevertheless, messages propagating through a graph contain both interpretable patterns and small perturbations. Despite global noise could be distributed over the entire graph data, it is not uncommon that corruptions appear well-concealed and merely pollute local regions while still having a vital influence on the GNN learning and prediction performance. This work tackles the graph recovery problem from local poisons by a robustness representation learning. Our developed strategy identifies regional graph perturbations and formulates a robust hidden feature representation for GNNs. A mask function pinpointed the anomalies without prior knowledge, and an $\ell_{p,q}$ regularizer defends local poisonings through pursuing sparsity in the framelet domain while maintaining a conditional closeness between the observation and new representation. The proposed robust computational unit alleviates the inertial alternating direction method of multipliers to achieve an efficient solution. Extensive experiments show that our new model recovers graph representations from local pollution and achieves excellent performance.  ( 2 min )
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v1 [stat.ML])
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent causal bandit algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.  ( 2 min )
    Random Forests Weighted Local Fr\'echet Regression with Theoretical Guarantee. (arXiv:2202.04912v1 [stat.ML])
    Statistical analysis is increasingly confronted with complex data from general metric spaces, such as symmetric positive definite matrix-valued data and probability distribution functions. [47] and [17] establish a general paradigm of Fr\'echet regression with complex metric space valued responses and Euclidean predictors. However, their proposed local Fr\'echet regression approach involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forests weighted local Fr\'echet regression paradigm. The main mechanism of our approach relies on the adaptive kernels generated by random forests. Our first method utilizes these weights as the local average to solve the Fr\'echet mean, while the second method performs local linear Fr\'echet regression, making both methods locally adaptive. Our proposals significantly improve existing Fr\'echet regression methods. Based on the theory of infinite order U-processes and infinite order Mmn-estimator, we establish the consistency, rate of convergence, and asymptotic normality for our proposed random forests weighted Fr\'echet regression estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our proposed two methods for Fr\'echet regression with several commonly encountered types of responses such as probability distribution functions, symmetric positive definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to the human mortality distribution data.  ( 2 min )
    Energy-Based Contrastive Learning of Visual Representations. (arXiv:2202.04933v1 [cs.LG])
    Contrastive learning is a method of learning visual representations by training Deep Neural Networks (DNNs) to increase the similarity between representations of positive pairs and reduce the similarity between representations of negative pairs. However, contrastive methods usually require large datasets with significant number of negative pairs per iteration to achieve reasonable performance on downstream tasks. To address this problem, here we propose Energy-Based Contrastive Learning (EBCLR) that combines contrastive learning with Energy-Based Models (EBMs) and can be theoretically interpreted as learning the joint distribution of positive pairs. Using a novel variant of Stochastic Gradient Langevin Dynamics (SGLD) to accelerate the training of EBCLR, we show that EBCLR is far more sample-efficient than previous self-supervised learning methods. Specifically, EBCLR shows from X4 up to X20 acceleration compared to SimCLR and MoCo v2 in terms of training epochs. Furthermore, in contrast to SimCLR, EBCLR achieves nearly the same performance with 254 negative pairs (batch size 128) and 30 negative pairs (batch size 16) per positive pair, demonstrating the robustness of EBCLR to small number of negative pairs.  ( 2 min )
    Generalization Bounds via Convex Analysis. (arXiv:2202.04985v1 [stat.ML])
    Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of $p$-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.  ( 2 min )
    DDA3C: Cooperative Distributed Deep Reinforcement Learning in A Group-Agent System. (arXiv:2202.05135v1 [cs.LG])
    It can largely benefit the reinforcement learning process of each agent if multiple agents perform their separate reinforcement learning tasks cooperatively. These tasks can be not exactly the same but still benefit from the communication behaviour between agents due to task similarities. In fact, this learning scenario is not well understood yet and not well formulated. As the first effort, we provide a detailed discussion of this scenario, and propose group-agent reinforcement learning as a formulation of the reinforcement learning problem under this scenario and a third type of reinforcement learning problem with respect to single-agent and multi-agent reinforcement learning. We propose that it can be solved with the help of modern deep reinforcement learning techniques and provide a distributed deep reinforcement learning algorithm called DDA3C (Decentralised Distributed Asynchronous Advantage Actor-Critic) that is the first framework designed for group-agent reinforcement learning. We show through experiments in the CartPole-v0 game environment that DDA3C achieved desirable performance and has good scalability.  ( 2 min )
    Near-Optimal Statistical Query Lower Bounds for Agnostically Learning Intersections of Halfspaces with Gaussian Marginals. (arXiv:2202.05096v1 [cs.LG])
    We consider the well-studied problem of learning intersections of halfspaces under the Gaussian distribution in the challenging \emph{agnostic learning} model. Recent work of Diakonikolas et al. (2021) shows that any Statistical Query (SQ) algorithm for agnostically learning the class of intersections of $k$ halfspaces over $\mathbb{R}^n$ to constant excess error either must make queries of tolerance at most $n^{-\tilde{\Omega}(\sqrt{\log k})}$ or must make $2^{n^{\Omega(1)}}$ queries. We strengthen this result by improving the tolerance requirement to $n^{-\tilde{\Omega}(\log k)}$. This lower bound is essentially best possible since an SQ algorithm of Klivans et al. (2008) agnostically learns this class to any constant excess error using $n^{O(\log k)}$ queries of tolerance $n^{-O(\log k)}$. We prove two variants of our lower bound, each of which combines ingredients from Diakonikolas et al. (2021) with (an extension of) a different earlier approach for agnostic SQ lower bounds for the Boolean setting due to Dachman-Soled et al. (2014). Our approach also yields lower bounds for agnostically SQ learning the class of "convex subspace juntas" (studied by Vempala, 2010) and the class of sets with bounded Gaussian surface area; all of these lower bounds are nearly optimal since they essentially match known upper bounds from Klivans et al. (2008).  ( 2 min )
    Feasible Low-thrust Trajectory Identification via a Deep Neural Network Classifier. (arXiv:2202.04962v1 [math.OC])
    In recent years, deep learning techniques have been introduced into the field of trajectory optimization to improve convergence and speed. Training such models requires large trajectory datasets. However, the convergence of low thrust (LT) optimizations is unpredictable before the optimization process ends. For randomly initialized low thrust transfer data generation, most of the computation power will be wasted on optimizing infeasible low thrust transfers, which leads to an inefficient data generation process. This work proposes a deep neural network (DNN) classifier to accurately identify feasible LT transfer prior to the optimization process. The DNN-classifier achieves an overall accuracy of 97.9%, which has the best performance among the tested algorithms. The accurate low-thrust trajectory feasibility identification can avoid optimization on undesired samples, so that the majority of the optimized samples are LT trajectories that converge. This technique enables efficient dataset generation for different mission scenarios with different spacecraft configurations.  ( 2 min )
    AA-TransUNet: Attention Augmented TransUNet For Nowcasting Tasks. (arXiv:2202.04996v1 [cs.LG])
    Data driven modeling based approaches have recently gained a lot of attention in many challenging meteorological applications including weather element forecasting. This paper introduces a novel data-driven predictive model based on TransUNet for precipitation nowcasting task. The TransUNet model which combines the Transformer and U-Net models has been previously successfully applied in medical segmentation tasks. Here, TransUNet is used as a core model and is further equipped with Convolutional Block Attention Modules (CBAM) and Depthwise-separable Convolution (DSC). The proposed Attention Augmented TransUNet (AA-TransUNet) model is evaluated on two distinct datasets: the Dutch precipitation map dataset and the French cloud cover dataset. The obtained results show that the proposed model outperforms other examined models on both tested datasets. Furthermore, the uncertainty analysis of the proposed AA-TransUNet is provided to give additional insights on its predictions.  ( 2 min )
    Loss-guided Stability Selection. (arXiv:2202.04956v1 [cs.LG])
    In modern data analysis, sparse model selection becomes inevitable once the number of predictors variables is very high. It is well-known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models, based on subsamples of the training data, followed by choosing a stable predictor set which is usually much sparser than the predictor sets from the raw models. The standard Stability Selection is based on a global criterion, namely the per-family error rate, while additionally requiring expert knowledge to suitably configure the hyperparameters. Since model selection depends on the loss function, i.e., predictor sets selected w.r.t. some particular loss function differ from those selected w.r.t. some other loss function, we propose a Stability Selection variant which respects the chosen loss function via an additional validation step based on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, our Stability Selection variants can avoid the issue of severe underfitting which affects the original Stability Selection for noisy high-dimensional data, so our priority is not to avoid false positives at all costs but to result in a sparse stable model with which one can make predictions. Experiments where we consider both regression and binary classification and where we use Boosting as model selection algorithm reveal a significant precision improvement compared to raw Boosting models while not suffering from any of the mentioned issues of the original Stability Selection.  ( 2 min )
    Settling the Communication Complexity for Distributed Offline Reinforcement Learning. (arXiv:2202.04862v1 [stat.ML])
    We study a novel setting in offline reinforcement learning (RL) where a number of distributed machines jointly cooperate to solve the problem but only one single round of communication is allowed and there is a budget constraint on the total number of information (in terms of bits) that each machine can send out. For value function prediction in contextual bandits, and both episodic and non-episodic MDPs, we establish information-theoretic lower bounds on the minimax risk for distributed statistical estimators; this reveals the minimum amount of communication required by any offline RL algorithms. Specifically, for contextual bandits, we show that the number of bits must scale at least as $\Omega(AC)$ to match the centralised minimax optimal rate, where $A$ is the number of actions and $C$ is the context dimension; meanwhile, we reach similar results in the MDP settings. Furthermore, we develop learning algorithms based on least-squares estimates and Monte-Carlo return estimates and provide a sharp analysis showing that they can achieve optimal risk up to logarithmic factors. Additionally, we also show that temporal difference is unable to efficiently utilise information from all available devices under the single-round communication setting due to the initial bias of this method. To our best knowledge, this paper presents the first minimax lower bounds for distributed offline RL problems.  ( 2 min )
    Online Learning for Min Sum Set Cover and Pandora's Box. (arXiv:2202.04870v1 [cs.LG])
    Two central problems in Stochastic Optimization are Min Sum Set Cover and Pandora's Box. In Pandora's Box, we are presented with n boxes, each containing an unknown value and the goal is to open the boxes in some order to minimize the sum of the search cost and the smallest value found. Given a distribution of value vectors, we are asked to identify a near-optimal search order. Min Sum Set Cover corresponds to the case where values are either 0 or infinity. In this work, we study the case where the value vectors are not drawn from a distribution but are presented to a learner in an online fashion. We present a computationally efficient algorithm that is constant-competitive against the cost of the optimal search order. We extend our results to a bandit setting where only the values of the boxes opened are revealed to the learner after every round. We also generalize our results to other commonly studied variants of Pandora's Box and Min Sum Set Cover that involve selecting more than a single value subject to a matroid constraint.  ( 2 min )
    Instance-wise algorithm configuration with graph neural networks. (arXiv:2202.04910v1 [cs.LG])
    We present our submission for the configuration task of the Machine Learning for Combinatorial Optimization (ML4CO) NeurIPS 2021 competition. The configuration task is to predict a good configuration of the open-source solver SCIP to solve a mixed integer linear program (MILP) efficiently. We pose this task as a supervised learning problem: First, we compile a large dataset of the solver performance for various configurations and all provided MILP instances. Second, we use this data to train a graph neural network that learns to predict a good configuration for a specific instance. The submission was tested on the three problem benchmarks of the competition and improved solver performance over the default by 12% and 35% and 8% across the hidden test instances. We ranked 3rd out of 15 on the global leaderboard and won the student leaderboard. We make our code publicly available at \url{https://github.com/RomeoV/ml4co-competition} .  ( 2 min )
    Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products. (arXiv:2202.05120v1 [cs.DS])
    We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Here, given access to a matrix $A$ through matrix-vector products, an accuracy parameter $\epsilon$, and a target rank $k$, the goal is to find a rank-$k$ matrix $Z$ with orthonormal columns such that $\| A(I -ZZ^\top)\|_{S_p} \leq (1+\epsilon)\min_{U^\top U = I_k} \|A(I - U U^\top)\|_{S_p}$, where $\|M\|_{S_p}$ denotes the $\ell_p$ norm of the the singular values of $M$. For the special cases of $p=2$ (Frobenius norm) and $p = \infty$ (Spectral norm), Musco and Musco (NeurIPS 2015) obtained an algorithm based on Krylov methods that uses $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products, improving on the na\"ive $\tilde{O}(k/\epsilon)$ dependence obtainable by the power method, where $\tilde{O}$ suppresses poly$(\log(dk/\epsilon))$ factors. Our main result is an algorithm that uses only $\tilde{O}(kp^{1/6}/\epsilon^{1/3})$ matrix-vector products, and works for all $p \geq 1$. For $p = 2$ our bound improves the previous $\tilde{O}(k/\epsilon^{1/2})$ bound to $\tilde{O}(k/\epsilon^{1/3})$. Since the Schatten-$p$ and Schatten-$\infty$ norms are the same up to a $1+ \epsilon$ factor when $p \geq (\log d)/\epsilon$, our bound recovers the result of Musco and Musco for $p = \infty$. Further, we prove a matrix-vector query lower bound of $\Omega(1/\epsilon^{1/3})$ for any fixed constant $p \geq 1$, showing that surprisingly $\tilde{\Theta}(1/\epsilon^{1/3})$ is the optimal complexity for constant~$k$. To obtain our results, we introduce several new techniques, including optimizing over multiple Krylov subspaces simultaneously, and pinching inequalities for partitioned operators. Our lower bound for $p \in [1,2]$ uses the Araki-Lieb-Thirring trace inequality, whereas for $p>2$, we appeal to a norm-compression inequality for aligned partitioned operators.  ( 2 min )
    On characterizations of learnability with computable learners. (arXiv:2202.05041v1 [cs.LG])
    We study computable PAC (CPAC) learning as introduced by Agarwal et al. (2020). First, we consider the main open question of finding characterizations of proper and improper CPAC learning. We give a characterization of a closely related notion of strong CPAC learning, and we provide a negative answer to the open problem posed by Agarwal et al. (2021) whether all decidable PAC learnable classes are improperly CPAC learnable. Second, we consider undecidability of (computable) PAC learnability. We give a simple and general argument to exhibit such undecidability, and we initiate a study of the arithmetical complexity of learnability. We briefly discuss the relation to the undecidability result of Ben-David et al. (2019), that motivated the work of Agarwal et al.  ( 2 min )
    SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition. (arXiv:2202.04849v1 [cs.LG])
    Though many reinforcement learning (RL) problems involve learning policies in settings with difficult-to-specify safety constraints and sparse rewards, current methods struggle to acquire successful and safe policies. Methods that extract useful policy primitives from offline datasets using generative modeling have recently shown promise at accelerating RL in these more complex settings. However, we discover that current primitive-learning methods may not be well-equipped for safe policy learning and may promote unsafe behavior due to their tendency to ignore data from undesirable behaviors. To overcome these issues, we propose SAFEty skill pRiors (SAFER), an algorithm that accelerates policy learning on complex control tasks under safety constraints. Through principled training on an offline dataset, SAFER learns to extract safe primitive skills. In the inference stage, policies trained with SAFER learn to compose safe skills into successful policies. We theoretically characterize why SAFER can enforce safe policy learning and demonstrate its effectiveness on several complex safety-critical robotic grasping tasks inspired by the game Operation, in which SAFER outperforms baseline methods in learning successful policies and enforcing safety.  ( 2 min )
    Semi-Supervised Convolutive NMF for Automatic Music Transcription. (arXiv:2202.04989v1 [cs.SD])
    Automatic Music Transcription, which consists in transforming an audio recording of a musical performance into symbolic format, remains a difficult Music Information Retrieval task. In this work, we propose a semi-supervised approach using low-rank matrix factorization techniques, in particular Convolutive Nonnegative Matrix Factorization. In the semi-supervised setting, only a single recording of each individual notes is required. We show on the MAPS dataset that the proposed semi-supervised CNMF method performs better than state-of-the-art low-rank factorization techniques and a little worse than supervised deep learning state-of-the-art methods, while however suffering from generalization issues.  ( 2 min )
    Cross-speaker style transfer for text-to-speech using data augmentation. (arXiv:2202.05083v1 [eess.AS])
    We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker's identity. The proposed approach relies on voice conversion to first generate high-quality data from the set of supporting expressive speakers. The voice converted data is then pooled with natural data from the target speaker and used to train a single-speaker multi-style TTS system. We provide evidence that this approach is efficient, flexible, and scalable. The method is evaluated using one or more supporting speakers, as well as a variable amount of supporting data. We further provide evidence that this approach allows some controllability of speaking style, when using multiple supporting speakers. We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages. Results indicate that our technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker's identity.  ( 2 min )
    Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory. (arXiv:2202.04970v1 [stat.ML])
    Off-Policy Evaluation (OPE) serves as one of the cornerstones in Reinforcement Learning (RL). Fitted Q Evaluation (FQE) with various function approximators, especially deep neural networks, has gained practical success. While statistical analysis has proved FQE to be minimax-optimal with tabular, linear and several nonparametric function families, its practical performance with more general function approximator is less theoretically understood. We focus on FQE with general differentiable function approximators, making our theory applicable to neural function approximations. We approach this problem using the Z-estimation theory and establish the following results: The FQE estimation error is asymptotically normal with explicit variance determined jointly by the tangent space of the function class at the ground truth, the reward structure, and the distribution shift due to off-policy learning; The finite-sample FQE error bound is dominated by the same variance term, and it can also be bounded by function class-dependent divergence, which measures how the off-policy distribution shift intertwines with the function approximator. In addition, we study bootstrapping FQE estimators for error distribution inference and estimating confidence intervals, accompanied by a Cramer-Rao lower bound that matches our upper bounds. The Z-estimation analysis provides a generalizable theoretical framework for studying off-policy estimation in RL and provides sharp statistical theory for FQE with differentiable function approximators.  ( 2 min )
    Decreasing Annotation Burden of Pairwise Comparisons with Human-in-the-Loop Sorting: Application in Medical Image Artifact Rating. (arXiv:2202.04823v1 [q-bio.QM])
    Ranking by pairwise comparisons has shown improved reliability over ordinal classification. However, as the annotations of pairwise comparisons scale quadratically, this becomes less practical when the dataset is large. We propose a method for reducing the number of pairwise comparisons required to rank by a quantitative metric, demonstrating the effectiveness of the approach in ranking medical images by image quality in this proof of concept study. Using the medical image annotation software that we developed, we actively subsample pairwise comparisons using a sorting algorithm with a human rater in the loop. We find that this method substantially reduces the number of comparisons required for a full ordinal ranking without compromising inter-rater reliability when compared to pairwise comparisons without sorting.  ( 2 min )
    Barwise Compression Schemes for Audio-Based Music Structure Analysis. (arXiv:2202.04981v1 [cs.SD])
    Music Structure Analysis (MSA) consists in segmenting a music piece in several distinct sections. We approach MSA within a compression framework, under the hypothesis that the structure is more easily revealed by a simplified representation of the original content of the song. More specifically, under the hypothesis that MSA is correlated with similarities occurring at the bar scale, linear and non-linear compression schemes can be applied to barwise audio signals. Compressed representations capture the most salient components of the different bars in the song and are then used to infer the song structure using a dynamic programming algorithm. This work explores both low-rank approximation models such as Principal Component Analysis or Nonnegative Matrix Factorization and "piece-specific" Auto-Encoding Neural Networks, with the objective to learn latent representations specific to a given song. Such approaches do not rely on supervision nor annotations, which are well-known to be tedious to collect and possibly ambiguous in MSA description. In our experiments, several unsupervised compression schemes achieve a level of performance comparable to that of state-of-the-art supervised methods (for 3s tolerance) on the RWC-Pop dataset, showcasing the importance of the barwise compression processing for MSA.  ( 2 min )
    PCENet: High Dimensional Deep Surrogate Modeling. (arXiv:2202.05063v1 [cs.LG])
    Learning data representations under uncertainty is an important task that emerges in numerous machine learning applications. However, uncertainty quantification (UQ) techniques are computationally intensive and become prohibitively expensive for high-dimensional data. In this paper, we present a novel surrogate model for representation learning and uncertainty quantification, which aims to deal with data of moderate to high dimensions. The proposed model combines a neural network approach for dimensionality reduction of the (potentially high-dimensional) data, with a surrogate model method for learning the data distribution. We first employ a variational autoencoder (VAE) to learn a low-dimensional representation of the data distribution. We then propose to harness polynomial chaos expansion (PCE) formulation to map this distribution to the output target. The coefficients of PCE are learned from the distribution representation of the training data using a maximum mean discrepancy (MMD) approach. Our model enables us to (a) learn a representation of the data, (b) estimate uncertainty in the high-dimensional data system, and (c) match high order moments of the output distribution; without any prior statistical assumptions on the data. Numerical experimental results are presented to illustrate the performance of the proposed method.  ( 2 min )
    Interpretable pipelines with evolutionarily optimized modules for RL tasks with visual inputs. (arXiv:2202.04943v1 [cs.LG])
    The importance of explainability in AI has become a pressing concern, for which several explainable AI (XAI) approaches have been recently proposed. However, most of the available XAI techniques are post-hoc methods, which however may be only partially reliable, as they do not reflect exactly the state of the original models. Thus, a more direct way for achieving XAI is through interpretable (also called glass-box) models. These models have been shown to obtain comparable (and, in some cases, better) performance with respect to black-boxes models in various tasks such as classification and reinforcement learning. However, they struggle when working with raw data, especially when the input dimensionality increases and the raw inputs alone do not give valuable insights on the decision-making process. Here, we propose to use end-to-end pipelines composed of multiple interpretable models co-optimized by means of evolutionary algorithms, that allows us to decompose the decision-making process into two parts: computing high-level features from raw data, and reasoning on the extracted high-level features. We test our approach in reinforcement learning environments from the Atari benchmark, where we obtain comparable results (with respect to black-box approaches) in settings without stochastic frame-skipping, while performance degrades in frame-skipping settings.  ( 2 min )
    Understanding Value Decomposition Algorithms in Deep Cooperative Multi-Agent Reinforcement Learning. (arXiv:2202.04868v1 [cs.LG])
    Value function decomposition is becoming a popular rule of thumb for scaling up multi-agent reinforcement learning (MARL) in cooperative games. For such a decomposition rule to hold, the assumption of the individual-global max (IGM) principle must be made; that is, the local maxima on the decomposed value function per every agent must amount to the global maximum on the joint value function. This principle, however, does not have to hold in general. As a result, the applicability of value decomposition algorithms is concealed and their corresponding convergence properties remain unknown. In this paper, we make the first effort to answer these questions. Specifically, we introduce the set of cooperative games in which the value decomposition methods find their validity, which is referred as decomposable games. In decomposable games, we theoretically prove that applying the multi-agent fitted Q-Iteration algorithm (MA-FQI) will lead to an optimal Q-function. In non-decomposable games, the estimated Q-function by MA-FQI can still converge to the optimum under the circumstance that the Q-function needs projecting into the decomposable function space at each iteration. In both settings, we consider value function representations by practical deep neural networks and derive their corresponding convergence rates. To summarize, our results, for the first time, offer theoretical insights for MARL practitioners in terms of when value decomposition algorithms converge and why they perform well.  ( 2 min )
    Differential Private Knowledge Transfer for Privacy-Preserving Cross-Domain Recommendation. (arXiv:2202.04893v1 [cs.LG])
    Cross Domain Recommendation (CDR) has been popularly studied to alleviate the cold-start and data sparsity problem commonly existed in recommender systems. CDR models can improve the recommendation performance of a target domain by leveraging the data of other source domains. However, most existing CDR models assume information can directly 'transfer across the bridge', ignoring the privacy issues. To solve the privacy concern in CDR, in this paper, we propose a novel two stage based privacy-preserving CDR framework (PriCDR). In the first stage, we propose two methods, i.e., Johnson-Lindenstrauss Transform (JLT) based and Sparse-awareJLT (SJLT) based, to publish the rating matrix of the source domain using differential privacy. We theoretically analyze the privacy and utility of our proposed differential privacy based rating publishing methods. In the second stage, we propose a novel heterogeneous CDR model (HeteroCDR), which uses deep auto-encoder and deep neural network to model the published source rating matrix and target rating matrix respectively. To this end, PriCDR can not only protect the data privacy of the source domain, but also alleviate the data sparsity of the source domain. We conduct experiments on two benchmark datasets and the results demonstrate the effectiveness of our proposed PriCDR and HeteroCDR.  ( 2 min )
    Estimation of Clinical Workload and Patient Activity using Deep Learning and Optical Flow. (arXiv:2202.04748v1 [cs.CV])
    Contactless monitoring using thermal imaging has become increasingly proposed to monitor patient deterioration in hospital, most recently to detect fevers and infections during the COVID-19 pandemic. In this letter, we propose a novel method to estimate patient motion and observe clinical workload using a similar technical setup but combined with open source object detection algorithms (YOLOv4) and optical flow. Patient motion estimation was used to approximate patient agitation and sedation, while worker motion was used as a surrogate for caregiver workload. Performance was illustrated by comparing over 32000 frames from videos of patients recorded in an Intensive Care Unit, to clinical agitation scores recorded by clinical workers.  ( 2 min )
    PPA: Preference Profiling Attack Against Federated Learning. (arXiv:2202.04856v1 [cs.LG])
    Federated learning (FL) trains a global model across a number of decentralized participants, each with a local dataset. Compared to traditional centralized learning, FL does not require direct local datasets access and thus mitigates data security and privacy concerns. However, data privacy concerns for FL still exist due to inference attacks, including known membership inference, property inference, and data inversion. In this work, we reveal a new type of privacy inference attack, coined Preference Profiling Attack (PPA), that accurately profiles private preferences of a local user. In general, the PPA can profile top-k, especially for top-1, preferences contingent on the local user's characteristics. Our key insight is that the gradient variation of a local user's model has a distinguishable sensitivity to the sample proportion of a given class, especially the majority/minority class. By observing a user model's gradient sensitivity to a class, the PPA can profile the sample proportion of the class in the user's local dataset and thus the user's preference of the class is exposed. The inherent statistical heterogeneity of FL further facilitates the PPA. We have extensively evaluated the PPA's effectiveness using four datasets from the image domains of MNIST, CIFAR10, Products-10K and RAF-DB. Our results show that the PPA achieves 90% and 98% top-1 attack accuracy to the MNIST and CIFAR10, respectively. More importantly, in the real-world commercial scenarios of shopping (i.e., Products-10K) and the social network (i.e., RAF-DB), the PPA gains a top-1 attack accuracy of 78% in the former case to infer the most ordered items, and 88% in the latter case to infer a victim user's emotions. Although existing countermeasures such as dropout and differential privacy protection can lower the PPA's accuracy to some extent, they unavoidably incur notable global model deterioration.  ( 2 min )
    Understanding Hyperdimensional Computing for Parallel Single-Pass Learning. (arXiv:2202.04805v1 [cs.LG])
    Hyperdimensional computing (HDC) is an emerging learning paradigm that computes with high dimensional binary vectors. It is attractive because of its energy efficiency and low latency, especially on emerging hardware -- but HDC suffers from low model accuracy, with little theoretical understanding of what limits its performance. We propose a new theoretical analysis of the limits of HDC via a consideration of what similarity matrices can be "expressed" by binary vectors, and we show how the limits of HDC can be approached using random Fourier features (RFF). We extend our analysis to the more general class of vector symbolic architectures (VSA), which compute with high-dimensional vectors (hypervectors) that are not necessarily binary. We propose a new class of VSAs, finite group VSAs, which surpass the limits of HDC. Using representation theory, we characterize which similarity matrices can be "expressed" by finite group VSA hypervectors, and we show how these VSAs can be constructed. Experimental results show that our RFF method and group VSA can both outperform the state-of-the-art HDC model by up to 7.6\% while maintaining hardware efficiency.  ( 2 min )
    Multi-relation Message Passing for Multi-label Text Classification. (arXiv:2202.04844v1 [cs.LG])
    A well-known challenge associated with the multi-label classification problem is modelling dependencies between labels. Most attempts at modelling label dependencies focus on co-occurrences, ignoring the valuable information that can be extracted by detecting label subsets that rarely occur together. For example, consider customer product reviews; a product probably would not simultaneously be tagged by both "recommended" (i.e., reviewer is happy and recommends the product) and "urgent" (i.e., the review suggests immediate action to remedy an unsatisfactory experience). Aside from the consideration of positive and negative dependencies, the direction of a relationship should also be considered. For a multi-label image classification problem, the "ship" and "sea" labels have an obvious dependency, but the presence of the former implies the latter much more strongly than the other way around. These examples motivate the modelling of multiple types of bi-directional relationships between labels. In this paper, we propose a novel method, entitled Multi-relation Message Passing (MrMP), for the multi-label classification problem. Experiments on benchmark multi-label text classification datasets show that the MrMP module yields similar or superior performance compared to state-of-the-art methods. The approach imposes only minor additional computational and memory overheads.  ( 2 min )
    GrASP: Gradient-Based Affordance Selection for Planning. (arXiv:2202.04772v1 [cs.LG])
    Planning with a learned model is arguably a key component of intelligence. There are several challenges in realizing such a component in large-scale reinforcement learning (RL) problems. One such challenge is dealing effectively with continuous action spaces when using tree-search planning (e.g., it is not feasible to consider every action even at just the root node of the tree). In this paper we present a method for selecting affordances useful for planning -- for learning which small number of actions/options from a continuous space of actions/options to consider in the tree-expansion process during planning. We consider affordances that are goal-and-state-conditional mappings to actions/options as well as unconditional affordances that simply select actions/options available in all states. Our selection method is gradient based: we compute gradients through the planning procedure to update the parameters of the function that represents affordances. Our empirical work shows that it is feasible to learn to select both primitive-action and option affordances, and that simultaneously learning to select affordances and planning with a learned value-equivalent model can outperform model-free RL.  ( 2 min )
    Measuring disentangled generative spatio-temporal representation. (arXiv:2202.04821v1 [cs.LG])
    Disentangled representation learning offers useful properties such as dimension reduction and interpretability, which are essential to modern deep learning approaches. Although deep learning techniques have been widely applied to spatio-temporal data mining, there has been little attention to further disentangle the latent features and understanding their contribution to the model performance, particularly their mutual information and correlation across features. In this study, we adopt two state-of-the-art disentangled representation learning methods and apply them to three large-scale public spatio-temporal datasets. To evaluate their performance, we propose an internal evaluation metric focusing on the degree of correlations among latent variables of the learned representations and the prediction performance of the downstream tasks. Empirical results show that our modified method can learn disentangled representations that achieve the same level of performance as existing state-of-the-art ST deep learning methods in a spatio-temporal sequence forecasting problem. Additionally, we find that our methods can be used to discover real-world spatial-temporal semantics to describe the variables in the learned representation.  ( 2 min )
    Discovering Concepts in Learned Representations using Statistical Inference and Interactive Visualization. (arXiv:2202.04753v1 [cs.LG])
    Concept discovery is one of the open problems in the interpretability literature that is important for bridging the gap between non-deep learning experts and model end-users. Among current formulations, concepts defines them by as a direction in a learned representation space. This definition makes it possible to evaluate whether a particular concept significantly influences classification decisions for classes of interest. However, finding relevant concepts is tedious, as representation spaces are high-dimensional and hard to navigate. Current approaches include hand-crafting concept datasets and then converting them to latent space directions; alternatively, the process can be automated by clustering the latent space. In this study, we offer another two approaches to guide user discovery of meaningful concepts, one based on multiple hypothesis testing, and another on interactive visualization. We explore the potential value and limitations of these approaches through simulation experiments and an demo visual interface to real data. Overall, we find that these techniques offer a promising strategy for discovering relevant concepts in settings where users do not have predefined descriptions of them, but without completely automating the process.  ( 2 min )
    L0Learn: A Scalable Package for Sparse Learning using L0 Regularization. (arXiv:2202.04820v1 [cs.LG])
    We introduce L0Learn: an open-source package for sparse regression and classification using L0 regularization. L0Learn implements scalable, approximate algorithms, based on coordinate descent and local combinatorial optimization. The package is built using C++ and has a user-friendly R interface. Our experiments indicate that L0Learn can scale to problems with millions of features, achieving competitive run times with state-of-the-art sparse learning packages. L0Learn is available on both CRAN and GitHub.  ( 2 min )
    Survey on Graph Neural Network Acceleration: An Algorithmic Perspective. (arXiv:2202.04822v1 [cs.LG])
    Graph neural networks (GNNs) have been a hot spot of recent research and are widely utilized in diverse applications. However, with the use of huger data and deeper models, an urgent demand is unsurprisingly made to accelerate GNNs for more efficient execution. In this paper, we provide a comprehensive survey on acceleration methods for GNNs from an algorithmic perspective. We first present a new taxonomy to classify existing acceleration methods into five categories. Based on the classification, we systematically discuss these methods and highlight their correlations. Next, we provide comparisons from aspects of the efficiency and characteristics of these methods. Finally, we suggest some promising prospects for future research.  ( 2 min )
    A Neural Network Model of Continual Learning with Cognitive Control. (arXiv:2202.04773v1 [q-bio.NC])
    Neural networks struggle in continual learning settings from catastrophic forgetting: when trials are blocked, new learning can overwrite the learning from previous blocks. Humans learn effectively in these settings, in some cases even showing an advantage of blocking, suggesting the brain contains mechanisms to overcome this problem. Here, we build on previous work and show that neural networks equipped with a mechanism for cognitive control do not exhibit catastrophic forgetting when trials are blocked. We further show an advantage of blocking over interleaving when there is a bias for active maintenance in the control signal, implying a tradeoff between maintenance and the strength of control. Analyses of map-like representations learned by the networks provided additional insights into these mechanisms. Our work highlights the potential of cognitive control to aid continual learning in neural networks, and offers an explanation for the advantage of blocking that has been observed in humans.  ( 2 min )
    A survey of unsupervised learning methods for high-dimensional uncertainty quantification in black-box-type problems. (arXiv:2202.04648v1 [cs.LG])
    Constructing surrogate models for uncertainty quantification (UQ) on complex partial differential equations (PDEs) having inherently high-dimensional $\mathcal{O}(10^{\ge 2})$ stochastic inputs (e.g., forcing terms, boundary conditions, initial conditions) poses tremendous challenges. The curse of dimensionality can be addressed with suitable unsupervised learning techniques used as a pre-processing tool to encode inputs onto lower-dimensional subspaces while retaining its structural information and meaningful properties. In this work, we review and investigate thirteen dimension reduction methods including linear and nonlinear, spectral, blind source separation, convex and non-convex methods and utilize the resulting embeddings to construct a mapping to quantities of interest via polynomial chaos expansions (PCE). We refer to the general proposed approach as manifold PCE (m-PCE), where manifold corresponds to the latent space resulting from any of the studied dimension reduction methods. To investigate the capabilities and limitations of these methods we conduct numerical tests for three physics-based systems (treated as black-boxes) having high-dimensional stochastic inputs of varying complexity modeled as both Gaussian and non-Gaussian random fields to investigate the effect of the intrinsic dimensionality of input data. We demonstrate both the advantages and limitations of the unsupervised learning methods and we conclude that a suitable m-PCE model provides a cost-effective approach compared to alternative algorithms proposed in the literature, including recently proposed expensive deep neural network-based surrogates and can be readily applied for high-dimensional UQ in stochastic PDEs.  ( 2 min )
    Robust Bayesian Inference for Simulator-based Models via the MMD Posterior Bootstrap. (arXiv:2202.04744v1 [stat.ME])
    Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. They are often used to describe complex real-world phenomena, and as such can often be misspecified in practice. Unfortunately, existing Bayesian approaches for simulators are known to perform poorly in those cases. In this paper, we propose a novel algorithm based on the posterior bootstrap and maximum mean discrepancy estimators. This leads to a highly-parallelisable Bayesian inference algorithm with strong robustness properties. This is demonstrated through an in-depth theoretical study which includes generalisation bounds and proofs of frequentist consistency and robustness of our posterior. The approach is then assessed on a range of examples including a g-and-k distribution and a toggle-switch model.  ( 2 min )
    Designing Closed Human-in-the-loop Deferral Pipelines. (arXiv:2202.04718v1 [cs.HC])
    In hybrid human-machine deferral frameworks, a classifier can defer uncertain cases to human decision-makers (who are often themselves fallible). Prior work on simultaneous training of such classifier and deferral models has typically assumed access to an oracle during training to obtain true class labels for training samples, but in practice there often is no such oracle. In contrast, we consider a "closed" decision-making pipeline in which the same fallible human decision-makers used in deferral also provide training labels. How can imperfect and biased human expert labels be used to train a fair and accurate deferral framework? Our key insight is that by exploiting weak prior information, we can match experts to input examples to ensure fairness and accuracy of the resulting deferral framework, even when imperfect and biased experts are used in place of ground truth labels. The efficacy of our approach is shown both by theoretical analysis and by evaluation on two tasks.  ( 2 min )
    The leap to ordinal: functional prognosis after traumatic brain injury using artificial intelligence. (arXiv:2202.04801v1 [cs.LG])
    When a patient is admitted to the intensive care unit (ICU) after a traumatic brain injury (TBI), an early prognosis is essential for baseline risk adjustment and shared decision making. TBI outcomes are commonly categorised by the Glasgow Outcome Scale-Extended (GOSE) into 8, ordered levels of functional recovery at 6 months after injury. Existing ICU prognostic models predict binary outcomes at a certain threshold of GOSE (e.g., prediction of survival [GOSE>1] or functional independence [GOSE>4]). We aimed to develop ordinal prediction models that concurrently predict probabilities of each GOSE score. From a prospective cohort (n=1,550, 65 centres) in the ICU stratum of the Collaborative European NeuroTrauma Effectiveness Research in TBI (CENTER-TBI) patient dataset, we extracted all clinical information within 24 hours of ICU admission (1,151 predictors) and 6-month GOSE scores. We analysed the effect of 2 design elements on ordinal model performance: (1) the baseline predictor set, ranging from a concise set of 10 validated predictors to a token-embedded representation of all possible predictors, and (2) the modelling strategy, from ordinal logistic regression to multinomial deep learning. With repeated k-fold cross-validation, we found that expanding the baseline predictor set significantly improved ordinal prediction performance while increasing analytical complexity did not. Half of these gains could be achieved with the addition of 8 high-impact predictors (2 demographic variables, 4 protein biomarkers, and 2 severity assessments) to the concise set. At best, ordinal models achieved 0.76 (95% CI: 0.74-0.77) ordinal discrimination ability (ordinal c-index) and 57% (95% CI: 54%-60%) explanation of ordinal variation in 6-month GOSE (Somers' D). Our results motivate the search for informative predictors for higher GOSE and the development of ordinal dynamic prediction models.  ( 2 min )
    Bayesian Nonparametrics for Offline Skill Discovery. (arXiv:2202.04675v1 [cs.LG])
    Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO .  ( 2 min )
    Learning Latent Causal Dynamics. (arXiv:2202.04828v1 [stat.ML])
    One critical challenge of time-series modeling is how to learn and quickly correct the model under unknown distribution shifts. In this work, we propose a principled framework, called LiLY, to first recover time-delayed latent causal variables and identify their relations from measured temporal data under different distribution shifts. The correction step is then formulated as learning the low-dimensional change factors with a few samples from the new environment, leveraging the identified causal structure. Specifically, the framework factorizes unknown distribution shifts into transition distribution changes caused by fixed dynamics and time-varying latent causal relations, and by global changes in observation. We establish the identifiability theories of nonparametric latent causal dynamics from their nonlinear mixtures under fixed dynamics and under changes. Through experiments, we show that time-delayed latent causal influences are reliably identified from observed variables under different distribution changes. By exploiting this modular representation of changes, we can efficiently learn to correct the model under unknown distribution shifts with only a few samples.  ( 2 min )
    Orthogonal Matrices for MBAT Vector Symbolic Architectures, and a "Soft" VSA Representation for JSON. (arXiv:2202.04771v1 [cs.LG])
    Vector Symbolic Architectures (VSAs) give a way to represent a complex object as a single fixed-length vector, so that similar objects have similar vector representations. These vector representations then become easy to use for machine learning or nearest-neighbor search. We review a previously proposed VSA method, MBAT (Matrix Binding of Additive Terms), which uses multiplication by random matrices for binding related terms. However, multiplying by such matrices introduces instabilities which can harm performance. Making the random matrices be orthogonal matrices provably fixes this problem. With respect to larger scale applications, we see how to apply MBAT vector representations for any data expressed in JSON. JSON is used in numerous programming languages to express complex data, but its native format appears highly unsuited for machine learning. Expressing JSON as a fixed-length vector makes it readily usable for machine learning and nearest-neighbor search. Creating such JSON vectors also shows that a VSA needs to employ binding operations that are non-commutative. VSAs are now ready to try with full-scale practical applications, including healthcare, pharmaceuticals, and genomics. Keywords: MBAT (Matrix Binding of Additive Terms), VSA (Vector Symbolic Architecture), HDC (Hyperdimensional Computing), Distributed Representations, Binding, Orthogonal Matrices, Recurrent Connections, Machine Learning, Search, JSON, VSA Applications  ( 2 min )
    Predicting Human Similarity Judgments Using Large Language Models. (arXiv:2202.04728v1 [cs.LG])
    Similarity judgments provide a well-established method for accessing mental representations, with applications in psychology, neuroscience and machine learning. However, collecting similarity judgments can be prohibitively expensive for naturalistic datasets as the number of comparisons grows quadratically in the number of stimuli. One way to tackle this problem is to construct approximation procedures that rely on more accessible proxies for predicting similarity. Here we leverage recent advances in language models and online recruitment, proposing an efficient domain-general procedure for predicting human similarity judgments based on text descriptions. Intuitively, similar stimuli are likely to evoke similar descriptions, allowing us to use description similarity to predict pairwise similarity judgments. Crucially, the number of descriptions required grows only linearly with the number of stimuli, drastically reducing the amount of data required. We test this procedure on six datasets of naturalistic images and show that our models outperform previous approaches based on visual information.  ( 2 min )
    Non-Linear Spectral Dimensionality Reduction Under Uncertainty. (arXiv:2202.04678v1 [cs.LG])
    In this paper, we consider the problem of non-linear dimensionality reduction under uncertainty, both from a theoretical and algorithmic perspectives. Since real-world data usually contain measurements with uncertainties and artifacts, the input space in the proposed framework consists of probability distributions to model the uncertainties associated with each sample. We propose a new dimensionality reduction framework, called NGEU, which leverages uncertainty information and directly extends several traditional approaches, e.g., KPCA, MDA/KMFA, to receive as inputs the probability distributions instead of the original data. We show that the proposed NGEU formulation exhibits a global closed-form solution, and we analyze, based on the Rademacher complexity, how the underlying uncertainties theoretically affect the generalization ability of the framework. Empirical results on different datasets show the effectiveness of the proposed framework.  ( 2 min )
    Unsupervised Time-Series Representation Learning with Iterative Bilinear Temporal-Spectral Fusion. (arXiv:2202.04770v1 [cs.LG])
    Unsupervised/self-supervised time series representation learning is a challenging problem because of its complex dynamics and sparse annotations. Existing works mainly adopt the framework of contrastive learning with the time-based augmentation techniques to sample positives and negatives for contrastive training. Nevertheless, they mostly use segment-level augmentation derived from time slicing, which may bring about sampling bias and incorrect optimization with false negatives due to the loss of global context. Besides, they all pay no attention to incorporate the spectral information in feature representation. In this paper, we propose a unified framework, namely Bilinear Temporal-Spectral Fusion (BTSF). Specifically, we firstly utilize the instance-level augmentation with a simple dropout on the entire time series for maximally capturing long-term dependencies. We devise a novel iterative bilinear temporal-spectral fusion to explicitly encode the affinities of abundant time-frequency pairs, and iteratively refines representations in a fusion-and-squeeze manner with Spectrum-to-Time (S2T) and Time-to-Spectrum (T2S) Aggregation modules. We firstly conducts downstream evaluations on three major tasks for time series including classification, forecasting and anomaly detection. Experimental results shows that our BTSF consistently significantly outperforms the state-of-the-art methods.  ( 2 min )
    Augmenting Neural Networks with Priors on Function Values. (arXiv:2202.04798v1 [cs.LG])
    The need for function estimation in label-limited settings is common in the natural sciences. At the same time, prior knowledge of function values is often available in these domains. For example, data-free biophysics-based models can be informative on protein properties, while quantum-based computations can be informative on small molecule properties. How can we coherently leverage such prior knowledge to help improve a neural network model that is quite accurate in some regions of input space -- typically near the training data -- but wildly wrong in other regions? Bayesian neural networks (BNN) enable the user to specify prior information only on the neural network weights, not directly on the function values. Moreover, there is in general no clear mapping between these. Herein, we tackle this problem by developing an approach to augment BNNs with prior information on the function values themselves. Our probabilistic approach yields predictions that rely more heavily on the prior information when the epistemic uncertainty is large, and more heavily on the neural network when the epistemic uncertainty is small.  ( 2 min )
    Target-aware Molecular Graph Generation. (arXiv:2202.04829v1 [cs.LG])
    Generating molecules with desired biological activities has attracted growing attention in drug discovery. Previous molecular generation models are designed as chemocentric methods that hardly consider the drug-target interaction, limiting their practical applications. In this paper, we aim to generate molecular drugs in a target-aware manner that bridges biological activity and molecular design. To solve this problem, we compile a benchmark dataset from several publicly available datasets and build baselines in a unified framework. Building on the recent advantages of flow-based molecular generation models, we propose SiamFlow, which forces the flow to fit the distribution of target sequence embeddings in latent space. Specifically, we employ an alignment loss and a uniform loss to bring target sequence embeddings and drug graph embeddings into agreements while avoiding collapse. Furthermore, we formulate the alignment into a one-to-many problem by learning spaces of target sequence embeddings. Experiments quantitatively show that our proposed method learns meaningful representations in the latent space toward the target-aware molecular graph generation and provides an alternative approach to bridge biology and chemistry in drug discovery.  ( 2 min )
    Exact Solutions of a Deep Linear Network. (arXiv:2202.04777v1 [stat.ML])
    This work finds the exact solutions to a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that weight decay strongly interacts with the model architecture and can create bad minima in a network with more than $1$ hidden layer, qualitatively different for a network with only $1$ hidden layer. As an application, we also analyze stochastic nets and show that their prediction variance vanishes to zero as the stochasticity, the width, or the depth tends to infinity.  ( 2 min )
    Semantic Segmentation of Anaemic RBCs Using Multilevel Deep Convolutional Encoder-Decoder Network. (arXiv:2202.04650v1 [eess.IV])
    Pixel-level analysis of blood images plays a pivotal role in diagnosing blood-related diseases, especially Anaemia. These analyses mainly rely on an accurate diagnosis of morphological deformities like shape, size, and precise pixel counting. In traditional segmentation approaches, instance or object-based approaches have been adopted that are not feasible for pixel-level analysis. The convolutional neural network (CNN) model required a large dataset with detailed pixel-level information for the semantic segmentation of red blood cells in the deep learning domain. In current research work, we address these problems by proposing a multi-level deep convolutional encoder-decoder network along with two state-of-the-art healthy and Anaemic-RBC datasets. The proposed multi-level CNN model preserved pixel-level semantic information extracted in one layer and then passed to the next layer to choose relevant features. This phenomenon helps to precise pixel-level counting of healthy and anaemic-RBC elements along with morphological analysis. For experimental purposes, we proposed two state-of-the-art RBC datasets, i.e., Healthy-RBCs and Anaemic-RBCs dataset. Each dataset contains 1000 images, ground truth masks, relevant, complete blood count (CBC), and morphology reports for performance evaluation. The proposed model results were evaluated using crossmatch analysis with ground truth mask by finding IoU, individual training, validation, testing accuracies, and global accuracies using a 05-fold training procedure. This model got training, validation, and testing accuracies as 0.9856, 0.9760, and 0.9720 on the Healthy-RBC dataset and 0.9736, 0.9696, and 0.9591 on an Anaemic-RBC dataset. The IoU and BFScore of the proposed model were 0.9311, 0.9138, and 0.9032, 0.8978 on healthy and anaemic datasets, respectively.  ( 2 min )
    Active Learning Improves Performance on Symbolic RegressionTasks in StackGP. (arXiv:2202.04708v1 [cs.LG])
    In this paper we introduce an active learning method for symbolic regression using StackGP. The approach begins with a small number of data points for StackGP to model. To improve the model the system incrementally adds a data point such that the new point maximizes prediction uncertainty as measured by the model ensemble. Symbolic regression is re-run with the larger data set. This cycle continues until the system satisfies a termination criterion. We use the Feynman AI benchmark set of equations to examine the ability of our method to find appropriate models using fewer data points. The approach was found to successfully rediscover 72 of the 100 Feynman equations using as few data points as possible, and without use of domain expertise or data translation.  ( 2 min )
    No-Regret Learning in Dynamic Stackelberg Games. (arXiv:2202.04786v1 [cs.GT])
    In a Stackelberg game, a leader commits to a randomized strategy, and a follower chooses their best strategy in response. We consider an extension of a standard Stackelberg game, called a discrete-time dynamic Stackelberg game, that has an underlying state space that affects the leader's rewards and available strategies and evolves in a Markovian manner depending on both the leader and follower's selected strategies. Although standard Stackelberg games have been utilized to improve scheduling in security domains, their deployment is often limited by requiring complete information of the follower's utility function. In contrast, we consider scenarios where the follower's utility function is unknown to the leader; however, it can be linearly parameterized. Our objective then is to provide an algorithm that prescribes a randomized strategy to the leader at each step of the game based on observations of how the follower responded in previous steps. We design a no-regret learning algorithm that, with high probability, achieves a regret bound (when compared to the best policy in hindsight) which is sublinear in the number of time steps; the degree of sublinearity depends on the number of features representing the follower's utility function. The regret of the proposed learning algorithm is independent of the size of the state space and polynomial in the rest of the parameters of the game. We show that the proposed learning algorithm outperforms existing model-free reinforcement learning approaches.  ( 2 min )
    Can Humans Do Less-Than-One-Shot Learning?. (arXiv:2202.04670v1 [cs.LG])
    Being able to learn from small amounts of data is a key characteristic of human intelligence, but exactly {\em how} small? In this paper, we introduce a novel experimental paradigm that allows us to examine classification in an extremely data-scarce setting, asking whether humans can learn more categories than they have exemplars (i.e., can humans do "less-than-one shot" learning?). An experiment conducted using this paradigm reveals that people are capable of learning in such settings, and provides several insights into underlying mechanisms. First, people can accurately infer and represent high-dimensional feature spaces from very little data. Second, having inferred the relevant spaces, people use a form of prototype-based categorization (as opposed to exemplar-based) to make categorical inferences. Finally, systematic, machine-learnable patterns in responses indicate that people may have efficient inductive biases for dealing with this class of data-scarce problems.  ( 2 min )
    FedQAS: Privacy-aware machine reading comprehension with federated learning. (arXiv:2202.04742v1 [cs.CL])
    Machine reading comprehension (MRC) of text data is one important task in Natural Language Understanding. It is a complex NLP problem with a lot of ongoing research fueled by the release of the Stanford Question Answering Dataset (SQuAD) and Conversational Question Answering (CoQA). It is considered to be an effort to teach computers how to "understand" a text, and then to be able to answer questions about it using deep learning. However, until now large-scale training on private text data and knowledge sharing has been missing for this NLP task. Hence, we present FedQAS, a privacy-preserving machine reading system capable of leveraging large-scale private data without the need to pool those datasets in a central location. The proposed approach combines transformer models and federated learning technologies. The system is developed using the FEDn framework and deployed as a proof-of-concept alliance initiative. FedQAS is flexible, language-agnostic, and allows intuitive participation and execution of local model training. In addition, we present the architecture and implementation of the system, as well as provide a reference evaluation based on the SQUAD dataset, to showcase how it overcomes data privacy issues and enables knowledge sharing between alliance members in a Federated learning setting.  ( 2 min )
    New Projection-free Algorithms for Online Convex Optimization with Adaptive Regret Guarantees. (arXiv:2202.04721v1 [cs.LG])
    We present new efficient \textit{projection-free} algorithms for online convex optimization (OCO), where by projection-free we refer to algorithms that avoid computing orthogonal projections onto the feasible set, and instead relay on different and potentially much more efficient oracles. While most state-of-the-art projection-free algorithms are based on the \textit{follow-the-leader} framework, our algorithms are fundamentally different and are based on the \textit{online gradient descent} algorithm with a novel and efficient approach to computing so-called \textit{infeasible projections}. As a consequence, we obtain the first projection-free algorithms which naturally yield \textit{adaptive regret} guarantees, i.e., regret bounds that hold w.r.t. any sub-interval of the sequence. Concretely, when assuming the availability of a linear optimization oracle (LOO) for the feasible set, on a sequence of length $T$, our algorithms guarantee $O(T^{3/4})$ adaptive regret and $O(T^{3/4})$ adaptive expected regret, for the full-information and bandit settings, respectively, using only $O(T)$ calls to the LOO. These bounds match the current state-of-the-art regret bounds for LOO-based projection-free OCO, which are \textit{not adaptive}. We also consider a new natural setting in which the feasible set is accessible through a separation oracle. We present algorithms which, using overall $O(T)$ calls to the separation oracle, guarantee $O(\sqrt{T})$ adaptive regret and $O(T^{3/4})$ adaptive expected regret for the full-information and bandit settings, respectively.  ( 2 min )
    FCM-DNN: diagnosing coronary artery disease by deep accuracy Fuzzy C-Means clustering model. (arXiv:2202.04645v1 [eess.IV])
    Cardiovascular disease is one of the most challenging diseases in middle-aged and older people, which causes high mortality. Coronary artery disease (CAD) is known as a common cardiovascular disease. A standard clinical tool for diagnosing CAD is angiography. The main challenges are dangerous side effects and high angiography costs. Today, the development of artificial intelligence-based methods is a valuable achievement for diagnosing disease. Hence, in this paper, artificial intelligence methods such as neural network (NN), deep neural network (DNN), and Fuzzy C-Means clustering combined with deep neural network (FCM-DNN) are developed for diagnosing CAD on a cardiac magnetic resonance imaging (CMRI) dataset. The original dataset is used in two different approaches. First, the labeled dataset is applied to the NN and DNN to create the NN and DNN models. Second, the labels are removed, and the unlabeled dataset is clustered via the FCM method, and then, the clustered dataset is fed to the DNN to create the FCM-DNN model. By utilizing the second clustering and modeling, the training process is improved, and consequently, the accuracy is increased. As a result, the proposed FCM-DNN model achieves the best performance with a 99.91% accuracy specifying 10 clusters, i.e., 5 clusters for healthy subjects and 5 clusters for sick subjects, through the 10-fold cross-validation technique compared to the NN and DNN models reaching the accuracies of 92.18% and 99.63%, respectively. To the best of our knowledge, no study has been conducted for CAD diagnosis on the CMRI dataset using artificial intelligence methods. The results confirm that the proposed FCM-DNN model can be helpful for scientific and research centers.  ( 2 min )
    Dimensionally Consistent Learning with Buckingham Pi. (arXiv:2202.04643v1 [cs.LG])
    In the absence of governing equations, dimensional analysis is a robust technique for extracting insights and finding symmetries in physical systems. Given measurement variables and parameters, the Buckingham Pi theorem provides a procedure for finding a set of dimensionless groups that spans the solution space, although this set is not unique. We propose an automated approach using the symmetric and self-similar structure of available measurement data to discover the dimensionless groups that best collapse this data to a lower dimensional space according to an optimal fit. We develop three data-driven techniques that use the Buckingham Pi theorem as a constraint: (i) a constrained optimization problem with a non-parametric input-output fitting function, (ii) a deep learning algorithm (BuckiNet) that projects the input parameter space to a lower dimension in the first layer, and (iii) a technique based on sparse identification of nonlinear dynamics (SINDy) to discover dimensionless equations whose coefficients parameterize the dynamics. We explore the accuracy, robustness and computational complexity of these methods as applied to three example problems: a bead on a rotating hoop, a laminar boundary layer, and Rayleigh-B\'enard convection.  ( 2 min )
    Smoothed Online Learning is as Easy as Statistical Learning. (arXiv:2202.04690v1 [stat.ML])
    Much of modern learning theory has been split between two regimes: the classical \emph{offline} setting, where data arrive independently, and the \emph{online} setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.  ( 2 min )
    A Coupled CP Decomposition for Principal Components Analysis of Symmetric Networks. (arXiv:2202.04719v1 [stat.ML])
    In a number of application domains, one observes a sequence of network data; for example, repeated measurements between users interactions in social media platforms, financial correlation networks over time, or across subjects, as in multi-subject studies of brain connectivity. One way to analyze such data is by stacking networks into a third-order array or tensor. We propose a principal components analysis (PCA) framework for sequence network data, based on a novel decomposition for semi-symmetric tensors. We derive efficient algorithms for computing our proposed "Coupled CP" decomposition and establish estimation consistency of our approach under an analogue of the spiked covariance model with rates the same as the matrix case up to a logarithmic term. Our framework inherits many of the strengths of classical PCA and is suitable for a wide range of unsupervised learning tasks, including identifying principal networks, isolating meaningful changepoints or outliers across observations, and for characterizing the "variability network" of the most varying edges. Finally, we demonstrate the effectiveness of our proposal on simulated data and on examples from political science and financial economics. The proof techniques used to establish our main consistency results are surprisingly straight-forward and may find use in a variety of other matrix and tensor decomposition problems.  ( 2 min )
    Coarsening the Granularity: Towards Structurally Sparse Lottery Tickets. (arXiv:2202.04736v1 [cs.LG])
    The lottery ticket hypothesis (LTH) has shown that dense models contain highly sparse subnetworks (i.e., winning tickets) that can be trained in isolation to match full accuracy. Despite many exciting efforts being made, there is one "commonsense" seldomly challenged: a winning ticket is found by iterative magnitude pruning (IMP) and hence the resultant pruned subnetworks have only unstructured sparsity. That gap limits the appeal of winning tickets in practice, since the highly irregular sparse patterns are challenging to accelerate on hardware. Meanwhile, directly substituting structured pruning for unstructured pruning in IMP damages performance more severely and is usually unable to locate winning tickets. In this paper, we demonstrate the first positive result that a structurally sparse winning ticket can be effectively found in general. The core idea is to append "post-processing techniques" after each round of (unstructured) IMP, to enforce the formation of structural sparsity. Specifically, we first "re-fill" pruned elements back in some channels deemed to be important, and then "re-group" non-zero elements to create flexible group-wise structural patterns. Both our identified channel- and group-wise structural subnetworks win the lottery, with substantial inference speedups readily supported by existing hardware. Extensive experiments, conducted on diverse datasets across multiple network backbones, consistently validate our proposal, showing that the hardware acceleration roadblock of LTH is now removed. Specifically, the structural winning tickets obtain up to {64.93%, 64.84%, 64.84%} running time savings at {36% ~ 80%, 74%, 58%} sparsity on {CIFAR, Tiny-ImageNet, ImageNet}, while maintaining comparable accuracy. Codes are available in https://github.com/VITA-Group/Structure-LTH.  ( 2 min )
  • Open

    Disentanglement Analysis with Partial Information Decomposition. (arXiv:2108.13753v2 [stat.ML] UPDATED)
    We propose a framework to analyze how multivariate representations disentangle ground-truth generative factors. A quantitative analysis of disentanglement has been based on metrics designed to compare how one variable explains each generative factor. Current metrics, however, may fail to detect entanglement that involves more than two variables, e.g., representations that duplicate and rotate generative factors in high dimensional spaces. In this work, we establish a framework to analyze information sharing in a multivariate representation with Partial Information Decomposition and propose a new disentanglement metric. This framework enables us to understand disentanglement in terms of uniqueness, redundancy, and synergy. We develop an experimental protocol to assess how increasingly entangled representations are evaluated with each metric and confirm that the proposed metric correctly responds to entanglement. Through experiments on variational autoencoders, we find that models with similar disentanglement scores have a variety of characteristics in entanglement, for each of which a distinct strategy may be required to obtain a disentangled representation.  ( 2 min )
    The effective noise of Stochastic Gradient Descent. (arXiv:2112.10852v2 [cond-mat.dis-nn] UPDATED)
    Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently-introduced variant, \emph{persistent} SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of SGD noise. We find that the two noise measures behave similarly as a function of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.  ( 2 min )
    Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space Perspective. (arXiv:2110.03095v2 [cs.LG] UPDATED)
    Deep neural networks (DNNs) often rely on easy-to-learn discriminatory features, or cues, that are not necessarily essential to the problem at hand. For example, ducks in an image may be recognized based on their typical background scenery, such as lakes or streams. This phenomenon, also known as shortcut learning, is emerging as a key limitation of the current generation of machine learning models. In this work, we introduce a set of experiments to deepen our understanding of shortcut learning and its implications. We design a training setup with several shortcut cues, named WCST-ML, where each cue is equally conducive to the visual recognition problem at hand. Even under equal opportunities, we observe that (1) certain cues are preferred to others, (2) solutions biased to the easy-to-learn cues tend to converge to relatively flat minima on the loss surface, and (3) the solutions focusing on those preferred cues are far more abundant in the parameter space. We explain the abundance of certain cues via their Kolmogorov (descriptional) complexity: solutions corresponding to Kolmogorov-simple cues are abundant in the parameter space and are thus preferred by DNNs. Our studies are based on the synthetic dataset DSprites and the face dataset UTKFace. In our WCST-ML, we observe that the inborn bias of models leans toward simple cues, such as color and ethnicity. Our findings emphasize the importance of active human intervention to remove the inborn model biases that may cause negative societal impacts.  ( 2 min )
    Quantum machine learning beyond kernel methods. (arXiv:2110.13162v2 [quant-ph] UPDATED)
    Machine learning algorithms based on parametrized quantum circuits are a prime candidate for near-term applications on noisy quantum computers. Yet, our understanding of how these quantum machine learning models compare, both mutually and to classical models, remains limited. Previous works achieved important steps in this direction by showing a close connection between some of these quantum models and kernel methods, well-studied in classical machine learning. In this work, we identify the first unifying framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show how data re-uploading circuits, a generalization of linear models, can be efficiently mapped into equivalent linear quantum models. Going further, we also consider the experimentally-relevant resource requirements of these models in terms of qubit number and data-sample efficiency, i.e., amount of data needed to learn. We establish learning separations demonstrating that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially many more data points. Our results constitute significant strides towards a more comprehensive theory of quantum machine learning models as well as provide guidelines on which models may be better suited from experimental perspectives.  ( 2 min )
    Bayes Optimal Algorithm is Suboptimal in Frequentist Best Arm Identification. (arXiv:2202.05193v1 [stat.ML])
    We consider the fixed-budget best arm identification problem with Normal rewards. In this problem, the forecaster is given $K$ arms (treatments) and $T$ time steps. The forecaster attempts to find the best arm in terms of the largest mean via an adaptive experiment conducted with an algorithm. The performance of the algorithm is measured by the simple regret, or the quality of the estimated best arm. It is known that the frequentist simple regret can be exponentially small to $T$ for any fixed parameters, whereas the Bayesian simple regret is $\Theta(T^{-1})$ over a continuous prior distribution. This paper shows that Bayes optimal algorithm, which minimizes the Bayesian simple regret, does not have an exponential simple regret for some parameters. This finding contrasts with the many results indicating the asymptotic equivalence of Bayesian and frequentist algorithms in fixed sampling regimes. While the Bayes optimal algorithm is described in terms of a recursive equation that is virtually impossible to compute exactly, we pave the way to an analysis by introducing a key quantity that we call the expected Bellman improvement.  ( 2 min )
    Joint Shapley values: a measure of joint feature importance. (arXiv:2107.11357v2 [stat.ML] UPDATED)
    The Shapley value is one of the most widely used measures of feature importance partly as it measures a feature's average effect on a model's prediction. We introduce joint Shapley values, which directly extend Shapley's axioms and intuitions: joint Shapley values measure a set of features' average contribution to a model's prediction. We prove the uniqueness of joint Shapley values, for any order of explanation. Results for games show that joint Shapley values present different insights from existing interaction indices, which assess the effect of a feature within a set of features. The joint Shapley values provide intuitive results in ML attribution problems. With binary features, we present a presence-adjusted global value that is more consistent with local intuitions than the usual approach.  ( 2 min )
    Mean Nystr\"om Embeddings for Adaptive Compressive Learning. (arXiv:2110.10996v2 [stat.ML] UPDATED)
    Compressive learning is an approach to efficient large scale learning based on sketching an entire dataset to a single mean embedding (the sketch), i.e. a vector of generalized moments. The learning task is then approximately solved as an inverse problem using an adapted parametric model. Previous works in this context have focused on sketches obtained by averaging random features, that while universal can be poorly adapted to the problem at hand. In this paper, we propose and study the idea of performing sketching based on data-dependent Nystr\"om approximation. From a theoretical perspective we prove that the excess risk can be controlled under a geometric assumption relating the parametric model used to learn from the sketch and the covariance operator associated to the task at hand. Empirically, we show for k-means clustering and Gaussian modeling that for a fixed sketch size, Nystr\"om sketches indeed outperform those built with random features.  ( 2 min )
    An Information-Theoretic Justification for Model Pruning. (arXiv:2102.08329v4 [cs.LG] UPDATED)
    We study the neural network (NN) compression problem, viewing the tension between the compression ratio and NN performance through the lens of rate-distortion theory. We choose a distortion metric that reflects the effect of NN compression on the model output and derive the tradeoff between rate (compression) and distortion. In addition to characterizing theoretical limits of NN compression, this formulation shows that \emph{pruning}, implicitly or explicitly, must be a part of a good compression algorithm. This observation bridges a gap between parts of the literature pertaining to NN and data compression, respectively, providing insight into the empirical success of model pruning. Finally, we propose a novel pruning strategy derived from our information-theoretic formulation and show that it outperforms the relevant baselines on CIFAR-10 and ImageNet datasets.  ( 2 min )
    Understanding Riemannian Acceleration via a Proximal Extragradient Framework. (arXiv:2111.02763v2 [math.OC] UPDATED)
    We contribute to advancing the understanding of Riemannian accelerated gradient methods. In particular, we revisit Accelerated Hybrid Proximal Extragradient(A-HPE), a powerful framework for obtaining Euclidean accelerated methods \citep{monteiro2013accelerated}. Building on A-HPE, we then propose and analyze Riemannian A-HPE. The core of our analysis consists of two key components: (i) a set of new insights into Euclidean A-HPE itself; and (ii) a careful control of metric distortion caused by Riemannian geometry. We illustrate our framework by obtaining a few existing and new Riemannian accelerated gradient methods as special cases, while characterizing their acceleration as corollaries of our main results.  ( 2 min )
    Hardness of Noise-Free Learning for Two-Hidden-Layer Neural Networks. (arXiv:2202.05258v1 [cs.LG])
    We give exponential statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) or restricted models such as correlational SQ. Prior work hinted at the impossibility of our result: Vempala and Wilmes showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from membership queries.  ( 2 min )
    Adaptive and Robust Multi-task Learning. (arXiv:2202.05250v1 [stat.ML])
    We study the multi-task learning problem that aims to simultaneously analyze multiple datasets collected from different sources and learn one model for each of them. We propose a family of adaptive methods that automatically utilize possible similarities among those tasks while carefully handling their differences. We derive sharp statistical guarantees for the methods and prove their robustness against outlier tasks. Numerical experiments on synthetic and real datasets demonstrate the efficacy of our new methods.  ( 2 min )
    Online False Discovery Rate Control for LORD & SAFFRON Under Positive, Local Dependence. (arXiv:2110.08161v3 [stat.ME] UPDATED)
    Online testing procedures assume that hypotheses are observed in sequence, and allow the significance thresholds for upcoming tests to depend on the test statistics observed so far. Some of the most popular online methods include alpha investing, LORD++ (hereafter, LORD), and SAFFRON. These three methods have been shown to provide online control of the "modified" false discovery rate (mFDR) under a condition known as conditional superuniformity. However, to our knowledge, LORD & SAFFRON have only been shown to control the traditional false discovery rate (FDR) under an independence condition on the test statistics. Our work bolsters these results by showing that SAFFRON and LORD additionally ensure online control of the FDR under a "local" form of nonnegative dependence. Further, FDR control is maintained under certain types of adaptive stopping rules, such as stopping after a certain number of rejections have been observed. Because alpha investing can be recovered as a special case of the SAFFRON framework, our results immediately apply to alpha investing as well. In the process of deriving these results, we also formally characterize how the conditional superuniformity assumption implicitly limits the allowed p-value dependencies. This implicit limitation is important not only to our proposed FDR result, but also to many existing mFDR results.  ( 2 min )
    Gaussian Process Bandit Optimization with Few Batches. (arXiv:2110.07788v3 [stat.ML] UPDATED)
    In this paper, we consider the problem of black-box optimization using Gaussian Process (GP) bandit optimization with a small number of batches. Assuming the unknown function has a low norm in the Reproducing Kernel Hilbert Space (RKHS), we introduce a batch algorithm inspired by batched finite-arm bandit algorithms, and show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$, where the $O^\ast(\cdot)$ notation hides dimension-independent logarithmic factors and $\gamma_T$ is the maximum information gain associated with the kernel. This bound is near-optimal for several kernels of interest and improves on the typical $O^\ast(\sqrt{T}\gamma_T)$ bound, and our approach is arguably the simplest among algorithms attaining this improvement. In addition, in the case of a constant number of batches (not depending on $T$), we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches, focusing on the squared exponential and Mat\'ern kernels. The algorithmic upper bounds are shown to be nearly minimax optimal via analogous algorithm-independent lower bounds.  ( 2 min )
    Random Forests Weighted Local Fr\'echet Regression with Theoretical Guarantee. (arXiv:2202.04912v1 [stat.ML])
    Statistical analysis is increasingly confronted with complex data from general metric spaces, such as symmetric positive definite matrix-valued data and probability distribution functions. [47] and [17] establish a general paradigm of Fr\'echet regression with complex metric space valued responses and Euclidean predictors. However, their proposed local Fr\'echet regression approach involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forests weighted local Fr\'echet regression paradigm. The main mechanism of our approach relies on the adaptive kernels generated by random forests. Our first method utilizes these weights as the local average to solve the Fr\'echet mean, while the second method performs local linear Fr\'echet regression, making both methods locally adaptive. Our proposals significantly improve existing Fr\'echet regression methods. Based on the theory of infinite order U-processes and infinite order Mmn-estimator, we establish the consistency, rate of convergence, and asymptotic normality for our proposed random forests weighted Fr\'echet regression estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our proposed two methods for Fr\'echet regression with several commonly encountered types of responses such as probability distribution functions, symmetric positive definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to the human mortality distribution data.  ( 2 min )
    Estimating the Long-Term Effects of Novel Treatments. (arXiv:2103.08390v2 [econ.EM] UPDATED)
    Policy makers typically face the problem of wanting to estimate the long-term effects of novel treatments, while only having historical data of older treatment options. We assume access to a long-term dataset where only past treatments were administered and a short-term dataset where novel treatments have been administered. We propose a surrogate based approach where we assume that the long-term effect is channeled through a multitude of available short-term proxies. Our work combines three major recent techniques in the causal machine learning literature: surrogate indices, dynamic treatment effect estimation and double machine learning, in a unified pipeline. We show that our method is consistent and provides root-n asymptotically normal estimates under a Markovian assumption on the data and the observational policy. We use a data-set from a major corporation that includes customer investments over a three year period to create a semi-synthetic data distribution where the major qualitative properties of the real dataset are preserved. We evaluate the performance of our method and discuss practical challenges of deploying our formal methodology and how to address them.  ( 2 min )
    PCENet: High Dimensional Deep Surrogate Modeling. (arXiv:2202.05063v1 [cs.LG])
    Learning data representations under uncertainty is an important task that emerges in numerous machine learning applications. However, uncertainty quantification (UQ) techniques are computationally intensive and become prohibitively expensive for high-dimensional data. In this paper, we present a novel surrogate model for representation learning and uncertainty quantification, which aims to deal with data of moderate to high dimensions. The proposed model combines a neural network approach for dimensionality reduction of the (potentially high-dimensional) data, with a surrogate model method for learning the data distribution. We first employ a variational autoencoder (VAE) to learn a low-dimensional representation of the data distribution. We then propose to harness polynomial chaos expansion (PCE) formulation to map this distribution to the output target. The coefficients of PCE are learned from the distribution representation of the training data using a maximum mean discrepancy (MMD) approach. Our model enables us to (a) learn a representation of the data, (b) estimate uncertainty in the high-dimensional data system, and (c) match high order moments of the output distribution; without any prior statistical assumptions on the data. Numerical experimental results are presented to illustrate the performance of the proposed method.  ( 2 min )
    Fair When Trained, Unfair When Deployed: Observable Fairness Measures are Unstable in Performative Prediction Settings. (arXiv:2202.05049v1 [stat.ML])
    Many popular algorithmic fairness measures depend on the joint distribution of predictions, outcomes, and a sensitive feature like race or gender. These measures are sensitive to distribution shift: a predictor which is trained to satisfy one of these fairness definitions may become unfair if the distribution changes. In performative prediction settings, however, predictors are precisely intended to induce distribution shift. For example, in many applications in criminal justice, healthcare, and consumer finance, the purpose of building a predictor is to reduce the rate of adverse outcomes such as recidivism, hospitalization, or default on a loan. We formalize the effect of such predictors as a type of concept shift-a particular variety of distribution shift-and show both theoretically and via simulated examples how this causes predictors which are fair when they are trained to become unfair when they are deployed. We further show how many of these issues can be avoided by using fairness definitions that depend on counterfactual rather than observable outcomes.  ( 2 min )
    Optimal Clustering with Bandit Feedback. (arXiv:2202.04294v1 [cs.LG] CROSS LISTED)
    This paper considers the problem of online clustering with bandit feedback. A set of arms (or items) can be partitioned into various groups that are unknown. Within each group, the observations associated to each of the arms follow the same distribution with the same mean vector. At each time step, the agent queries or pulls an arm and obtains an independent observation from the distribution it is associated to. Subsequent pulls depend on previous ones as well as the previously obtained samples. The agent's task is to uncover the underlying partition of the arms with the least number of arm pulls and with a probability of error not exceeding a prescribed constant $\delta$. The problem proposed finds numerous applications from clustering of variants of viruses to online market segmentation. We present an instance-dependent information-theoretic lower bound on the expected sample complexity for this task, and design a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (BOC). The algorithm includes a novel stopping rule for adaptive sequential testing that circumvents the need to exactly solve any NP-hard weighted clustering problem as its subroutines. We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.  ( 2 min )
    Cooperative Online Learning with Feedback Graphs. (arXiv:2106.04982v3 [cs.LG] UPDATED)
    We study the interplay between feedback and communication in a cooperative online learning setting where a network of agents solves a task in which the learners' feedback is determined by an arbitrary graph. We characterize regret in terms of the independence number of the strong product between the feedback graph and the communication network. Our analysis recovers as special cases many previously known bounds for distributed online learning with either expert or bandit feedback. A more detailed version of our results also captures the dependence of the regret on the delay caused by the time the information takes to traverse each graph. Experiments run on synthetic data show that the empirical behavior of our algorithm is consistent with the theoretical results.  ( 2 min )
    Benign-Overfitting in Conditional Average Treatment Effect Prediction with Linear Regression. (arXiv:2202.05245v1 [econ.EM])
    We study the benign overfitting theory in the prediction of the conditional average treatment effect (CATE), with linear regression models. As the development of machine learning for causal inference, a wide range of large-scale models for causality are gaining attention. One problem is that suspicions have been raised that the large-scale models are prone to overfitting to observations with sample selection, hence the large models may not be suitable for causal prediction. In this study, to resolve the suspicious, we investigate on the validity of causal inference methods for overparameterized models, by applying the recent theory of benign overfitting (Bartlett et al., 2020). Specifically, we consider samples whose distribution switches depending on an assignment rule, and study the prediction of CATE with linear models whose dimension diverges to infinity. We focus on two methods: the T-learner, which based on a difference between separately constructed estimators with each treatment group, and the inverse probability weight (IPW)-learner, which solves another regression problem approximated by a propensity score. In both methods, the estimator consists of interpolators that fit the samples perfectly. As a result, we show that the T-learner fails to achieve the consistency except the random assignment, while the IPW-learner converges the risk to zero if the propensity score is known. This difference stems from that the T-learner is unable to preserve eigenspaces of the covariances, which is necessary for benign overfitting in the overparameterized setting. Our result provides new insights into the usage of causal inference methods in the overparameterizated setting, in particular, doubly robust estimators.  ( 2 min )
    ViViT: Curvature access through the generalized Gauss-Newton's low-rank structure. (arXiv:2106.02624v2 [cs.LG] UPDATED)
    Curvature in form of the Hessian or its generalized Gauss-Newton (GGN) approximation is valuable for algorithms that rely on a local model for the loss to train, compress, or explain deep networks. Existing methods based on implicit multiplication via automatic differentiation or Kronecker-factored block diagonal approximations do not consider noise in the mini-batch. We present ViViT, a curvature model that leverages the GGN's low-rank structure without further approximations. It allows for efficient computation of eigenvalues, eigenvectors, as well as per-sample first- and second-order directional derivatives. The representation is computed in parallel with gradients in one backward pass and offers a fine-grained cost-accuracy trade-off, which allows it to scale. We demonstrate this by conducting performance benchmarks and substantiate ViViT's usefulness by studying the impact of noise on the GGN's structural properties during neural network training.  ( 2 min )
    Nearly Horizon-Free Offline Reinforcement Learning. (arXiv:2103.14077v3 [stat.ML] UPDATED)
    We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDP). For tabular MDP with $S$ states and $A$ actions, or linear MDP with anchor points and feature dimension $d$, given the collected $K$ episodes data with minimum visiting probability of (anchor) state-action pairs $d_m$, we obtain nearly horizon $H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by $1$. Specifically: 1. For offline policy evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} \right)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and does not have additional dependency on $\mathrm{poly}\left(H, S, A, d\right)$ in higher-order term. 2.For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} + \frac{\min(S, d)}{Kd_m}\right)$ sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a high-order term, improving upon the best known result by \cite{cui2020plug} that has additional $\mathrm{poly}\left(H, S, d\right)$ factors in the main term. To the best of our knowledge, these are the \emph{first} set of nearly horizon-free bounds for episodic time-homogeneous offline tabular MDP and linear MDP with anchor points. Central to our analysis is a simple yet effective recursion based method to bound a "total variance" term in the offline scenarios, which could be of individual interest.  ( 2 min )
    Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging. (arXiv:2202.05265v1 [cs.LG])
    Image-to-image regression is an important learning task, used frequently in biological imaging. Current algorithms, however, do not generally offer statistical guarantees that protect against a model's mistakes and hallucinations. To address this, we develop uncertainty quantification techniques with rigorous statistical guarantees for image-to-image regression problems. In particular, we show how to derive uncertainty intervals around each pixel that are guaranteed to contain the true value with a user-specified confidence probability. Our methods work in conjunction with any base machine learning model, such as a neural network, and endow it with formal mathematical guarantees -- regardless of the true unknown data distribution or choice of model. Furthermore, they are simple to implement and computationally inexpensive. We evaluate our procedure on three image-to-image regression tasks: quantitative phase microscopy, accelerated magnetic resonance imaging, and super-resolution transmission electron microscopy of a Drosophila melanogaster brain.  ( 2 min )
    A Theory of Neural Tangent Kernel Alignment and Its Influence on Training. (arXiv:2105.14301v2 [stat.ML] UPDATED)
    The training dynamics and generalization properties of neural networks (NN) can be precisely characterized in function space via the neural tangent kernel (NTK). Structural changes to the NTK during training reflect feature learning and underlie the superior performance of networks outside of the static kernel regime. In this work, we seek to theoretically understand kernel alignment, a prominent and ubiquitous structural change that aligns the NTK with the target function. We first study a toy model of kernel evolution in which the NTK evolves to accelerate training and show that alignment naturally emerges from this demand. We then study alignment mechanism in deep linear networks and two layer ReLU networks. These theories provide good qualitative descriptions of kernel alignment and specialization in practical networks and identify factors in network architecture and data structure that drive kernel alignment. In nonlinear networks with multiple outputs, we identify the phenomenon of kernel specialization, where the kernel function for each output head preferentially aligns to its own target function. Together, our results provide a mechanistic explanation of how kernel alignment emerges during NN training and a normative explanation of how it benefits training.  ( 2 min )
    Towards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte Carlo. (arXiv:2202.05214v1 [math.ST])
    For the task of sampling from a density $\pi \propto \exp(-V)$ on $\mathbb{R}^d$, where $V$ is possibly non-convex but $L$-gradient Lipschitz, we prove that averaged Langevin Monte Carlo outputs a sample with $\varepsilon$-relative Fisher information after $O( L^2 d^2/\varepsilon^2)$ iterations. This is the sampling analogue of complexity bounds for finding an $\varepsilon$-approximate first-order stationary points in non-convex optimization and therefore constitutes a first step towards the general theory of non-log-concave sampling. We discuss numerous extensions and applications of our result; in particular, it yields a new state-of-the-art guarantee for sampling from distributions which satisfy a Poincar\'e inequality.  ( 2 min )
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v1 [stat.ML])
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent causal bandit algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.  ( 2 min )
    Bayesian Optimisation for Mixed-Variable Inputs using Value Proposals. (arXiv:2202.04832v1 [stat.ML])
    Many real-world optimisation problems are defined over both categorical and continuous variables, yet efficient optimisation methods such asBayesian Optimisation (BO) are not designed tohandle such mixed-variable search spaces. Re-cent approaches to this problem cast the selection of the categorical variables as a bandit problem, operating independently alongside a BO component which optimises the continuous variables. In this paper, we adopt a holistic view and aim to consolidate optimisation of the categorical and continuous sub-spaces under a single acquisition metric. We derive candidates from the ExpectedImprovement criterion, which we call value proposals, and use these proposals to make selections on both the categorical and continuous components of the input. We show that this unified approach significantly outperforms existing mixed-variable optimisation approaches across several mixed-variable black-box optimisation tasks.  ( 2 min )
    Diffusion bridges vector quantized Variational AutoEncoders. (arXiv:2202.04895v1 [stat.ML])
    Vector Quantised-Variational AutoEncoders (VQ-VAE) are generative models based on discrete latent representations of the data, where inputs are mapped to a finite set of learned embeddings.To generate new samples, an autoregressive prior distribution over the discrete states must be trained separately. This prior is generally very complex and leads to very slow generation. In this work, we propose a new model to train the prior and the encoder/decoder networks simultaneously. We build a diffusion bridge between a continuous coded vector and a non-informative prior distribution. The latent discrete states are then given as random functions of these continuous vectors. We show that our model is competitive with the autoregressive prior on the mini-Imagenet dataset and is very efficient in both optimization and sampling. Our framework also extends the standard VQ-VAE and enables end-to-end training.  ( 2 min )
    Dimension Reduction with Prior Information for Knowledge Discovery. (arXiv:2111.13646v2 [stat.ML] UPDATED)
    This paper addresses the problem of mapping high-dimensional data to a low-dimensional space, in the presence of other known features. This problem is ubiquitous in science and engineering as there are often controllable/measurable features in most applications. Furthermore, the discovered features in previous analyses can become the known features in subsequent analyses, repeatedly. To solve this problem, this paper proposes a broad class of methods, which is referred to as conditional multidimensional scaling. An algorithm for optimizing the objective function of conditional multidimensional scaling is also developed. The proposed framework is illustrated with kinship terms, facial expressions, and simulated car-brand perception examples. These examples demonstrate the benefits of the framework for being able to marginalize out the known features to uncover unknown, unanticipated features in the reduced-dimension space and for enabling a repeated, more straightforward knowledge discovery process. Computer codes for this work are available in the open-source cml R package.  ( 2 min )
    Decomposing neural networks as mappings of correlation functions. (arXiv:2202.04925v1 [cond-mat.dis-nn])
    Understanding the functional principles of information processing in deep neural networks continues to be a challenge, in particular for networks with trained and thus non-random weights. To address this issue, we study the mapping between probability distributions implemented by a deep feed-forward network. We characterize this mapping as an iterated transformation of distributions, where the non-linearity in each layer transfers information between different orders of correlation functions. This allows us to identify essential statistics in the data, as well as different information representations that can be used by neural networks. Applied to an XOR task and to MNIST, we show that correlations up to second order predominantly capture the information processing in the internal layers, while the input layer also extracts higher-order correlations from the data. This analysis provides a quantitative and explainable perspective on classification.  ( 2 min )
    Exact Solutions of a Deep Linear Network. (arXiv:2202.04777v1 [stat.ML])
    This work finds the exact solutions to a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that weight decay strongly interacts with the model architecture and can create bad minima in a network with more than $1$ hidden layer, qualitatively different for a network with only $1$ hidden layer. As an application, we also analyze stochastic nets and show that their prediction variance vanishes to zero as the stochasticity, the width, or the depth tends to infinity.  ( 2 min )
    Deconstructing The Inductive Biases Of Hamiltonian Neural Networks. (arXiv:2202.04836v1 [cs.LG])
    Physics-inspired neural networks (NNs), such as Hamiltonian or Lagrangian NNs, dramatically outperform other learned dynamics models by leveraging strong inductive biases. These models, however, are challenging to apply to many real world systems, such as those that don't conserve energy or contain contacts, a common setting for robotics and reinforcement learning. In this paper, we examine the inductive biases that make physics-inspired models successful in practice. We show that, contrary to conventional wisdom, the improved generalization of HNNs is the result of modeling acceleration directly and avoiding artificial complexity from the coordinate system, rather than symplectic structure or energy conservation. We show that by relaxing the inductive biases of these models, we can match or exceed performance on energy-conserving systems while dramatically improving performance on practical, non-conservative systems. We extend this approach to constructing transition models for common Mujoco environments, showing that our model can appropriately balance inductive biases with the flexibility required for model-based control.  ( 2 min )
    Understanding Value Decomposition Algorithms in Deep Cooperative Multi-Agent Reinforcement Learning. (arXiv:2202.04868v1 [cs.LG])
    Value function decomposition is becoming a popular rule of thumb for scaling up multi-agent reinforcement learning (MARL) in cooperative games. For such a decomposition rule to hold, the assumption of the individual-global max (IGM) principle must be made; that is, the local maxima on the decomposed value function per every agent must amount to the global maximum on the joint value function. This principle, however, does not have to hold in general. As a result, the applicability of value decomposition algorithms is concealed and their corresponding convergence properties remain unknown. In this paper, we make the first effort to answer these questions. Specifically, we introduce the set of cooperative games in which the value decomposition methods find their validity, which is referred as decomposable games. In decomposable games, we theoretically prove that applying the multi-agent fitted Q-Iteration algorithm (MA-FQI) will lead to an optimal Q-function. In non-decomposable games, the estimated Q-function by MA-FQI can still converge to the optimum under the circumstance that the Q-function needs projecting into the decomposable function space at each iteration. In both settings, we consider value function representations by practical deep neural networks and derive their corresponding convergence rates. To summarize, our results, for the first time, offer theoretical insights for MARL practitioners in terms of when value decomposition algorithms converge and why they perform well.  ( 2 min )
    Understanding Rare Spurious Correlations in Neural Networks. (arXiv:2202.05189v1 [cs.LG])
    Neural networks are known to use spurious correlations for classification; for example, they commonly use background information to classify objects. But how many examples does it take for a network to pick up these correlations? This is the question that we empirically investigate in this work. We introduce spurious patterns correlated with a specific class to a few examples and find that it takes only a handful of such examples for the network to pick up on the spurious correlation. Through extensive experiments, we show that (1) spurious patterns with a larger $\ell_2$ norm are learnt to correlate with the specified class more easily; (2) network architectures that are more sensitive to the input are more susceptible to learning these rare spurious correlations; (3) standard data deletion methods, including incremental retraining and influence functions, are unable to forget these rare spurious correlations through deleting the examples that cause these spurious correlations to be learnt. Code available at https://github.com/yangarbiter/rare-spurious-correlation.  ( 2 min )
    Transfer-Learning Across Datasets with Different Input Dimensions: An Algorithm and Analysis for the Linear Regression Case. (arXiv:2202.05069v1 [stat.ML])
    With the development of new sensors and monitoring devices, more sources of data become available to be used as inputs for machine learning models. These can on the one hand help to improve the accuracy of a model. On the other hand however, combining these new inputs with historical data remains a challenge that has not yet been studied in enough detail. In this work, we propose a transfer-learning algorithm that combines the new and the historical data, that is especially beneficial when the new data is scarce. We focus the approach on the linear regression case, which allows us to conduct a rigorous theoretical study on the benefits of the approach. We show that our approach is robust against negative transfer-learning, and we confirm this result empirically with real and simulated data.  ( 2 min )
    Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory. (arXiv:2202.04970v1 [stat.ML])
    Off-Policy Evaluation (OPE) serves as one of the cornerstones in Reinforcement Learning (RL). Fitted Q Evaluation (FQE) with various function approximators, especially deep neural networks, has gained practical success. While statistical analysis has proved FQE to be minimax-optimal with tabular, linear and several nonparametric function families, its practical performance with more general function approximator is less theoretically understood. We focus on FQE with general differentiable function approximators, making our theory applicable to neural function approximations. We approach this problem using the Z-estimation theory and establish the following results: The FQE estimation error is asymptotically normal with explicit variance determined jointly by the tangent space of the function class at the ground truth, the reward structure, and the distribution shift due to off-policy learning; The finite-sample FQE error bound is dominated by the same variance term, and it can also be bounded by function class-dependent divergence, which measures how the off-policy distribution shift intertwines with the function approximator. In addition, we study bootstrapping FQE estimators for error distribution inference and estimating confidence intervals, accompanied by a Cramer-Rao lower bound that matches our upper bounds. The Z-estimation analysis provides a generalizable theoretical framework for studying off-policy estimation in RL and provides sharp statistical theory for FQE with differentiable function approximators.  ( 2 min )
    Settling the Communication Complexity for Distributed Offline Reinforcement Learning. (arXiv:2202.04862v1 [stat.ML])
    We study a novel setting in offline reinforcement learning (RL) where a number of distributed machines jointly cooperate to solve the problem but only one single round of communication is allowed and there is a budget constraint on the total number of information (in terms of bits) that each machine can send out. For value function prediction in contextual bandits, and both episodic and non-episodic MDPs, we establish information-theoretic lower bounds on the minimax risk for distributed statistical estimators; this reveals the minimum amount of communication required by any offline RL algorithms. Specifically, for contextual bandits, we show that the number of bits must scale at least as $\Omega(AC)$ to match the centralised minimax optimal rate, where $A$ is the number of actions and $C$ is the context dimension; meanwhile, we reach similar results in the MDP settings. Furthermore, we develop learning algorithms based on least-squares estimates and Monte-Carlo return estimates and provide a sharp analysis showing that they can achieve optimal risk up to logarithmic factors. Additionally, we also show that temporal difference is unable to efficiently utilise information from all available devices under the single-round communication setting due to the initial bias of this method. To our best knowledge, this paper presents the first minimax lower bounds for distributed offline RL problems.  ( 2 min )
    Case-based reasoning for rare events prediction on strategic sites. (arXiv:2202.04891v1 [cs.AI])
    Satellite imagery is now widely used in the defense sector for monitoring locations of interest. Although the increasing amount of data enables pattern identification and therefore prediction, carrying this task manually is hardly feasible. We hereby propose a cased-based reasoning approach for automatic prediction of rare events on strategic sites. This method allows direct incorporation of expert knowledge, and is adapted to irregular time series and small-size datasets. Experiments are carried out on two use-cases using real satellite images: the prediction of submarines arrivals and departures from a naval base, and the forecasting of imminent rocket launches on two space bases. The proposed method significantly outperforms a random selection of reference cases on these challenging applications, showing its strong potential.  ( 2 min )
    Heterogeneous Calibration: A post-hoc model-agnostic framework for improved generalization. (arXiv:2202.04837v1 [stat.ML])
    We introduce the notion of heterogeneous calibration that applies a post-hoc model-agnostic transformation to model outputs for improving AUC performance on binary classification tasks. We consider overconfident models, whose performance is significantly better on training vs test data and give intuition onto why they might under-utilize moderately effective simple patterns in the data. We refer to these simple patterns as heterogeneous partitions of the feature space and show theoretically that perfectly calibrating each partition separately optimizes AUC. This gives a general paradigm of heterogeneous calibration as a post-hoc procedure by which heterogeneous partitions of the feature space are identified through tree-based algorithms and post-hoc calibration techniques are applied to each partition to improve AUC. While the theoretical optimality of this framework holds for any model, we focus on deep neural networks (DNNs) and test the simplest instantiation of this paradigm on a variety of open-source datasets. Experiments demonstrate the effectiveness of this framework and the future potential for applying higher-performing partitioning schemes along with more effective calibration techniques.  ( 2 min )
    Augmenting Neural Networks with Priors on Function Values. (arXiv:2202.04798v1 [cs.LG])
    The need for function estimation in label-limited settings is common in the natural sciences. At the same time, prior knowledge of function values is often available in these domains. For example, data-free biophysics-based models can be informative on protein properties, while quantum-based computations can be informative on small molecule properties. How can we coherently leverage such prior knowledge to help improve a neural network model that is quite accurate in some regions of input space -- typically near the training data -- but wildly wrong in other regions? Bayesian neural networks (BNN) enable the user to specify prior information only on the neural network weights, not directly on the function values. Moreover, there is in general no clear mapping between these. Herein, we tackle this problem by developing an approach to augment BNNs with prior information on the function values themselves. Our probabilistic approach yields predictions that rely more heavily on the prior information when the epistemic uncertainty is large, and more heavily on the neural network when the epistemic uncertainty is small.  ( 2 min )
    Probabilistic learning inference of boundary value problem with uncertainties based on Kullback-Leibler divergence under implicit constraints. (arXiv:2202.05112v1 [stat.ML])
    In a first part, we present a mathematical analysis of a general methodology of a probabilistic learning inference that allows for estimating a posterior probability model for a stochastic boundary value problem from a prior probability model. The given targets are statistical moments for which the underlying realizations are not available. Under these conditions, the Kullback-Leibler divergence minimum principle is used for estimating the posterior probability measure. A statistical surrogate model of the implicit mapping, which represents the constraints, is introduced. The MCMC generator and the necessary numerical elements are given to facilitate the implementation of the methodology in a parallel computing framework. In a second part, an application is presented to illustrate the proposed theory and is also, as such, a contribution to the three-dimensional stochastic homogenization of heterogeneous linear elastic media in the case of a non-separation of the microscale and macroscale. For the construction of the posterior probability measure by using the probabilistic learning inference, in addition to the constraints defined by given statistical moments of the random effective elasticity tensor, the second-order moment of the random normalized residue of the stochastic partial differential equation has been added as a constraint. This constraint guarantees that the algorithm seeks to bring the statistical moments closer to their targets while preserving a small residue.  ( 2 min )
    Generalization Bounds via Convex Analysis. (arXiv:2202.04985v1 [stat.ML])
    Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of $p$-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.  ( 2 min )
    Learning Latent Causal Dynamics. (arXiv:2202.04828v1 [stat.ML])
    One critical challenge of time-series modeling is how to learn and quickly correct the model under unknown distribution shifts. In this work, we propose a principled framework, called LiLY, to first recover time-delayed latent causal variables and identify their relations from measured temporal data under different distribution shifts. The correction step is then formulated as learning the low-dimensional change factors with a few samples from the new environment, leveraging the identified causal structure. Specifically, the framework factorizes unknown distribution shifts into transition distribution changes caused by fixed dynamics and time-varying latent causal relations, and by global changes in observation. We establish the identifiability theories of nonparametric latent causal dynamics from their nonlinear mixtures under fixed dynamics and under changes. Through experiments, we show that time-delayed latent causal influences are reliably identified from observed variables under different distribution changes. By exploiting this modular representation of changes, we can efficiently learn to correct the model under unknown distribution shifts with only a few samples.  ( 2 min )
    Personalized pathology test for Cardio-vascular disease: Approximate Bayesian computation with discriminative summary statistics learning. (arXiv:2010.06465v2 [stat.ME] UPDATED)
    Cardio/cerebrovascular diseases (CVD) have become one of the major health issue in our societies. But recent studies show that the present pathology tests to detect CVD are ineffectual as they do not consider different stages of platelet activation or the molecular dynamics involved in platelet interactions and are incapable to consider inter-individual variability. Here we propose a stochastic platelet deposition model and an inferential scheme to estimate the biologically meaningful model parameters using approximate Bayesian computation with a summary statistic that maximally discriminates between different types of patients. Inferred parameters from data collected on healthy volunteers and different patient types help us to identify specific biological parameters and hence biological reasoning behind the dysfunction for each type of patients. This work opens up an unprecedented opportunity of personalized pathology test for CVD detection and medical treatment.  ( 2 min )
    L0Learn: A Scalable Package for Sparse Learning using L0 Regularization. (arXiv:2202.04820v1 [cs.LG])
    We introduce L0Learn: an open-source package for sparse regression and classification using L0 regularization. L0Learn implements scalable, approximate algorithms, based on coordinate descent and local combinatorial optimization. The package is built using C++ and has a user-friendly R interface. Our experiments indicate that L0Learn can scale to problems with millions of features, achieving competitive run times with state-of-the-art sparse learning packages. L0Learn is available on both CRAN and GitHub.  ( 2 min )
    Smoothed Online Learning is as Easy as Statistical Learning. (arXiv:2202.04690v1 [stat.ML])
    Much of modern learning theory has been split between two regimes: the classical \emph{offline} setting, where data arrive independently, and the \emph{online} setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner.  ( 2 min )
    Bayesian Nonparametrics for Offline Skill Discovery. (arXiv:2202.04675v1 [cs.LG])
    Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO .  ( 2 min )
    A Coupled CP Decomposition for Principal Components Analysis of Symmetric Networks. (arXiv:2202.04719v1 [stat.ML])
    In a number of application domains, one observes a sequence of network data; for example, repeated measurements between users interactions in social media platforms, financial correlation networks over time, or across subjects, as in multi-subject studies of brain connectivity. One way to analyze such data is by stacking networks into a third-order array or tensor. We propose a principal components analysis (PCA) framework for sequence network data, based on a novel decomposition for semi-symmetric tensors. We derive efficient algorithms for computing our proposed "Coupled CP" decomposition and establish estimation consistency of our approach under an analogue of the spiked covariance model with rates the same as the matrix case up to a logarithmic term. Our framework inherits many of the strengths of classical PCA and is suitable for a wide range of unsupervised learning tasks, including identifying principal networks, isolating meaningful changepoints or outliers across observations, and for characterizing the "variability network" of the most varying edges. Finally, we demonstrate the effectiveness of our proposal on simulated data and on examples from political science and financial economics. The proof techniques used to establish our main consistency results are surprisingly straight-forward and may find use in a variety of other matrix and tensor decomposition problems.  ( 2 min )
    Online Learning to Transport via the Minimal Selection Principle. (arXiv:2202.04732v1 [cs.LG])
    Motivated by robust dynamic resource allocation in operations research, we study the Online Learning to Transport (OLT) problem where the decision variable is a probability measure, an infinite-dimensional object. We draw connections between online learning, optimal transport, and partial differential equations through an insight called the minimal selection principle, originally studied in the Wasserstein gradient flow setting by Ambrosio et al. (2005). This allows us to extend the standard online learning framework to the infinite-dimensional setting seamlessly. Based on our framework, we derive a novel method called the minimal selection or exploration (MSoE) algorithm to solve OLT problems using mean-field approximation and discretization techniques. In the displacement convex setting, the main theoretical message underpinning our approach is that minimizing transport cost over time (via the minimal selection principle) ensures optimal cumulative regret upper bounds. On the algorithmic side, our MSoE algorithm applies beyond the displacement convex setting, making the mathematical theory of optimal transport practically relevant to non-convex settings common in dynamic resource allocation.  ( 2 min )
    A survey of unsupervised learning methods for high-dimensional uncertainty quantification in black-box-type problems. (arXiv:2202.04648v1 [cs.LG])
    Constructing surrogate models for uncertainty quantification (UQ) on complex partial differential equations (PDEs) having inherently high-dimensional $\mathcal{O}(10^{\ge 2})$ stochastic inputs (e.g., forcing terms, boundary conditions, initial conditions) poses tremendous challenges. The curse of dimensionality can be addressed with suitable unsupervised learning techniques used as a pre-processing tool to encode inputs onto lower-dimensional subspaces while retaining its structural information and meaningful properties. In this work, we review and investigate thirteen dimension reduction methods including linear and nonlinear, spectral, blind source separation, convex and non-convex methods and utilize the resulting embeddings to construct a mapping to quantities of interest via polynomial chaos expansions (PCE). We refer to the general proposed approach as manifold PCE (m-PCE), where manifold corresponds to the latent space resulting from any of the studied dimension reduction methods. To investigate the capabilities and limitations of these methods we conduct numerical tests for three physics-based systems (treated as black-boxes) having high-dimensional stochastic inputs of varying complexity modeled as both Gaussian and non-Gaussian random fields to investigate the effect of the intrinsic dimensionality of input data. We demonstrate both the advantages and limitations of the unsupervised learning methods and we conclude that a suitable m-PCE model provides a cost-effective approach compared to alternative algorithms proposed in the literature, including recently proposed expensive deep neural network-based surrogates and can be readily applied for high-dimensional UQ in stochastic PDEs.  ( 2 min )
    Robust Bayesian Inference for Simulator-based Models via the MMD Posterior Bootstrap. (arXiv:2202.04744v1 [stat.ME])
    Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. They are often used to describe complex real-world phenomena, and as such can often be misspecified in practice. Unfortunately, existing Bayesian approaches for simulators are known to perform poorly in those cases. In this paper, we propose a novel algorithm based on the posterior bootstrap and maximum mean discrepancy estimators. This leads to a highly-parallelisable Bayesian inference algorithm with strong robustness properties. This is demonstrated through an in-depth theoretical study which includes generalisation bounds and proofs of frequentist consistency and robustness of our posterior. The approach is then assessed on a range of examples including a g-and-k distribution and a toggle-switch model.  ( 2 min )

  • Open

    [P] DeepETA: How Uber Predicts Arrival Times Using Deep Learning
    Link to blog post Geospatial data for deep learning seems to be difficult and often overlooked, but interest seems to be increasing. It's nice to see an attempt at a more modern, general-purpose ETA tool like this, especially the way they use geospatial embeddings and linear transformer. submitted by /u/zrsmithson [link] [comments]
    [Discussion] Are there any good or interesting alternatives to Bert for text classification?
    I've seen that Bert comes in a ton of flavors/sizes but I'd like to know the trade offs between it and T4 (or other text classification capable models) submitted by /u/BearThreat [link] [comments]  ( 1 min )
    [D] Let's compare ML onboarding
    I was given a Macbook Pro (not an M1) and expected to develop locally but train on shared GPU resources. I was pointed to the repositories on Gitlab, told which tools the team used (docker, jupyter, etc), manually given access to datasets that I mounted on my machine, and was left to set things up myself. All in all, it took me about a week to get everything set up and ready to contribute, with ongoing issues over the next few weeks due to poor documentation in the setup phase. One of my issues is that model development is a bit tricky to develop locally and test on SLURM because there are always issues moving from CPU to GPU, and SLURM jobs don't run instantaneously. We also don't use any formal framework like Kubeflow or Weights and Biases experiment tracking. What was your development onboarding experience like? How long did it take you to be able to contribute to the team? What are some of your issues with your current ML stack and workflow? submitted by /u/just_some_music_ [link] [comments]  ( 1 min )
    [D] How does research in deep learning progress?
    Research generally advances through incremental advances (i.e building on top of the work of others). Does this hold true for deep learning research? Are the architecture breakthroughs obvious in retrospect given the knowledge of the field of the time or are they primarily the result of good (and unique) intuition? It seems working on blackbox models makes it hard to build foundational knowledge, which ultimately hinders the progress of the field. submitted by /u/t_montana [link] [comments]  ( 2 min )
    [D] FOMM Paper digest: First Order Motion Model for Image Animation explained, a 5-minute paper summary by Casual GAN Papers
    If you have ever used a face animation app, you have probably interacted with First Order Motion Model. Perhaps the reason that this method became ubiquitous is due to its ability to animate arbitrary objects. Aliaksandr Siarohin and the team from DISI, University of Trento, and Snap leverage a self-supervised approach to learn a specialized keypoint detector for a class of similar objects from a set of videos that warps the source frame according to a motion field from a reference frame. From the birds-eye view, the pipeline works like this: first, a set of keypoints is predicted for each of the two frames along with local affine transforms around the keypoints (this was the most confusing part for me, luckily we will cover it in detail later in the post). This information from two frames is combined to predict the motion field that tells where each pixel in the source frame should move to line up with the driving frame along with an occlusion mask that shows the image areas that need to be inpainted. As for the details. Let’s dive in, and learn, shall we? Full summary: https://t.me/casual_gan/259 Blog post: https://www.casualganpapers.com/self-supervised-image-animation-image-driving/First-Order-Motion-Model-explained.html First Order Motion Model: arxiv / code Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    [R] Fuzzy Logic
    Hello, Below is an article for reporting my literature review regarding Fuzzy Logic. I hope you will enjoy reading it; please let me know if you have any feedback. https://medium.com/mlearning-ai/fuzzy-logic-3201d20370b9 submitted by /u/fgokmenoglu [link] [comments]
    [P] EvoJAX: Hardware-Accelerated Neuroevolution
    submitted by /u/sensetime [link] [comments]  ( 1 min )
    [D] Path interference in decision trees
    Hello, Here is a proposal for improving prediction accuracy in decision trees. In an analogy to physics, the method bears some resemblance to the interference of individual particles in the double-slit experiment. Interfering Paths in Decision Trees Feedback and questions are welcome. submitted by /u/crispub [link] [comments]  ( 1 min )
    [P] Python-FHEz Fully Homomorphically Encrypted Deep Learning Library
    Hey ML im a PhD candidate in encrypted deep learning and I thought since I have seen a lot of other related posts, about similar technologies I thought I might mention my own Python-FHEz library. - You can find Python-FHEz here: https://gitlab.com/deepcypher/python-fhez - the documentation can be found with Jupyter notebook example (Fashion-MNIST) here: https://python-fhez.readthedocs.io - The recent Python-FHEz / FHE + DL paper we have submitted (awaiting review) https://arxiv.org/abs/2110.13638 Python-FHEz is specifically an encrypted deep learning library. That is it implements many helpful abstractions for FHE like encryptable NumPy custom containers, so that you can quickly and easily encrypt and use data as if it was in NumPy. Not only this but it implements FHE compatible neural…  ( 5 min )
    [R] Pyramid Vision Transformers
    I'm doing my Master's research in object detection and I'm using Pyramid Vision Transformer (PVT) as a baseline. Did anyone have tried to visualize the attention maps in PVT? I am thinking of using the attention information to form a new loss function for object detection. submitted by /u/shingekichan1996 [link] [comments]  ( 1 min )
    YOLO + LSTM + DeepSORT? [D]
    Hey all, if this isnt the right place to post pls lmk. I am working on a design project for multi object tracking, our thought was to use a YOLO model for detection fed into an LSTM and then fed into a DeepSORT algorithm to track. does this seem like it could work or is this kind of implementation impossible? submitted by /u/JLeaman99 [link] [comments]
    Thesis topic : Artificial Intelligence/Machine Learning using Galaxy Zoo Data [R] [P] [D]
    Hello, I am a Masters student and I am trying to figure out the topic for my thesis. I want to utilize the data from GalaxyZoo Data. There are already a few galaxy classification models in practice, so I need something different and interesting which has potential to contribute. If you know any practical application which can be achieved by training an ML model using GalaxyZoo dataset please let me know. Thank you. submitted by /u/Spirited_Cancel_6049 [link] [comments]  ( 1 min )
  • Open

    Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks
    submitted by /u/nickb [link] [comments]
  • Open

    Temperature 0. Token Probability of the Numbers is quite high, which means the model is quite confident with these statements.
    submitted by /u/BeginningInfluence55 [link] [comments]  ( 1 min )
    Crashes caused by Tesla Autopilot are piling up, and there are consequences
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    how does Google translate work offline after downloading language pack ?
    What kind of inference are they using to give such decent translation results offline. submitted by /u/abyss_wonderer [link] [comments]
    [Private Beta Launch] Multilingual Speech Recognition service for IVR developers
    Hi Folks, I Just launched my project for private BETA. If you find it interesting please reach out to join our private BETA. https://amazethu.com/all-spoken-languages-matter-for-voicetechnologies/ submitted by /u/toondei [link] [comments]
    The Gretel Epoch #4 - What's new in privacy engineering
    submitted by /u/Repeat-or [link] [comments]
    FOMM Paper digest: First Order Motion Model for Image Animation explained, a 5-minute paper summary by Casual GAN Papers
    If you have ever used a face animation app, you have probably interacted with First Order Motion Model. Perhaps the reason that this method became ubiquitous is due to its ability to animate arbitrary objects. Aliaksandr Siarohin and the team from DISI, University of Trento, and Snap leverage a self-supervised approach to learn a specialized keypoint detector for a class of similar objects from a set of videos that warps the source frame according to a motion field from a reference frame. From the birds-eye view, the pipeline works like this: first, a set of keypoints is predicted for each of the two frames along with local affine transforms around the keypoints (this was the most confusing part for me, luckily we will cover it in detail later in the post). This information from two frames is combined to predict the motion field that tells where each pixel in the source frame should move to line up with the driving frame along with an occlusion mask that shows the image areas that need to be inpainted. As for the details. Let’s dive in, and learn, shall we? Full summary: https://t.me/casual_gan/259 Blog post: https://www.casualganpapers.com/self-supervised-image-animation-image-driving/First-Order-Motion-Model-explained.html First Order Motion Model arxiv / code Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    Deepmind’s Latest Machine Learning Research Automatically Find Inputs That Elicit Harmful Text From Language Models
    Large generative language models (LMs) such as GPT-3 and Gopher have proven the ability to generate high-quality text. However, these models run the danger of producing destructive text. Therefore, they are challenging to deploy due to their potential to hurt people in ways that are nearly impossible to forecast. So many different inputs can lead to a model producing harmful text. As a result, it’s challenging to identify all scenarios in which a model fails before it’s used in the actual world. Continue Reading Paper: https://arxiv.org/pdf/2202.03286.pdf submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Artificial Nightmares : Children of The Corn || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Dead End Discovery Reinforcement Learning A.I. in Healthcare Policy
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    videos of people drowning
    me and two friends are making a drone that supposed to identify people who are drowning, does any one know where to find videos we can "feed" to the ai submitted by /u/zrhz123 [link] [comments]
    Explained: Digital Asset And Taxation Of Digital Asset Around The World
    https://preview.redd.it/iszwtr7e16h81.jpg?width=778&format=pjpg&auto=webp&s=701ef461d0d040890d8701c9eed069072c44952d The tradition of the Finance Minister walking with a briefcase on a budget day with all camera shutters around is a fascinating moment. However, this tradition came to end in 2019, with the Finance Minister Nirmala Sitharaman carrying a red fourfold cloth. In the past two years, the gimmick of carrying the budget in the cloth may have worked for the government. But, in 2022, the announcement of a tax on the digital asset and with no change in the tax slabs and less to minimum relief to the middle-class income group, has brought the discussion back on the dining tables of the majority of the population. Read full Blog: submitted by /u/businessapac [link] [comments]  ( 1 min )
    UC Berkeley Researchers Introduce ‘imodels: A Python Package For Fitting Interpretable Machine Learning Models
    Recent developments in machine learning have resulted in more complicated predictive models, typically at the expense of interpretability. Interpretability is frequently required, especially in high-stakes health, biology, and political science applications. Furthermore, interpretable models aid in various tasks, including detecting errors, exploiting domain knowledge, and accelerating inference. Despite recent breakthroughs in the formulation and fitting of interpretable models, implementations are frequently challenging to locate, utilize, and compare. imodels solves this void by offering a single interface and implementation for a wide range of state-of-the-art interpretable modeling techniques, especially rule-based methods. imodels is basically a Python tool for predictive modeling that is simple, transparent, and accurate. It gives users a straightforward way to fit and use state-of-the-art interpretable models, all of which are compatible with scikit-learn (Pedregosa et al., 2011). These models can frequently replace black-box models while boosting interpretability and computing efficiency without compromising forecast accuracy. Continue Reading Paper: https://joss.theoj.org/papers/10.21105/joss.03192.pdf Github: https://github.com/csinva/imodels ​ https://i.redd.it/oubyutsnw3h81.gif submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Thesis topic : Artificial Intelligence/Machine Learning using Galaxy Zoo Data
    Hello, I am a Masters student and I am trying to figure out the topic for my thesis. I want to utilize the data from GalaxyZoo Data. There are already a few galaxy classification models in practice, so I need something different and interesting which has potential to contribute. If you know any practical application which can be achieved by training an ML model using GalaxyZoo dataset please let me know. Thank you. submitted by /u/Spirited_Cancel_6049 [link] [comments]  ( 1 min )
    Looking for an app or a website
    Hey everyone, new here so not aware if it has ever been asked, but does anyone have an app or a website to generate Art like so Wombo or Deep Dream Generator, but does have any "credits" or "energy" mechanics ? Something just to use whenever needed to look at cool pic ? Thanks in advance ! submitted by /u/Tomori_San [link] [comments]  ( 1 min )
  • Open

    Apply profanity masking in Amazon Translate
    Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. This post shows how you can mask profane words and phrases with a grawlix string (“?$#@$”). Amazon Translate typically chooses clean words for your translation output. But in some situations, you want to prevent words that are commonly […]  ( 4 min )
    How Süddeutsche Zeitung optimized their audio narration process with Amazon Polly
    This is a guest post by Jakob Kohl, a Software Developer at the Süddeutsche Zeitung. Süddeutsche Zeitung is one of the leading quality dailies in Germany when it comes to paid subscriptions and unique users. Its website, SZ.de, reaches more than 15 million monthly unique users as of October 2021. Thanks to smart speakers and […]  ( 5 min )
  • Open

    An International Scientific Challenge for the Diagnosis and Gleason Grading of Prostate Cancer
    Posted by Po-Hsuan Cameron Chen, Software Engineer, Google Health and Maggie Demkin, Program Manager, Kaggle In recent years, machine learning (ML) competitions in health have attracted ML scientists to work together to solve challenging clinical problems. These competitions provide access to relevant data and well-defined problems where experienced data scientists come to compete for solutions and learn new methods. However, a fundamental difficulty in organizing such challenges is obtaining and curating high quality datasets for model development and independent datasets for model evaluation. Importantly, to reduce the risk of bias and to ensure broad applicability of the algorithm, evaluation of the generalisability of resulting algorithms should ideally be performed on multiple indepe…  ( 6 min )
  • Open

    Dead End Discovery Reinforcement Learning A.I. in Healthcare Policy
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Deep Reinforcement Learning algorithm completing Tekken Tag Tournament at highest difficulty level
    submitted by /u/DIAMBRA_AIArena [link] [comments]  ( 2 min )
    I'm gonna have to say it, but my experience with RL so far has been consistently bad
    So, over the last year and a half I've tried to apply (on and off, not continuously) RL to a few non-standard problems I was interested in. I modified mountaincar-v0 a tiny amount (lowered the max speed by 10% so it would need minimum two swings to get up the hill) and was shocked that literally none of the implementations found in the gym leaderboard could still learn it. A few months later I tried to use RL to do some simple balancing problem, but the algorithm constantly found new ways to learn the wrong thing. Yesterday I tried to apply it to yet another problem and ran into the same issues; incredibly slow learning of any meaningful behavior, and a constant battle against it not falling into a local maximum of stupid behavior. The most likely explanation is that I simply don't know what I'm doing, but I figured I'd ask into the round whether this has been the experience of others as well. submitted by /u/wattnurt [link] [comments]  ( 4 min )
    Observation normalization
    I was trying to train an agent on gym robotics environments and noticed that without normalizing the observations, it was taking so long. Is normalizing the observations a popular thing to do in reinforcement learning? submitted by /u/140doritos [link] [comments]  ( 1 min )
    Modelling close loop joints in mujoco
    Hi, i am trying to modify the current gym environment , inverteddoublependulum into a closed-loop environment by adding another cart with poles connecting to the initial cart. However despite adding the connect constraint, the contact between the two joints can still slip off. The XML file is as such : # friction: 1st is sliding friction,x plane 2nd is torsional friction about the xy plane. 3rd is rolling, xz plane <geom name="floor" pos="…  ( 2 min )
    Computer scientists prove why bigger neural networks do better
    submitted by /u/RatonneLaveuse [link] [comments]  ( 1 min )
    Why are pixel inputs stacked / use 4 frames for every observation?
    Every paper that learns a policy from pixel observations seems to follow the Q learning convention for making each timestep a stack of 4 frames, and also uses action repeats with it. In addition, atari wrappers for gym had a max and skip wrapper, which seems to take the max of each pixel over multiple frames, which I don't understand at all. Is there a specific reason for this? I think I heard that its to account for partial observability but 4 timesteps isn't going to change how the environment looks too much so I'm not sure it really adds too much information at all. submitted by /u/junkQn [link] [comments]  ( 2 min )
  • Open

    AI, ML, or DL – Learn what it Means
    When it comes to technology in the world of business, AI and machine learning (ML) are among the commonly discussed terms. Technology has been making inroads continuously, and ignoring its effects and implications would be a perilous choice. Here is what some well-known personalities have had to say about these technologies: “AI doesn’t have to… Read More »AI, ML, or DL – Learn what it Means The post AI, ML, or DL – Learn what it Means appeared first on Data Science Central.  ( 4 min )
    How to Protect Devices from Backdoor Malware That’s Stealing Your Data
    Data theft is harmful to individuals and larger organizations alike.  For individuals, it causes major distress and, typically, long-lasting psychological damage.  This is especially true if a cybercriminal has used the target’s personal information to demand ransom. Or if they managed to get their credit card information. Even after the virus is no longer on… Read More »How to Protect Devices from Backdoor Malware That’s Stealing Your Data The post How to Protect Devices from Backdoor Malware That’s Stealing Your Data appeared first on Data Science Central.  ( 4 min )
    Microsoft Office 365 Backup: What Business Data to Back up?
    As of Q3 of 2021, over a million organizations rely on Microsoft 365 subscription plans to conduct their daily business tasks. The popularity of Microsoft’s productivity and collaboration suite has been attracting malicious actors and cybercriminals who try to breach organizations worldwide. Because of this, Microsoft 365 adopters should seek out a comprehensive Microsoft 365… Read More »Microsoft Office 365 Backup: What Business Data to Back up? The post Microsoft Office 365 Backup: What Business Data to Back up? appeared first on Data Science Central.  ( 4 min )
  • Open

    A Practical Approach to Proper Inference with Linked Data. (arXiv:1810.01538v2 [stat.ME] UPDATED)
    Entity resolution (ER), comprising record linkage and de-duplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to as the \emph{downstream task}. Additionally, incorporating uncertainty from ER in the downstream task is critical to ensure proper inference. To bridge the gap between ER and the downstream task in an analysis pipeline, we propose five methods to choose a representative (or canonical) record from linked data, referred to as canonicalization. Our methods are scalable in the number of records, appropriate in general data scenarios, and provide natural error propagation via a Bayesian canonicalization stage. The proposed methodology is evaluated on three simulated data sets and one application -- determining the relationship between demographic information and party affiliation in voter registration data from the North Carolina State Board of Elections. We first perform Bayesian ER and evaluate our proposed methods for canonicalization before considering the downstream tasks of linear and logistic regression. Bayesian canonicalization methods are empirically shown to improve downstream inference in both settings through prediction and coverage.  ( 2 min )
    BayesFlow Can Reliably Detect Model Misspecification and Posterior Errors in Amortized Bayesian Inference. (arXiv:2112.08866v2 [stat.ME] UPDATED)
    Neural density estimators have proven remarkably powerful in performing efficient simulation-based Bayesian inference in various research domains. In particular, the BayesFlow framework uses a two-step approach to enable amortized parameter estimation in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations are poor representations of reality? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the latent data space and utilize maximum mean discrepancy (MMD) to detect potentially catastrophic misspecifications during inference undermining the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase as a function of the distance between the true data-generating distribution and the typical set of simulations in the latent summary space. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized Bayesian inference.  ( 2 min )
    Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning. (arXiv:2004.10888v5 [cs.LG] UPDATED)
    We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite horizon MDP optimizing the variance of a per-step reward random variable. MVPI enjoys great flexibility in that any policy evaluation method and risk-neutral control method can be dropped in for risk-averse control off the shelf, in both on- and off-policy settings. This flexibility reduces the gap between risk-neutral control and risk-averse control and is achieved by working on a novel augmented MDP directly. We propose risk-averse TD3 as an example instantiating MVPI, which outperforms vanilla TD3 and many previous risk-averse control methods in challenging Mujoco robot simulation tasks under a risk-aware performance metric. This risk-averse TD3 is the first to introduce deterministic policies and off-policy learning into risk-averse reinforcement learning, both of which are key to the performance boost we show in Mujoco domains.  ( 2 min )
    pNLP-Mixer: an Efficient all-MLP Architecture for Language. (arXiv:2202.04350v1 [cs.CL])
    Large pre-trained language models drastically changed the natural language processing(NLP) landscape. Nowadays, they represent the go-to framework to tackle diverse NLP tasks, even with a limited number of annotations. However, using those models in production, either in the cloud or at the edge, remains a challenge due to the memory footprint and/or inference costs. As an alternative, recent work on efficient NLP has shown that small weight-efficient models can reach competitive performance at a fraction of the costs. Here, we introduce pNLP-Mixer, an embbedding-free model based on the MLP-Mixer architecture that achieves high weight-efficiency thanks to a novel linguistically informed projection layer. We evaluate our model on two multi-lingual semantic parsing datasets, MTOP and multiATIS. On MTOP our pNLP-Mixer almost matches the performance of mBERT, which has 38 times more parameters, and outperforms the state-of-the-art of tiny models (pQRNN) with 3 times fewer parameters. On a long-sequence classification task (Hyperpartisan) our pNLP-Mixer without pretraining outperforms RoBERTa, which has 100 times more parameters, demonstrating the potential of this architecture.  ( 2 min )
    Agree to Disagree: Diversity through Disagreement for Better Transferability. (arXiv:2202.04414v1 [cs.LG])
    Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features -- present in the training data but absent from the test data -- and (ii) by only leveraging a small subset of predictive features. Such an effect is especially magnified when the test distribution does not exactly match the train distribution -- referred to as the Out of Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to apriori assess if a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, as well as demonstrate in multiple experiments how the proposed method can mitigate shortcut-learning, enhance uncertainty and OOD detection, as well as improve transferability.  ( 2 min )
    SLOSH: Set LOcality Sensitive Hashing via Sliced-Wasserstein Embeddings. (arXiv:2112.05872v2 [cs.LG] UPDATED)
    Learning from set-structured data is an essential problem with many applications in machine learning and computer vision. This paper focuses on non-parametric and data-independent learning from set-structured data using approximate nearest neighbor (ANN) solutions, particularly locality-sensitive hashing. We consider the problem of set retrieval from an input set query. Such retrieval problem requires: 1) an efficient mechanism to calculate the distances/dissimilarities between sets, and 2) an appropriate data structure for fast nearest neighbor search. To that end, we propose Sliced-Wasserstein set embedding as a computationally efficient "set-2-vector" mechanism that enables downstream ANN, with theoretical guarantees. The set elements are treated as samples from an unknown underlying distribution, and the Sliced-Wasserstein distance is used to compare sets. We demonstrate the effectiveness of our algorithm, denoted as Set-LOcality Sensitive Hashing (SLOSH), on various set retrieval datasets and compare our proposed embedding with standard set embedding approaches, including Generalized Mean (GeM) embedding/pooling, Featurewise Sort Pooling (FSPool), and Covariance Pooling and show consistent improvement in retrieval results. The code for replicating our results is available here: \href{https://github.com/mint-vu/SLOSH}{https://github.com/mint-vu/SLOSH}.  ( 2 min )
    A Multimodal Canonical-Correlated Graph Neural Network for Energy-Efficient Speech Enhancement. (arXiv:2202.04528v1 [cs.SD])
    This paper proposes a novel multimodal self-supervised architecture for energy-efficient AV speech enhancement by integrating graph neural networks with canonical correlation analysis (CCA-GNN). This builds on a state-of-the-art CCA-GNN that aims to learn representative embeddings by maximizing the correlation between pairs of augmented views of the same input while decorrelating disconnected features. The key idea of the conventional CCA-GNN involves discarding augmentation-variant information and preserving augmentation-invariant information whilst preventing capturing of redundant information. Our proposed AV CCA-GNN model is designed to deal with the challenging multimodal representation learning context. Specifically, our model improves contextual AV speech processing by maximizing canonical correlation from augmented views of the same channel, as well as canonical correlation from audio and visual embeddings. In addition, we propose a positional encoding of the nodes that considers a prior-frame sequence distance instead of a feature-space representation while computing the node's nearest neighbors. This serves to introduce temporal information in the embeddings through the neighborhood's connectivity. Experiments conducted with the benchmark ChiME3 dataset show that our proposed prior frame-based AV CCA-GNN reinforces better feature learning in the temporal context, leading to more energy-efficient speech reconstruction compared to state-of-the-art CCA-GNN and multi-layer perceptron models. The results demonstrate the potential of our proposed approach for exploitation in future assistive technology and energy-efficient multimodal devices.  ( 2 min )
    A hypothesis-driven method based on machine learning for neuroimaging data analysis. (arXiv:2202.04397v1 [stat.ML])
    There remains an open question about the usefulness and the interpretation of Machine learning (MLE) approaches for discrimination of spatial patterns of brain images between samples or activation states. In the last few decades, these approaches have limited their operation to feature extraction and linear classification tasks for between-group inference. In this context, statistical inference is assessed by randomly permuting image labels or by the use of random effect models that consider between-subject variability. These multivariate MLE-based statistical pipelines, whilst potentially more effective for detecting activations than hypotheses-driven methods, have lost their mathematical elegance, ease of interpretation, and spatial localization of the ubiquitous General linear Model (GLM). Recently, the estimation of the conventional GLM has been demonstrated to be connected to an univariate classification task when the design matrix is expressed as a binary indicator matrix. In this paper we explore the complete connection between the univariate GLM and MLE \emph{regressions}. To this purpose we derive a refined statistical test with the GLM based on the parameters obtained by a linear Support Vector Regression (SVR) in the \emph{inverse} problem (SVR-iGLM). Subsequently, random field theory (RFT) is employed for assessing statistical significance following a conventional GLM benchmark. Experimental results demonstrate how parameter estimations derived from each model (mainly GLM and SVR) result in different experimental design estimates that are significantly related to the predefined functional task. Moreover, using real data from a multisite initiative the proposed MLE-based inference demonstrates statistical power and the control of false positives, outperforming the regular GLM.  ( 2 min )
    Automatic Unstructured Handwashing Recognition using Smartwatch to Reduce Contact Transmission of Pathogens. (arXiv:2107.13405v3 [cs.LG] UPDATED)
    Current guidelines from the World Health Organization indicate that the SARSCoV-2 coronavirus, which results in the novel coronavirus disease (COVID-19), is transmitted through respiratory droplets or by contact. Contact transmission occurs when contaminated hands touch the mucous membrane of the mouth, nose, or eyes. Moreover, pathogens can also be transferred from one surface to another by contaminated hands, which facilitates transmission by indirect contact. Consequently, hands hygiene is extremely important to prevent the spread of the SARSCoV-2 virus. Additionally, hand washing and/or hand rubbing disrupts also the transmission of other viruses and bacteria that cause common colds, flu and pneumonia, thereby reducing the overall disease burden. The vast proliferation of wearable devices, such as smartwatches, containing acceleration, rotation, magnetic field sensors, etc., together with the modern technologies of artificial intelligence, such as machine learning and more recently deep-learning, allow the development of accurate applications for recognition and classification of human activities such as: walking, climbing stairs, running, clapping, sitting, sleeping, etc. In this work we evaluate the feasibility of an automatic system, based on current smartwatches, which is able to recognize when a subject is washing or rubbing its hands, in order to monitor parameters such as frequency and duration, and to evaluate the effectiveness of the gesture. Our preliminary results show a classification accuracy of about 95% and of about 94% for respectively deep and standard learning techniques.  ( 2 min )
    A Hybrid Inference System for Improved Curvature Estimation in the Level-Set Method Using Machine Learning. (arXiv:2104.02951v3 [cs.LG] UPDATED)
    We present a novel hybrid strategy based on machine learning to improve curvature estimation in the level-set method. The proposed inference system couples enhanced neural networks with standard numerical schemes to compute curvature more accurately. The core of our hybrid framework is a switching mechanism that relies on well established numerical techniques to gauge curvature. If the curvature magnitude is larger than a resolution-dependent threshold, it uses a neural network to yield a better approximation. Our networks are multilayer perceptrons fitted to synthetic data sets composed of sinusoidal- and circular-interface samples at various configurations. To reduce data set size and training complexity, we leverage the problem's characteristic symmetry and build our models on just half of the curvature spectrum. These savings lead to a powerful inference system able to outperform any of its numerical or neural component alone. Experiments with stationary, smooth interfaces show that our hybrid solver is notably superior to conventional numerical methods in coarse grids and along steep interface regions. Compared to prior research, we have observed outstanding gains in precision after training the regression model with data pairs from more than a single interface type and transforming data with specialized input preprocessing. In particular, our findings confirm that machine learning is a promising venue for reducing or removing mass loss in the level-set method.  ( 2 min )
    Adapting to Mixing Time in Stochastic Optimization with Markovian Data. (arXiv:2202.04428v1 [cs.LG])
    We consider stochastic optimization problems where data is drawn from a Markov chain. Existing methods for this setting crucially rely on knowing the mixing time of the chain, which in real-world applications is usually unknown. We propose the first optimization method that does not require the knowledge of the mixing time, yet obtains the optimal asymptotic convergence rate when applied to convex problems. We further show that our approach can be extended to: (i) finding stationary points in non-convex optimization with Markovian data, and (ii) obtaining better dependence on the mixing time in temporal difference (TD) learning; in both cases, our method is completely oblivious to the mixing time. Our method relies on a novel combination of multi-level Monte Carlo (MLMC) gradient estimation together with an adaptive learning method.  ( 2 min )
    Augmenting Decision Making via Interactive What-If Analysis. (arXiv:2109.06160v4 [cs.DB] UPDATED)
    The fundamental goal of business data analysis is to improve business decisions using data. Business users often make decisions to achieve key performance indicators (KPIs) such as increasing customer retention or sales, or decreasing costs. To discover the relationship between data attributes hypothesized to be drivers and those corresponding to KPIs of interest, business users currently need to perform lengthy exploratory analyses. This involves considering multitudes of combinations and scenarios and performing slicing, dicing, and transformations on the data accordingly, e.g., analyzing customer retention across quarters of the year or suggesting optimal media channels across strata of customers. However, the increasing complexity of datasets combined with the cognitive limitations of humans makes it challenging to carry over multiple hypotheses, even for simple datasets. Therefore mentally performing such analyses is hard. Existing commercial tools either provide partial solutions or fail to cater to business users altogether. Here we argue for four functionalities to enable business users to interactively learn and reason about the relationships between sets of data attributes thereby facilitating data-driven decision making. We implement these functionalities in SystemD, an interactive visual data analysis system enabling business users to experiment with the data by asking what-if questions. We evaluate the system through three business use cases: marketing mix modeling, customer retention analysis, and deal closing analysis, and report on feedback from multiple business users. Users find the SystemD functionalities highly useful for quick testing and validation of their hypotheses around their KPIs of interest, addressing their unmet analysis needs. The feedback also suggests that the UX design can be enhanced to further improve the understandability of these functionalities.  ( 2 min )
    Understanding the bias-variance tradeoff of Bregman divergences. (arXiv:2202.04167v1 [stat.ML])
    This paper builds upon the work of Pfau (2013), which generalized the bias variance tradeoff to any Bregman divergence loss function. Pfau (2013) showed that for Bregman divergences, the bias and variances are defined with respect to a central label, defined as the expected mean of the label, and a central prediction, of a more complex form. We show that, similarly to the label, the central prediction can be interpreted as the mean of a random variable, where the mean operates in a dual space defined by the loss function itself. Viewing the bias-variance tradeoff through operations taken in dual space, we subsequently derive several results of interest. In particular, (a) the variance terms satisfy a generalized law of total variance; (b) if a source of randomness cannot be controlled, its contribution to the bias and variance has a closed form; (c) there exist natural ensembling operations in the label and prediction spaces which reduce the variance and do not affect the bias.  ( 2 min )
    A decision-tree framework to select optimal box-sizes for product shipments. (arXiv:2202.04277v1 [cs.LG])
    In package-handling facilities, boxes of varying sizes are used to ship products. Improperly sized boxes with box dimensions much larger than the product dimensions create wastage and unduly increase the shipping costs. Since it is infeasible to make unique, tailor-made boxes for each of the $N$ products, the fundamental question that confronts e-commerce companies is: How many $K << N$ cuboidal boxes need to manufactured and what should be their dimensions? In this paper, we propose a solution for the single-count shipment containing one product per box in two steps: (i) reduce it to a clustering problem in the $3$ dimensional space of length, width and height where each cluster corresponds to the group of products that will be shipped in a particular size variant, and (ii) present an efficient forward-backward decision tree based clustering method with low computational complexity on $N$ and $K$ to obtain these $K$ clusters and corresponding box dimensions. Our algorithm has multiple constituent parts, each specifically designed to achieve a high-quality clustering solution. As our method generates clusters in an incremental fashion without discarding the present solution, adding or deleting a size variant is as simple as stopping the backward pass early or executing it for one more iteration. We tested the efficacy of our approach by simulating actual single-count shipments that were transported during a month by Amazon using the proposed box dimensions. Even by just modifying the existing box dimensions and not adding a new size variant, we achieved a reduction of $4.4\%$ in the shipment volume, contributing to the decrease in non-utilized, air volume space by $2.2\%$. The reduction in shipment volume and air volume improved significantly to $10.3\%$ and $6.1\%$ when we introduced $4$ additional boxes.  ( 2 min )
    Intelligent Autonomous Intersection Management. (arXiv:2202.04224v1 [cs.MA])
    Connected Autonomous Vehicles will make autonomous intersection management a reality replacing traditional traffic signal control. Autonomous intersection management requires time and speed adjustment of vehicles arriving at an intersection for collision-free passing through the intersection. Due to its computational complexity, this problem has been studied only when vehicle arrival times towards the vicinity of the intersection are known beforehand, which limits the applicability of these solutions for real-time deployment. To solve the real-time autonomous traffic intersection management problem, we propose a reinforcement learning (RL) based multiagent architecture and a novel RL algorithm coined multi-discount Q-learning. In multi-discount Q-learning, we introduce a simple yet effective way to solve a Markov Decision Process by preserving both short-term and long-term goals, which is crucial for collision-free speed control. Our empirical results show that our RL-based multiagent solution can achieve near-optimal performance efficiently when minimizing the travel time through an intersection.  ( 2 min )
    Generative multitask learning mitigates target-causing confounding. (arXiv:2202.04136v1 [cs.LG])
    We propose a simple and scalable approach to causal representation learning for multitask learning. Our approach requires minimal modification to existing ML systems, and improves robustness to prior probability shift. The improvement comes from mitigating unobserved confounders that cause the targets, but not the input. We refer to them as target-causing confounders. These confounders induce spurious dependencies between the input and targets. This poses a problem for the conventional approach to multitask learning, due to its assumption that the targets are conditionally independent given the input. Our proposed approach takes into account the dependency between the targets in order to alleviate target-causing confounding. All that is required in addition to usual practice is to estimate the joint distribution of the targets to switch from discriminative to generative classification, and to predict all targets jointly. Our results on the Attributes of People and Taskonomy datasets reflect the conceptual improvement in robustness to prior probability shift.  ( 2 min )
    Exploring Structural Sparsity in Neural Image Compression. (arXiv:2202.04595v1 [eess.IV])
    Neural image compression have reached or out-performed traditional methods (such as JPEG, BPG, WebP). However,their sophisticated network structures with cascaded convolution layers bring heavy computational burden for practical deployment. In this paper, we explore the structural sparsity in neural image compression network to obtain real-time acceleration without any specialized hardware design or algorithm. We propose a simple plug-in adaptive binary channel masking(ABCM) to judge the importance of each convolution channel and introduce sparsity during training. During inference, the unimportant channels are pruned to obtain slimmer network and less computation. We implement our method into three neural image compression networks with different entropy models to verify its effectiveness and generalization, the experiment results show that up to 7x computation reduction and 3x acceleration can be achieved with negligible performance drop.  ( 2 min )
    Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations. (arXiv:2202.04582v1 [cs.CL])
    Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations including the inability of modeling word ordering information in documents, the difficulty of incorporating external linguistic knowledge, and the lack of both accurate and efficient inference methods for approximating the intractable posterior. Recently, pretrained language models (PLMs) have brought astonishing performance improvements to a wide variety of tasks due to their superior representations of text. Interestingly, there have not been standard approaches to deploy PLMs for topic discovery as better alternatives to topic models. In this paper, we begin by analyzing the challenges of using PLM representations for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted by coherent and distinctive terms and meanwhile serve as meaningful summaries of the documents. Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets in different domains, our model generates significantly more coherent and diverse topics than strong topic models, and offers better topic-wise document representations, based on both automatic and human evaluations.  ( 2 min )
    Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration. (arXiv:2202.04628v1 [cs.LG])
    A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine grain feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data. The key idea is that by obtaining guidance from - not imitating - the offline data, LOGO orients its policy in the manner of the sub-optimal {policy}, while yet being able to learn beyond and approach optimality. We provide a theoretical analysis of our algorithm, and provide a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach via implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.  ( 2 min )
    Optimising hadronic collider simulations using amplitude neural networks. (arXiv:2202.04506v1 [hep-ph])
    Precision phenomenological studies of high-multiplicity scattering processes at collider experiments present a substantial theoretical challenge and are vitally important ingredients in experimental measurements. Machine learning technology has the potential to dramatically optimise simulations for complicated final states. We investigate the use of neural networks to approximate matrix elements, studying the case of loop-induced diphoton production through gluon fusion. We train neural network models on one-loop amplitudes from the NJet C++ library and interface them with the Sherpa Monte Carlo event generator to provide the matrix element within a realistic hadronic collider simulation. Computing some standard observables with the models and comparing to conventional techniques, we find excellent agreement in the distributions and a reduced total simulation time by a factor of thirty.  ( 2 min )
    Classification with Nearest Disjoint Centroids. (arXiv:2109.10436v2 [cs.LG] UPDATED)
    In this paper, we develop a new classification method based on nearest centroid, and it is called the nearest disjoint centroid classifier. Our method differs from the nearest centroid classifier in the following two aspects: (1) the centroids are defined based on disjoint subsets of features instead of all the features, and (2) the distance is induced by the dimensionality-normalized norm instead of the Euclidean norm. We provide a few theoretical results regarding our method. In addition, we propose a simple algorithm based on adapted k-means clustering that can find the disjoint subsets of features used in our method, and extend the algorithm to perform feature selection. We evaluate and compare the performance of our method to other classification methods on both simulated data and real-world gene expression datasets. The results demonstrate that our method is able to outperform other competing classifiers by having smaller misclassification rates and/or using fewer features in various settings and situations.  ( 2 min )
    Deep Augmented MUSIC Algorithm for Data-Driven DoA Estimation. (arXiv:2109.10581v3 [eess.SP] UPDATED)
    Direction of arrival (DoA) estimation is a crucial task in sensor array signal processing, giving rise to various successful model-based (MB) algorithms as well as recently developed data-driven (DD) methods. This paper introduces a new hybrid MB/DD DoA estimation architecture, based on the classical multiple signal classification (MUSIC) algorithm. Our approach augments crucial aspects of the original MUSIC structure with specifically designed neural architectures, allowing it to overcome certain limitations of the purely MB method, such as its inability to successfully localize coherent sources. The deep augmented MUSIC algorithm is shown to outperform its unaltered version with a superior resolution.  ( 2 min )
    Supervised Multivariate Learning with Simultaneous Feature Auto-grouping and Dimension Reduction. (arXiv:2112.09746v2 [stat.ML] UPDATED)
    Modern high-dimensional methods often adopt the "bet on sparsity" principle, while in supervised multivariate learning statisticians may face "dense" problems with a large number of nonzero coefficients. This paper proposes a novel clustered reduced-rank learning (CRL) framework that imposes two joint matrix regularizations to automatically group the features in constructing predictive factors. CRL is more interpretable than low-rank modeling and relaxes the stringent sparsity assumption in variable selection. In this paper, new information-theoretical limits are presented to reveal the intrinsic cost of seeking for clusters, as well as the blessing from dimensionality in multivariate learning. Moreover, an efficient optimization algorithm is developed, which performs subspace learning and clustering with guaranteed convergence. The obtained fixed-point estimators, though not necessarily globally optimal, enjoy the desired statistical accuracy beyond the standard likelihood setup under some regularity conditions. Moreover, a new kind of information criterion, as well as its scale-free form, is proposed for cluster and rank selection, and has a rigorous theoretical support without assuming an infinite sample size. Extensive simulations and real-data experiments demonstrate the statistical accuracy and interpretability of the proposed method.  ( 2 min )
    Characteristic Neural Ordinary Differential Equations. (arXiv:2111.13207v2 [cs.LG] UPDATED)
    We propose Characteristic Neural Ordinary Differential Equations (C-NODEs), a framework for extending Neural Ordinary Differential Equations (NODEs) beyond ODEs. While NODEs model the evolution of the latent state as the solution to an ODE, the proposed C-NODE models the evolution of the latent state as the solution of a family of first-order quasi-linear partial differential equations (PDE) on their characteristics, defined as curves along which the PDEs reduce to ODEs. The reduction, in turn, allows the application of the standard frameworks for solving ODEs to PDE settings. Additionally, the proposed framework can be cast as an extension of existing NODE architectures, thereby allowing the use of existing black-box ODE solvers. We prove that the C-NODE framework extends the classical NODE by exhibiting functions that cannot be represented by NODEs but are representable by C-NODEs. We further investigate the efficacy of the C-NODE framework by demonstrating its performance in many synthetic and real data scenarios. Empirical results demonstrate the improvements provided by the proposed method for CIFAR-10, SVHN, and MNIST datasets under a similar computational budget as the existing NODE methods.  ( 2 min )
    Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning. (arXiv:2201.13425v2 [cs.LG] UPDATED)
    Recent progress in deep learning has relied on access to large and diverse datasets. Such data-driven progress has been less evident in offline reinforcement learning (RL), because offline RL data is usually collected to optimize specific target tasks limiting the data's diversity. In this work, we propose Exploratory data for Offline RL (ExORL), a data-centric approach to offline RL. ExORL first generates data with unsupervised reward-free exploration, then relabels this data with a downstream reward before training a policy with offline RL. We find that exploratory data allows vanilla off-policy RL algorithms, without any offline-specific modifications, to outperform or match state-of-the-art offline RL algorithms on downstream tasks. Our findings suggest that data generation is as important as algorithmic advances for offline RL and hence requires careful consideration from the community. Code and data can be found at https://github.com/denisyarats/exorl .  ( 2 min )
    InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training. (arXiv:2202.03751v1 [eess.AS] CROSS LISTED)
    Denoising diffusion probabilistic models (diffusion models for short) require a large number of iterations in inference to achieve the generation quality that matches or surpasses the state-of-the-art generative models, which invariably results in slow inference speed. Previous approaches aim to optimize the choice of inference schedule over a few iterations to speed up inference. However, this results in reduced generation quality, mainly because the inference process is optimized separately, without jointly optimizing with the training process. In this paper, we propose InferGrad, a diffusion model for vocoder that incorporates inference process into training, to reduce the inference iterations while maintaining high generation quality. More specifically, during training, we generate data from random noise through a reverse process under inference schedules with a few iterations, and impose a loss to minimize the gap between the generated and ground-truth data samples. Then, unlike existing approaches, the training of InferGrad considers the inference process. The advantages of InferGrad are demonstrated through experiments on the LJSpeech dataset showing that InferGrad achieves better voice quality than the baseline WaveGrad under same conditions while maintaining the same voice quality as the baseline but with $3$x speedup ($2$ iterations for InferGrad vs $6$ iterations for WaveGrad).  ( 2 min )
    Comprehensive survey of computational learning methods for analysis of gene expression data in genomics. (arXiv:2202.02958v2 [q-bio.GN] UPDATED)
    Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of the gene expression data. However, more complex analysis for classification and discovery of feature genes or sample observations requires sophisticated computational approaches. In this review, we compile various statistical and computational tools used in analysis of expression microarray data. Even though, the methods are discussed in the context of expression microarray data, they can also be applied for the analysis of RNA sequencing or quantitative proteomics datasets. We specifically discuss methods for missing value (gene expression) imputation, feature gene scaling, selection and extraction of features for dimensionality reduction, and learning and analysis of expression data. We discuss the types of missing values and the methods and approaches usually employed in their imputation. We also discuss methods of data transformation and feature scaling viz. normalization and standardization. Various approaches used in feature selection and extraction are also reviewed. Lastly, learning and analysis methods including class comparison, class prediction, and class discovery along with their evaluation parameters are described in detail. We have described the process of generation of a microarray gene expression data along with advantages and limitations of the above-mentioned techniques. We believe that this detailed review will help the users to select appropriate methods based on the type of data and the expected outcome.  ( 2 min )
    Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection. (arXiv:2111.10639v2 [cs.SD] UPDATED)
    In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio. In these instances, the performance of tasks such as keyword-spotting (KWS) and device-directed speech detection (DDD) can degrade significantly. To address this problem, we propose an implicit acoustic echo cancellation (iAEC) framework where a neural network is trained to exploit the additional information from a reference microphone channel to learn to ignore the interfering signal and improve detection performance. We study this framework for the tasks of KWS and DDD on, respectively, an augmented version of Google Speech Commands v2 and a real-world Alexa device dataset. Notably, we show a 56% reduction in false-reject rate for the DDD task during device playback conditions. We also show comparable or superior performance over a strong end-to-end neural echo cancellation + KWS baseline for the KWS task with an order of magnitude less computational requirements.  ( 2 min )
    High-dimensional multi-trait GWAS by reverse prediction of genotypes. (arXiv:2111.00108v2 [q-bio.GN] UPDATED)
    Multi-trait genome-wide association studies (GWAS) use multi-variate statistical methods to identify associations between genetic variants and multiple correlated traits simultaneously, and have higher statistical power than independent univariate analyses of traits. Reverse regression, where genotypes of genetic variants are regressed on multiple traits simultaneously, has emerged as a promising approach to perform multi-trait GWAS in high-dimensional settings where the number of traits exceeds the number of samples. We analyzed different machine learning methods (ridge regression, naive Bayes/independent univariate, random forests and support vector machines) for reverse regression in multi-trait GWAS, using genotypes, gene expression data and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains to evaluate methods. We found that genotype prediction performance, in terms of root mean squared error (RMSE), allowed to distinguish between genomic regions with high and low transcriptional activity. Moreover, model feature coefficients correlated with the strength of association between variants and individual traits, and were predictive of true trans-eQTL target genes, with complementary findings across methods. Code to reproduce the analysis is available at https://github.com/michoel-lab/Reverse-Pred-GWAS  ( 2 min )
    Unifying Likelihood-free Inference with Black-box Optimization and Beyond. (arXiv:2110.03372v2 [cs.LG] UPDATED)
    Black-box optimization formulations for biological sequence design have drawn recent attention due to their promising potential impact on the pharmaceutical industry. In this work, we propose to unify two seemingly distinct worlds: likelihood-free inference and black-box optimization, under one probabilistic framework. In tandem, we provide a recipe for constructing various sequence design methods based on this framework. We show how previous optimization approaches can be "reinvented" in our framework, and further propose new probabilistic black-box optimization algorithms. Extensive experiments on sequence design application illustrate the benefits of the proposed methodology.  ( 2 min )
    GBK-GNN: Gated Bi-Kernel Graph Neural Networks for Modeling Both Homophily and Heterophily. (arXiv:2110.15777v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are widely used on a variety of graph-based machine learning tasks. For node-level tasks, GNNs have strong power to model the homophily property of graphs (i.e., connected nodes are more similar) while their ability to capture the heterophily property is often doubtful. This is partially caused by the design of the feature transformation with the same kernel for the nodes in the same hop and the followed aggregation operator. One kernel cannot model the similarity and the dissimilarity (i.e., the positive and negative correlation) between node features simultaneously even though we use attention mechanisms like Graph Attention Network (GAT), since the weight calculated by attention is always a positive value. In this paper, we propose a novel GNN model based on a bi-kernel feature transformation and a selection gate. Two kernels capture homophily and heterophily information respectively, and the gate is introduced to select which kernel we should use for the given node pairs. We conduct extensive experiments on various datasets with different homophily-heterophily properties. The experimental results show consistent and significant improvements against state-of-the-art GNN methods.  ( 2 min )
    Approximation error of single hidden layer neural networks with fixed weights. (arXiv:2202.03289v2 [cs.LG] UPDATED)
    This paper provides an explicit formula for the approximation error of single hidden layer neural networks with two fixed weights.  ( 2 min )
    Proximal Iteration for Deep Reinforcement Learning. (arXiv:2112.05848v2 [cs.LG] UPDATED)
    We employ Proximal Iteration for value-function optimization in deep reinforcement learning. Proximal Iteration is a computationally efficient technique that enables biasing the optimization procedure towards desirable solutions. As a concrete application, we endow the objective function of Deep Q-Network (DQN) and Rainbow agents with a proximal term to ensure robustness in presence of large noise. The resultant agents, which we call DQN Pro and Rainbow Pro, exhibit significant improvements over their original counterparts on the Atari benchmark. Our results accentuate the power of employing sound optimization techniques for deep reinforcement learning.  ( 2 min )
    LENAS: Learning-based Neural Architecture Search and Ensemble for 3D Radiotherapy Dose Prediction. (arXiv:2106.06733v3 [eess.IV] UPDATED)
    Radiation therapy treatment planning is a complex process, as the target dose prescription and normal tissue sparing are conflicting objectives. In order to reduce human planning time efforts and improve the quality of treatment planning, knowledge-based planning (KBP) is in high demand. In this study, we propose a novel learning-based ensemble approach, named LENAS, which integrates neural architecture search (NAS) with knowledge distillation for 3D radiotherapy dose prediction. Specifically, the prediction network first exhaustively searches each block from an enormous architecture space. Then, multiple architectures with promising performance and a large diversity are selected. To reduce the inference time, we adopt the teacher-student paradigm by treating the combination of diverse outputs from multiple learned networks as supervisions to guide the student network training. In addition, we apply adversarial learning to optimize the student network to recover the knowledge in teacher networks. To the best of our knowledge, this is the first attempt to investigate NAS and knowledge distillation in ensemble learning, especially in the field of medical image analysis. The proposed method has been evaluated on two public datasets, i.e., the OpenKBP and AIMIS dataset. Extensive experimental results demonstrate the effectiveness of our method and its superior performance to the state-of-the-art methods. In addition, several in-depth analysis and empirical guidelines are derived for ensemble learning.  ( 2 min )
    Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated Exploration. (arXiv:2107.08966v3 [cs.LG] UPDATED)
    Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns than intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of exploration and exploitation policies.  ( 2 min )
    Learning Temporally Causal Latent Processes from General Temporal Data. (arXiv:2110.05428v4 [stat.ML] UPDATED)
    Our goal is to recover time-delayed latent causal variables and identify their relations from measured temporal data. Estimating causally-related latent variables from observations is particularly challenging as the latent variables are not uniquely recoverable in the most general case. In this work, we consider both a nonparametric, nonstationary setting and a parametric setting for the latent processes and propose two provable conditions under which temporally causal latent processes can be identified from their nonlinear mixtures. We propose LEAP, a theoretically-grounded framework that extends Variational AutoEncoders (VAEs) by enforcing our conditions through proper constraints in causal process prior. Experimental results on various datasets demonstrate that temporally causal latent processes are reliably identified from observed variables under different dependency structures and that our approach considerably outperforms baselines that do not properly leverage history or nonstationarity information. This demonstrates that using temporal information to learn latent processes from their invertible nonlinear mixtures in an unsupervised manner, for which we believe our work is one of the first, seems promising even without sparsity or minimality assumptions.  ( 2 min )
    A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization. (arXiv:2110.08923v2 [cs.LG] UPDATED)
    We study entropy-regularized constrained Markov decision processes (CMDPs) under the soft-max parameterization, in which an agent aims to maximize the entropy-regularized value function while satisfying constraints on the expected total utility. By leveraging the entropy regularization, our theoretical analysis shows that its Lagrangian dual function is smooth and the Lagrangian duality gap can be decomposed into the primal optimality gap and the constraint violation. Furthermore, we propose an accelerated dual-descent method for entropy-regularized CMDPs. We prove that our method achieves the global convergence rate $\widetilde{\mathcal{O}}(1/T)$ for both the optimality gap and the constraint violation for entropy-regularized CMDPs. A discussion about a linear convergence rate for CMDPs with a single constraint is also provided.  ( 2 min )
    Recurrent Model-Free RL can be a Strong Baseline for Many POMDPs. (arXiv:2110.05038v2 [cs.LG] UPDATED)
    Many problems in RL, such as meta RL, robust RL, and generalization in RL, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory-based architectures, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisions can often yield a recurrent model-free implementation that performs on par with (and occasionally substantially better than) more sophisticated recent techniques. We compare to 19 environments from 5 prior specialized methods and find that our implementation achieves greater sample efficiency and asymptotic performance than these methods on 16/19 environments. We also release a simple and efficient implementation of recurrent model-free RL for future work to use as a baseline for POMDPs.  ( 2 min )
    SwiftAgg: Communication-Efficient and Dropout-Resistant Secure Aggregation for Federated Learning with Worst-Case Security Guarantees. (arXiv:2202.04169v1 [cs.IT])
    We propose SwiftAgg, a novel secure aggregation protocol for federated learning systems, where a central server aggregates local models of $N$ distributed users, each of size $L$, trained on their local data, in a privacy-preserving manner. Compared with state-of-the-art secure aggregation protocols, SwiftAgg significantly reduces the communication overheads without any compromise on security. Specifically, in presence of at most $D$ dropout users, SwiftAgg achieves a users-to-server communication load of $(T+1)L$ and a users-to-users communication load of up to $(N-1)(T+D+1)L$, with a worst-case information-theoretic security guarantee, against any subset of up to $T$ semi-honest users who may also collude with the curious server. The key idea of SwiftAgg is to partition the users into groups of size $D+T+1$, then in the first phase, secret sharing and aggregation of the individual models are performed within each group, and then in the second phase, model aggregation is performed on $D+T+1$ sequences of users across the groups. If a user in a sequence drops out in the second phase, the rest of the sequence remain silent. This design allows only a subset of users to communicate with each other, and only the users in a single group to directly communicate with the server, eliminating the requirements of 1) all-to-all communication network across users; and 2) all users communicating with the server, for other secure aggregation protocols. This helps to substantially slash the communication costs of the system.  ( 2 min )
    Managers versus Machines: Do Algorithms Replicate Human Intuition in Credit Ratings?. (arXiv:2202.04218v1 [econ.EM])
    We use machine learning techniques to investigate whether it is possible to replicate the behavior of bank managers who assess the risk of commercial loans made by a large commercial US bank. Even though a typical bank already relies on an algorithmic scorecard process to evaluate risk, bank managers are given significant latitude in adjusting the risk score in order to account for other holistic factors based on their intuition and experience. We show that it is possible to find machine learning algorithms that can replicate the behavior of the bank managers. The input to the algorithms consists of a combination of standard financials and soft information available to bank managers as part of the typical loan review process. We also document the presence of significant heterogeneity in the adjustment process that can be traced to differences across managers and industries. Our results highlight the effectiveness of machine learning based analytic approaches to banking and the potential challenges to high-skill jobs in the financial sector.  ( 2 min )
    Deep Neural Networks to Correct Sub-Precision Errors in CFD. (arXiv:2202.04233v1 [physics.flu-dyn])
    Loss of information in numerical simulations can arise from various sources while solving discretized partial differential equations. In particular, precision-related errors can accumulate in the quantities of interest when the simulations are performed using low-precision 16-bit floating-point arithmetic compared to an equivalent 64-bit simulation. Here, low-precision computation requires much lower resources than high-precision computation. Several machine learning (ML) techniques proposed recently have been successful in correcting the errors arising from spatial discretization. In this work, we extend these techniques to improve Computational Fluid Dynamics (CFD) simulations performed using low numerical precision. We first quantify the precision related errors accumulated in a Kolmogorov forced turbulence test case. Subsequently, we employ a Convolutional Neural Network together with a fully differentiable numerical solver performing 16-bit arithmetic to learn a tightly-coupled ML-CFD hybrid solver. Compared to the 16-bit solver, we demonstrate the efficacy of the ML-CFD hybrid solver towards reducing the error accumulation in the velocity field and improving the kinetic energy spectrum at higher frequencies.  ( 2 min )
    FedBABU: Towards Enhanced Representation for Federated Image Classification. (arXiv:2106.06042v2 [cs.LG] UPDATED)
    Federated learning has evolved to improve a single global model under data heterogeneity (as a curse) or to develop multiple personalized models using data heterogeneity (as a blessing). However, little research has considered both directions simultaneously. In this paper, we first investigate the relationship between them by analyzing Federated Averaging at the client level and determine that a better federated global model performance does not constantly improve personalization. To elucidate the cause of this personalization performance degradation problem, we decompose the entire network into the body (extractor), which is related to universality, and the head (classifier), which is related to personalization. We then point out that this problem stems from training the head. Based on this observation, we propose a novel federated learning algorithm, coined FedBABU, which only updates the body of the model during federated training (i.e., the head is randomly initialized and never updated), and the head is fine-tuned for personalization during the evaluation process. Extensive experiments show consistent performance improvements and an efficient personalization of FedBABU. The code is available at https://github.com/jhoon-oh/FedBABU.  ( 2 min )
    Universal Approximation Under Constraints is Possible with Transformers. (arXiv:2110.03303v2 [cs.LG] UPDATED)
    Many practical problems need the output of a machine learning model to satisfy a set of constraints, $K$. Nevertheless, there is no known guarantee that classical neural network architectures can exactly encode constraints while simultaneously achieving universality. We provide a quantitative constrained universal approximation theorem which guarantees that for any non-convex compact set $K$ and any continuous function $f:\mathbb{R}^n\rightarrow K$, there is a probabilistic transformer $\hat{F}$ whose randomized outputs all lie in $K$ and whose expected output uniformly approximates $f$. Our second main result is a "deep neural version" of Berge's Maximum Theorem (1963). The result guarantees that given an objective function $L$, a constraint set $K$, and a family of soft constraint sets, there is a probabilistic transformer $\hat{F}$ that approximately minimizes $L$ and whose outputs belong to $K$; moreover, $\hat{F}$ approximately satisfies the soft constraints. Our results imply the first universal approximation theorem for classical transformers with exact convex constraint satisfaction. They also yield that a chart-free universal approximation theorem for Riemannian manifold-valued functions subject to suitable geodesically convex constraints.  ( 2 min )
    Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process. (arXiv:2110.10234v3 [cs.SE] UPDATED)
    The introduction of machine learning (ML) components in software projects has created the need for software engineers to collaborate with data scientists and other specialists. While collaboration can always be challenging, ML introduces additional challenges with its exploratory model development process, additional skills and knowledge needed, difficulties testing ML systems, need for continuous evolution and monitoring, and non-traditional quality requirements such as fairness and explainability. Through interviews with 45 practitioners from 28 organizations, we identified key collaboration challenges that teams face when building and deploying ML systems into production. We report on common collaboration points in the development of production ML systems for requirements, data, and integration, as well as corresponding team patterns and challenges. We find that most of these challenges center around communication, documentation, engineering, and process and collect recommendations to address these challenges.  ( 2 min )
    Rethinking Adam: A Twofold Exponential Moving Average Approach. (arXiv:2106.11514v3 [cs.LG] UPDATED)
    Adaptive gradient methods, e.g. \textsc{Adam}, have achieved tremendous success in machine learning. Scaling the learning rate element-wisely by a certain form of second moment estimate of gradients, such methods are able to attain rapid training of modern deep neural networks. Nevertheless, they are observed to suffer from compromised generalization ability compared with stochastic gradient descent (\textsc{SGD}) and tend to be trapped in local minima at an early stage during training. Intriguingly, we discover that substituting the gradient in the second raw moment estimate term with its momentumized version in \textsc{Adam} can resolve the issue. The intuition is that gradient with momentum contains more accurate directional information and therefore its second moment estimation is a more favorable option for learning rate scaling than that of the raw gradient. Thereby we propose \textsc{AdaMomentum} as a new optimizer reaching the goal of training fast while generalizing much better. We further develop a theory to back up the improvement in generalization and provide convergence guarantees under both convex and nonconvex settings. Extensive experiments on a wide range of tasks and models demonstrate that \textsc{AdaMomentum} exhibits state-of-the-art performance and superior training stability consistently.  ( 2 min )
    Improving Fairness via Federated Learning. (arXiv:2110.15545v2 [cs.LG] UPDATED)
    Recently, lots of algorithms have been proposed for learning a fair classifier from decentralized data. However, many theoretical and algorithmic questions remain open. First, is federated learning necessary, i.e., can we simply train locally fair classifiers and aggregate them? In this work, we first propose a new theoretical framework, with which we demonstrate that federated learning can strictly boost model fairness compared with such non-federated algorithms. We then theoretically and empirically show that the performance tradeoff of FedAvg-based fair learning algorithms is strictly worse than that of a fair classifier trained on centralized data. To bridge this gap, we propose FedFB, a private fair learning algorithm on decentralized data. The key idea is to modify the FedAvg protocol so that it can effectively mimic the centralized fair learning. Our experimental results show that FedFB significantly outperforms existing approaches, sometimes matching the performance of the centrally trained model.  ( 2 min )
    Independent mechanism analysis, a new concept?. (arXiv:2106.05200v3 [stat.ML] UPDATED)
    Independent component analysis provides a principled framework for unsupervised representation learning, with solid theory on the identifiability of the latent code that generated the data, given only observations of mixtures thereof. Unfortunately, when the mixing is nonlinear, the model is provably nonidentifiable, since statistical independence alone does not sufficiently constrain the problem. Identifiability can be recovered in settings where additional, typically observed variables are included in the generative process. We investigate an alternative path and consider instead including assumptions reflecting the principle of independent causal mechanisms exploited in the field of causality. Specifically, our approach is motivated by thinking of each source as independently influencing the mixing process. This gives rise to a framework which we term independent mechanism analysis. We provide theoretical and empirical evidence that our approach circumvents a number of nonidentifiability issues arising in nonlinear blind source separation.  ( 2 min )
    On Margin Maximization in Linear and ReLU Networks. (arXiv:2110.02732v3 [cs.LG] UPDATED)
    The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem. In this paper, we study this question in detail, for several neural network architectures involving linear and ReLU activations. Perhaps surprisingly, we show that in many cases, the KKT point is not even a local optimum of the max margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed.  ( 2 min )
    Distributionally Robust Deep Learning using Hardness Weighted Sampling. (arXiv:2001.02658v3 [cs.LG] UPDATED)
    Limiting failures of machine learning systems is of paramount importance for safety-critical applications. In order to improve the robustness of machine learning systems, Distributionally Robust Optimization (DRO) has been proposed as a generalization of Empirical Risk Minimization (ERM). However, its use in deep learning has been severely restricted due to the relative inefficiency of the optimizers available for DRO in comparison to the wide-spread variants of Stochastic Gradient Descent (SGD) optimizers for ERM. We propose SGD with hardness weighted sampling, a principled and efficient optimization method for DRO in machine learning that is particularly suited in the context of deep learning. Similar to a hard example mining strategy in practice, the proposed algorithm is straightforward to implement and computationally as efficient as SGD-based optimizers used for deep learning, requiring minimal overhead computation. In contrast to typical ad hoc hard mining approaches, we prove the convergence of our DRO algorithm for over-parameterized deep learning networks with ReLU activation and a finite number of layers and parameters. Our experiments on fetal brain 3D MRI segmentation and brain tumor segmentation in MRI demonstrate the feasibility and the usefulness of our approach. Using our hardness weighted sampling for training a state-of-the-art deep learning pipeline leads to improved robustness to anatomical variabilities in automatic fetal brain 3D MRI segmentation using deep learning and to improved robustness to the image protocol variations in brain tumor segmentation. Our code is available at https://github.com/LucasFidon/HardnessWeightedSampler.  ( 2 min )
    Uncertainty in Data-Driven Kalman Filtering for Partially Known State-Space Models. (arXiv:2110.04738v2 [eess.SP] UPDATED)
    Providing a metric of uncertainty alongside a state estimate is often crucial when tracking a dynamical system. Classic state estimators, such as the Kalman filter (KF), provide a time-dependent uncertainty measure from knowledge of the underlying statistics, however, deep learning based tracking systems struggle to reliably characterize uncertainty. In this paper, we investigate the ability of KalmanNet, a recently proposed hybrid model-based deep state tracking algorithm, to estimate an uncertainty measure. By exploiting the interpretable nature of KalmanNet, we show that the error covariance matrix can be computed based on its internal features, as an uncertainty measure. We demonstrate that when the system dynamics are known, KalmanNet-which learns its mapping from data without access to the statistics-provides uncertainty similar to that provided by the KF; and while in the presence of evolution model-mismatch, KalmanNet pro-vides a more accurate error estimation.  ( 2 min )
    Rare event estimation using stochastic spectral embedding. (arXiv:2106.05824v2 [cs.LG] UPDATED)
    Estimating the probability of rare failure events is an essential step in the reliability assessment of engineering systems. Computing this failure probability for complex non-linear systems is challenging, and has recently spurred the development of active-learning reliability methods. These methods approximate the limit-state function (LSF) using surrogate models trained with a sequentially enriched set of model evaluations. A recently proposed method called stochastic spectral embedding (SSE) aims to improve the local approximation accuracy of global, spectral surrogate modelling techniques by sequentially embedding local residual expansions in subdomains of the input space. In this work we apply SSE to the LSF, giving rise to a stochastic spectral embedding-based reliability (SSER) method. The resulting partition of the input space decomposes the failure probability into a set of easy-to-compute \rev{conditional} failure probabilities. We propose a set of modifications that tailor the algorithm to efficiently solve rare event estimation problems. These modifications include specialized refinement domain selection, partitioning and enrichment strategies. We showcase the algorithm performance on four benchmark problems of various dimensionality and complexity in the LSF.  ( 2 min )
    AirNet: Neural Network Transmission over the Air. (arXiv:2105.11166v3 [cs.NI] UPDATED)
    State-of-the-art performance for many emerging edge applications is achieved by deep neural networks (DNNs). Often, the employed DNNs are location- and time-dependent, and the parameters of a specific DNN must be delivered from an edge server to the edge device rapidly and efficiently to carry out time-sensitive inference tasks. This can be considered as a joint source-channel coding (JSCC) problem, in which the goal is not to recover the DNN coefficients with the minimal distortion, but in a manner that provides the highest accuracy in the downstream task. For this purpose we introduce AirNet, a novel training and analog transmission method to deliver DNNs over the air. We first train the DNN with noise injection to counter the wireless channel noise. We also employ pruning to identify the most significant DNN parameters that can be delivered within the available channel bandwidth, knowledge distillation, and non-linear bandwidth expansion to provide better error protection for the most important network parameters. We show that AirNet achieves significantly higher test accuracy compared to the separation-based alternative, and exhibits graceful degradation with channel quality.  ( 2 min )
    Balanced softmax cross-entropy for incremental learning with and without memory. (arXiv:2103.12532v4 [cs.LG] UPDATED)
    When incrementally trained on new classes, deep neural networks are subject to catastrophic forgetting which leads to an extreme deterioration of their performance on the old classes while learning the new ones. Using a small memory containing few samples from past classes has shown to be an effective method to mitigate catastrophic forgetting. However, due to the limited size of the replay memory, there is a large imbalance between the number of samples for the new and the old classes in the training dataset resulting in bias in the final model. To address this issue, we propose to use the Balanced Softmax Cross-Entropy and show that it can be seamlessly combined with state-of-the-art approaches for class-incremental learning in order to improve their accuracy while also potentially decreasing the computational cost of the training procedure. We further extend this approach to the more demanding class-incremental learning without memory setting and achieve competitive results with memory-based approaches. Experiments on the challenging ImageNet, ImageNet-Subset and CIFAR100 benchmarks with various settings demonstrate the benefits of our approach.  ( 2 min )
    Universal Hopfield Networks: A General Framework for Single-Shot Associative Memory Models. (arXiv:2202.04557v1 [cs.NE])
    A large number of neural network models of associative memory have been proposed in the literature. These include the classical Hopfield networks (HNs), sparse distributed memories (SDMs), and more recently the modern continuous Hopfield networks (MCHNs), which possesses close links with self-attention in machine learning. In this paper, we propose a general framework for understanding the operation of such memory networks as a sequence of three operations: similarity, separation, and projection. We derive all these memory models as instances of our general framework with differing similarity and separation functions. We extend the mathematical framework of Krotov et al (2020) to express general associative memory models using neural network dynamics with only second-order interactions between neurons, and derive a general energy function that is a Lyapunov function of the dynamics. Finally, using our framework, we empirically investigate the capacity of using different similarity functions for these associative memory models, beyond the dot product similarity measure, and demonstrate empirically that Euclidean or Manhattan distance similarity metrics perform substantially better in practice on many tasks, enabling a more robust retrieval and higher memory capacity than existing models.  ( 2 min )
    Frank-Wolfe with a Nearest Extreme Point Oracle. (arXiv:2102.02029v2 [math.OC] UPDATED)
    We consider variants of the classical Frank-Wolfe algorithm for constrained smooth convex minimization, that instead of access to the standard oracle for minimizing a linear function over the feasible set, have access to an oracle that can find an extreme point of the feasible set that is closest in Euclidean distance to a given vector. We first show that for many feasible sets of interest, such an oracle can be implemented with the same complexity as the standard linear optimization oracle. We then show that with such an oracle we can design new Frank-Wolfe variants which enjoy significantly improved complexity bounds in case the set of optimal solutions lies in the convex hull of a subset of extreme points with small diameter (e.g., a low-dimensional face of a polytope). In particular, for many $0\text{--}1$ polytopes, under quadratic growth and strict complementarity conditions, we obtain the first linearly convergent variant with rate that depends only on the dimension of the optimal face and not on the ambient dimension.  ( 2 min )
    Sharper Rates for Separable Minimax and Finite Sum Optimization via Primal-Dual Extragradient Methods. (arXiv:2202.04640v1 [math.OC])
    We design accelerated algorithms with improved rates for several fundamental classes of optimization problems. Our algorithms all build upon techniques related to the analysis of primal-dual extragradient methods via relative Lipschitzness proposed recently by [CST21]. (1) Separable minimax optimization. We study separable minimax optimization problems $\min_x \max_y f(x) - g(y) + h(x, y)$, where $f$ and $g$ have smoothness and strong convexity parameters $(L^x, \mu^x)$, $(L^y, \mu^y)$, and $h$ is convex-concave with a $(\Lambda^{xx}, \Lambda^{xy}, \Lambda^{yy})$-blockwise operator norm bounded Hessian. We provide an algorithm with gradient query complexity $\tilde{O}\left(\sqrt{\frac{L^{x}}{\mu^{x}}} + \sqrt{\frac{L^{y}}{\mu^{y}}} + \frac{\Lambda^{xx}}{\mu^{x}} + \frac{\Lambda^{xy}}{\sqrt{\mu^{x}\mu^{y}}} + \frac{\Lambda^{yy}}{\mu^{y}}\right)$. Notably, for convex-concave minimax problems with bilinear coupling (e.g.\ quadratics), where $\Lambda^{xx} = \Lambda^{yy} = 0$, our rate matches a lower bound of [ZHZ19]. (2) Finite sum optimization. We study finite sum optimization problems $\min_x \frac{1}{n}\sum_{i\in[n]} f_i(x)$, where each $f_i$ is $L_i$-smooth and the overall problem is $\mu$-strongly convex. We provide an algorithm with gradient query complexity $\tilde{O}\left(n + \sum_{i\in[n]} \sqrt{\frac{L_i}{n\mu}} \right)$. Notably, when the smoothness bounds $\{L_i\}_{i\in[n]}$ are non-uniform, our rate improves upon accelerated SVRG [LMH15, FGKS15] and Katyusha [All17] by up to a $\sqrt{n}$ factor. (3) Minimax finite sums. We generalize our algorithms for minimax and finite sum optimization to solve a natural family of minimax finite sum optimization problems at an accelerated rate, encapsulating both above results up to a logarithmic factor.  ( 2 min )
    Sampling possible reconstructions of undersampled acquisitions in MR imaging. (arXiv:2010.00042v3 [eess.IV] UPDATED)
    Undersampling the k-space during MR acquisitions saves time, however results in an ill-posed inversion problem, leading to an infinite set of images as possible solutions. Traditionally, this is tackled as a reconstruction problem by searching for a single "best" image out of this solution set according to some chosen regularization or prior. This approach, however, misses the possibility of other solutions and hence ignores the uncertainty in the inversion process. In this paper, we propose a method that instead returns multiple images which are possible under the acquisition model and the chosen prior to capture the uncertainty in the inversion process. To this end, we introduce a low dimensional latent space and model the posterior distribution of the latent vectors given the acquisition data in k-space, from which we can sample in the latent space and obtain the corresponding images. We use a variational autoencoder for the latent model and the Metropolis adjusted Langevin algorithm for the sampling. We evaluate our method on two datasets; with images from the Human Connectome Project and in-house measured multi-coil images. We compare to five alternative methods. Results indicate that the proposed method produces images that match the measured k-space data better than the alternatives, while showing realistic structural variability. Furthermore, in contrast to the compared methods, the proposed method yields higher uncertainty in the undersampled phase encoding direction, as expected. Keywords: Magnetic Resonance image reconstruction, uncertainty estimation, inverse problems, sampling, MCMC, deep learning, unsupervised learning.  ( 2 min )
    Polynomial-Time Exact MAP Inference on Discrete Models with Global Dependencies. (arXiv:1912.12090v3 [cs.DM] UPDATED)
    Considering the worst-case scenario, junction tree algorithm remains the most general solution for exact MAP inference with polynomial run-time guarantees. Unfortunately, its main tractability assumption requires the treewidth of a corresponding MRF to be bounded strongly limiting the range of admissible applications. In fact, many practical problems in the area of structured prediction require modelling of global dependencies by either directly introducing global factors or enforcing global constraints on the prediction variables. That, however, always results in a fully-connected graph making exact inference by means of this algorithm intractable. Previous work [1]-[4] focusing on the problem of loss-augmented inference has demonstrated how efficient inference can be performed on models with specific global factors representing non-decomposable loss functions within the training regime of SSVMs. In this paper, we extend the framework for an efficient exact inference proposed in in [3] by allowing much finer interactions between the energy of the core model and the sufficient statistics of the global terms with no additional computation costs. We demonstrate the usefulness of our method in several use cases, including one that cannot be handled by any of the previous approaches. Finally, we propose a new graph transformation technique via node cloning which ensures a polynomial run-time for solving our target problem independently of the form of a corresponding clique tree. This is important for the efficiency of the main algorithm and greatly improves upon the theoretical guarantees of the previous works.  ( 2 min )
    Policy gradient methods for distortion risk measures. (arXiv:2107.04422v4 [cs.LG] UPDATED)
    We propose policy-gradient algorithms for solving the problem of control in a risk-sensitive reinforcement learning context. The objective of our algorithms is to maximize the distortion risk measure (DRM) of the cumulative reward in an episodic Markov decision process. We derive a variant of the policy gradient theorem that caters to the DRM objective. Using this theorem in conjunction with a likelihood ratio-based gradient estimation scheme, we propose policy gradient algorithms for optimizing DRM in both on-policy and off-policy RL settings. We derive non-asymptotic bounds that establish the convergence of our algorithms to an approximate stationary point of the DRM objective.  ( 2 min )
    Optimal Hyperparameters and Structure Setting of Multi-Objective Robust CNN Systems via Generalized Taguchi Method and Objective Vector Norm. (arXiv:2202.04567v1 [cs.LG])
    Recently, Machine Learning (ML), Artificial Intelligence (AI), and Convolutional Neural Network (CNN) have made huge progress with broad applications, where their systems have deep learning structures and a large number of hyperparameters that determine the quality and performance of the CNNs and AI systems. These systems may have multi-objective ML and AI performance needs. There is a key requirement to find the optimal hyperparameters and structures for multi-objective robust optimal CNN systems. This paper proposes a generalized Taguchi approach to effectively determine the optimal hyperparameters and structure for the multi-objective robust optimal CNN systems via their objective performance vector norm. The proposed approach and methods are applied to a CNN classification system with the original ResNet for CIFAR-10 dataset as a demonstration and validation, which shows the proposed methods are highly effective to achieve an optimal accuracy rate of the original ResNet on CIFAR-10.  ( 2 min )
    AutoTune: Controller Tuning for High-Speed Flight. (arXiv:2103.10698v2 [cs.RO] UPDATED)
    Due to noisy actuation and external disturbances, tuning controllers for high-speed flight is very challenging. In this paper, we ask the following questions: How sensitive are controllers to tuning when tracking high-speed maneuvers? What algorithms can we use to automatically tune them? To answer the first question, we study the relationship between parameters and performance and find out that the faster the maneuver, the more sensitive a controller becomes to its parameters. To answer the second question, we review existing methods for controller tuning and discover that prior works often perform poorly on the task of high-speed flight. Therefore, we propose AutoTune, a sampling-based tuning algorithm specifically tailored to high-speed flight. In contrast to previous work, our algorithm does not assume any prior knowledge of the drone or its optimization function and can deal with the multi-modal characteristics of the parameters' optimization space. We thoroughly evaluate AutoTune both in simulation and in the physical world. In our experiments, we outperform existing tuning algorithms by up to 90% in trajectory completion. The resulting controllers are tested in the AirSim Game of Drones competition, where we outperform the winner by up to 25% in lap-time. Finally, we show that AutoTune improves tracking error when flying a physical platform with respect to parameters tuned by a human expert.  ( 2 min )
    Policy Optimization with Stochastic Mirror Descent. (arXiv:1906.10462v5 [cs.LG] UPDATED)
    Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes $\mathtt{VRMPO}$ algorithm: a sample efficient policy gradient method with stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best sample complexity for policy optimization. The extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms the state-of-the-art policy gradient methods in various settings.  ( 2 min )
    Offline Reinforcement Learning with Realizability and Single-policy Concentrability. (arXiv:2202.04634v1 [cs.LG])
    Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.  ( 2 min )
    End-to-End Blind Quality Assessment for Laparoscopic Videos using Neural Networks. (arXiv:2202.04517v1 [eess.IV])
    Video quality assessment is a challenging problem having a critical significance in the context of medical imaging. For instance, in laparoscopic surgery, the acquired video data suffers from different kinds of distortion that not only hinder surgery performance but also affect the execution of subsequent tasks in surgical navigation and robotic surgeries. For this reason, we propose in this paper neural network-based approaches for distortion classification as well as quality prediction. More precisely, a Residual Network (ResNet) based approach is firstly developed for simultaneous ranking and classification task. Then, this architecture is extended to make it appropriate for the quality prediction task by using an additional Fully Connected Neural Network (FCNN). To train the overall architecture (ResNet and FCNN models), transfer learning and end-to-end learning approaches are investigated. Experimental results, carried out on a new laparoscopic video quality database, have shown the efficiency of the proposed methods compared to recent conventional and deep learning based approaches.  ( 2 min )
    MODS -- A USV-oriented object detection and obstacle segmentation benchmark. (arXiv:2105.02359v2 [cs.CV] UPDATED)
    Small-sized unmanned surface vehicles (USV) are coastal water devices with a broad range of applications such as environmental control and surveillance. A crucial capability for autonomous operation is obstacle detection for timely reaction and collision avoidance, which has been recently explored in the context of camera-based visual scene interpretation. Owing to curated datasets, substantial advances in scene interpretation have been made in a related field of unmanned ground vehicles. However, the current maritime datasets do not adequately capture the complexity of real-world USV scenes and the evaluation protocols are not standardised, which makes cross-paper comparison of different methods difficult and hinders the progress. To address these issues, we introduce a new obstacle detection benchmark MODS, which considers two major perception tasks: maritime object detection and the more general maritime obstacle segmentation. We present a new diverse maritime evaluation dataset containing approximately 81k stereo images synchronized with an on-board IMU, with over 60k objects annotated. We propose a new obstacle segmentation performance evaluation protocol that reflects the detection accuracy in a way meaningful for practical USV navigation. Nineteen recent state-of-the-art object detection and obstacle segmentation methods are evaluated using the proposed protocol, creating a benchmark to facilitate development of the field. The proposed dataset, as well as evaluation routines, are made publicly available at vicos.si/resources.  ( 2 min )
    Are deep learning models superior for missing data imputation in large surveys? Evidence from an empirical comparison. (arXiv:2103.09316v2 [cs.LG] UPDATED)
    Multiple imputation (MI) is a popular approach for dealing with missing data arising from non-response in sample surveys. Multiple imputation by chained equations (MICE) is one of the most widely used MI algorithms for multivariate data, but it lacks theoretical foundation and is computationally intensive. Recently, missing data imputation methods based on deep learning models have been developed with encouraging results in small studies. However, there has been limited research on evaluating their performance in realistic settings compared to MICE, particularly in large-scale surveys. We conduct extensive simulation studies based on American Community Survey data to compare the repeated sampling properties of four machine learning based MI methods: MICE with classification trees, MICE with random forests, generative adversarial imputation networks, and multiple imputation using denoising autoencoders. We find the deep learning based imputation methods is superior to MICE in terms of computational time. However, with the default choice of hyperparameters in the common software packages, MICE with classification trees consistently outperforms, often by a large margin, the deep learning imputation methods in terms of bias, mean squared error, and coverage under a range of realistic settings.  ( 2 min )
    Noise fingerprints in quantum computers: Machine learning software tools. (arXiv:2202.04581v1 [quant-ph])
    In this paper we present the high-level functionalities of a quantum-classical machine learning software, whose purpose is to learn the main features (the fingerprint) of quantum noise sources affecting a quantum device, as a quantum computer. Specifically, the software architecture is designed to classify successfully (more than 99% of accuracy) the noise fingerprints in different quantum devices with similar technical specifications, or distinct time-dependences of a noise fingerprint in single quantum machines.  ( 2 min )
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v1 [stat.ML])
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, after using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and by testing on synthetic data, show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.  ( 2 min )
    A Local Geometric Interpretation of Feature Extraction in Deep Feedforward Neural Networks. (arXiv:2202.04632v1 [cs.LG])
    In this paper, we present a local geometric analysis to interpret how deep feedforward neural networks extract low-dimensional features from high-dimensional data. Our study shows that, in a local geometric region, the optimal weight in one layer of the neural network and the optimal feature generated by the previous layer comprise a low-rank approximation of a matrix that is determined by the Bayes action of this layer. This result holds (i) for analyzing both the output layer and the hidden layers of the neural network, and (ii) for neuron activation functions that are locally strictly increasing and continuously differentiable. We use two supervised learning problems to illustrate our results: neural network based maximum likelihood classification (i.e., logistic regression) and neural network based minimum mean square estimation. Experimental validation of these theoretical results will be conducted in our future work.  ( 2 min )
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v1 [math.OC])
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.  ( 2 min )
    An Exploration of Multicalibration Uniform Convergence Bounds. (arXiv:2202.04530v1 [cs.LG])
    Recent works have investigated the sample complexity necessary for fair machine learning. The most advanced of such sample complexity bounds are developed by analyzing multicalibration uniform convergence for a given predictor class. We present a framework which yields multicalibration error uniform convergence bounds by reparametrizing sample complexities for Empirical Risk Minimization (ERM) learning. From this framework, we demonstrate that multicalibration error exhibits dependence on the classifier architecture as well as the underlying data distribution. We perform an experimental evaluation to investigate the behavior of multicalibration error for different families of classifiers. We compare the results of this evaluation to multicalibration error concentration bounds. Our investigation provides additional perspective on both algorithmic fairness and multicalibration error convergence bounds. Given the prevalence of ERM sample complexity bounds, our proposed framework enables machine learning practitioners to easily understand the convergence behavior of multicalibration error for a myriad of classifier architectures.  ( 2 min )
    Defeating Misclassification Attacks Against Transfer Learning. (arXiv:1908.11230v4 [cs.LG] UPDATED)
    Transfer learning is prevalent as a technique to efficiently generate new models (Student models) based on the knowledge transferred from a pre-trained model (Teacher model). However, Teacher models are often publicly available for sharing and reuse, which inevitably introduces vulnerability to trigger severe attacks against transfer learning systems. In this paper, we take a first step towards mitigating one of the most advanced misclassification attacks in transfer learning. We design a distilled differentiator via activation-based network pruning to enervate the attack transferability while retaining accuracy. We adopt an ensemble structure from variant differentiators to improve the defence robustness. To avoid the bloated ensemble size during inference, we propose a two-phase defence, in which inference from the Student model is firstly performed to narrow down the candidate differentiators to be assembled, and later only a small, fixed number of them can be chosen to validate clean or reject adversarial inputs effectively. Our comprehensive evaluations on both large and small image recognition tasks confirm that the Student models with our defence of only 5 differentiators are immune to over 90% of the adversarial inputs with an accuracy loss of less than 10%. Our comparison also demonstrates that our design outperforms prior problematic defences.  ( 2 min )
    When can unlabeled data improve the learning rate?. (arXiv:1905.11866v2 [cs.LG] UPDATED)
    In semi-supervised classification, one is given access both to labeled and unlabeled data. As unlabeled data is typically cheaper to acquire than labeled data, this setup becomes advantageous as soon as one can exploit the unlabeled data in order to produce a better classifier than with labeled data alone. However, the conditions under which such an improvement is possible are not fully understood yet. Our analysis focuses on improvements in the minimax learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones). We argue that for such improvements to be realistic and indisputable, certain specific conditions should be satisfied and previous analyses have failed to meet those conditions. We then demonstrate examples where these conditions can be met, in particular showing rate changes from $1/\sqrt{\ell}$ to $e^{-c\ell}$ and from $1/\sqrt{\ell}$ to $1/\ell$. These results improve our understanding of what is and isn't possible in semi-supervised learning.  ( 2 min )
    A Probabilistic Generative Grammar for Semantic Parsing. (arXiv:1606.06361v2 [cs.CL] UPDATED)
    Domain-general semantic parsing is a long-standing goal in natural language processing, where the semantic parser is capable of robustly parsing sentences from domains outside of which it was trained. Current approaches largely rely on additional supervision from new domains in order to generalize to those domains. We present a generative model of natural language utterances and logical forms and demonstrate its application to semantic parsing. Our approach relies on domain-independent supervision to generalize to new domains. We derive and implement efficient algorithms for training, parsing, and sentence generation. The work relies on a novel application of hierarchical Dirichlet processes (HDPs) for structured prediction, which we also present in this manuscript. This manuscript is an excerpt of chapter 4 from the Ph.D. thesis of Saparov (2022), where the model plays a central role in a larger natural language understanding system. This manuscript provides a new simplified and more complete presentation of the work first introduced in Saparov, Saraswat, and Mitchell (2017). The description and proofs of correctness of the training algorithm, parsing algorithm, and sentence generation algorithm are much simplified in this new presentation. We also describe the novel application of hierarchical Dirichlet processes for structured prediction. In addition, we extend the earlier work with a new model of word morphology, which utilizes the comprehensive morphological data from Wiktionary.  ( 2 min )
    Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models. (arXiv:2202.04593v1 [cs.LG])
    We consider the regret minimization task in a dueling bandits problem with context information. In every round of the sequential decision problem, the learner makes a context-dependent selection of two choice alternatives (arms) to be compared with each other and receives feedback in the form of noisy preference information. We assume that the feedback process is determined by a linear stochastic transitivity model with contextualized utilities (CoLST), and the learner's task is to include the best arm (with highest latent context-dependent utility) in the duel. We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process using perturbed context-dependent utility estimates of the underlying CoLST model. If each arm is associated with a $d$-dimensional feature vector, we show that $\texttt{CoLSTIM}$ achieves a regret of order $\tilde O( \sqrt{dT})$ after $T$ learning rounds. Additionally, we also establish the optimality of $\texttt{CoLSTIM}$ by showing a lower bound for the weak regret that refines the existing average regret analysis. Our experiments demonstrate its superiority over state-of-art algorithms for special cases of CoLST models.  ( 2 min )
    Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. (arXiv:2202.04579v1 [cs.LG])
    Cellular sheaves equip graphs with "geometrical" structure by assigning vector spaces and linear maps to nodes and edges. Graph Neural Networks (GNNs) implicitly assume a graph with a trivial underlying sheaf. This choice is reflected in the structure of the graph Laplacian operator, the properties of the associated diffusion equation, and the characteristics of the convolutional models that discretise this equation. In this paper, we use cellular sheaf theory to show that the underlying geometry of the graph is deeply linked with the performance of GNNs in heterophilic settings and their oversmoothing behaviour. By considering a hierarchy of increasingly general sheaves, we study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands. At the same time, we prove that when the sheaf is non-trivial, discretised parametric diffusion processes have greater control than GNNs over their asymptotic behaviour. On the practical side, we study how sheaves can be learned from data. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations (and corresponding GNN models) and obtain state-of-the-art results in heterophilic settings. Overall, our work provides new connections between GNNs and algebraic topology and would be of interest to both fields.  ( 2 min )
    Contextualize Me -- The Case for Context in Reinforcement Learning. (arXiv:2202.04500v1 [cs.LG])
    While Reinforcement Learning (RL) has made great strides towards solving increasingly complicated problems, many algorithms are still brittle to even slight changes in environments. Contextual Reinforcement Learning (cRL) provides a theoretical framework to model such changes in a principled manner, thereby enabling flexible, precise and interpretable task specification and generation. Thus, cRL is an important formalization for studying generalization in RL. In this work, we reason about solving cRL in theory and practice. We show that theoretically optimal behavior in contextual Markov Decision Processes requires explicit context information. In addition, we empirically explore context-based task generation, utilizing context information in training and propose cGate, our state-modulating policy architecture. To this end, we introduce the first benchmark library designed for generalization based on cRL extensions of popular benchmarks, CARL. In short: Context matters!  ( 2 min )
    Recurrent Spectral Network (RSN): shaping the basin of attraction of a discrete map to reach automated classification. (arXiv:2202.04497v1 [cond-mat.dis-nn])
    A novel strategy to automated classification is introduced which exploits a fully trained dynamical system to steer items belonging to different categories toward distinct asymptotic attractors. These latter are incorporated into the model by taking advantage of the spectral decomposition of the operator that rules the linear evolution across the processing network. Non-linear terms act for a transient and allow to disentangle the data supplied as initial condition to the discrete dynamical system, shaping the boundaries of different attractors. The network can be equipped with several memory kernels which can be sequentially activated for serial datasets handling. Our novel approach to classification, that we here term Recurrent Spectral Network (RSN), is successfully challenged against a simple test-bed model, created for illustrative purposes, as well as a standard dataset for image processing training.  ( 2 min )
    Conditional Drums Generation using Compound Word Representations. (arXiv:2202.04464v1 [cs.SD])
    The field of automatic music composition has seen great progress in recent years, specifically with the invention of transformer-based architectures. When using any deep learning model which considers music as a sequence of events with multiple complex dependencies, the selection of a proper data representation is crucial. In this paper, we tackle the task of conditional drums generation using a novel data encoding scheme inspired by the Compound Word representation, a tokenization process of sequential data. Therefore, we present a sequence-to-sequence architecture where a Bidirectional Long short-term memory (BiLSTM) Encoder receives information about the conditioning parameters (i.e., accompanying tracks and musical attributes), while a Transformer-based Decoder with relative global attention produces the generated drum sequences. We conducted experiments to thoroughly compare the effectiveness of our method to several baselines. Quantitative evaluation shows that our model is able to generate drums sequences that have similar statistical distributions and characteristics to the training corpus. These features include syncopation, compression ratio, and symmetry among others. We also verified, through a listening test, that generated drum sequences sound pleasant, natural and coherent while they "groove" with the given accompaniment.  ( 2 min )
    Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time. (arXiv:2202.04515v1 [cs.LG])
    We give an input sparsity time sampling algorithm for spectrally approximating the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices using a nearly optimal number of samples, improving upon all previously known methods by poly$(q)$ factors. Furthermore, for the important special care of the $q$-fold self-tensoring of a dataset, which is the feature matrix of the degree-$q$ polynomial kernel, the leading term of our method's runtime is proportional to the size of the dataset and has no dependence on $q$. Previous techniques either incur a poly$(q)$ factor slowdown in their runtime or remove the dependence on $q$ at the expense of having sub-optimal target dimension, and depend quadratically on the number of data-points in their runtime. Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in total time that only depends on the size of $X$, and at the same time their $q$-fold Kronecker product acts as a near-isometry for any fixed vector in the column span of $X^{\otimes q}$. We show that our sampling methods generalize to other classes of kernels beyond polynomial, such as Gaussian and Neural Tangent kernels.  ( 2 min )
    Optimal learning rate schedules in high-dimensional non-convex optimization problems. (arXiv:2202.04509v1 [cs.LG])
    Learning rate schedules are ubiquitously used to speed up and improve optimisation. Many different policies have been introduced on an empirical basis, and theoretical analyses have been developed for convex settings. However, in many realistic problems the loss-landscape is high-dimensional and non convex -- a case for which results are scarce. In this paper we present a first analytical study of the role of learning rate scheduling in this setting, focusing on Langevin optimization with a learning rate decaying as $\eta(t)=t^{-\beta}$. We begin by considering models where the loss is a Gaussian random function on the $N$-dimensional sphere ($N\rightarrow \infty$), featuring an extensive number of critical points. We find that to speed up optimization without getting stuck in saddles, one must choose a decay rate $\beta<1$, contrary to convex setups where $\beta=1$ is generally optimal. We then add to the problem a signal to be recovered. In this setting, the dynamics decompose into two phases: an \emph{exploration} phase where the dynamics navigates through rough parts of the landscape, followed by a \emph{convergence} phase where the signal is detected and the dynamics enter a convex basin. In this case, it is optimal to keep a large learning rate during the exploration phase to escape the non-convex region as quickly as possible, then use the convex criterion $\beta=1$ to converge rapidly to the solution. Finally, we demonstrate that our conclusions hold in a common regression task involving neural networks.  ( 2 min )
    Stability Analysis of Recurrent Neural Networks by IQC with Copositive Mutipliers. (arXiv:2202.04592v1 [math.OC])
    This paper is concerned with the stability analysis of the recurrent neural networks (RNNs) by means of the integral quadratic constraint (IQC) framework. The rectified linear unit (ReLU) is typically employed as the activation function of the RNN, and the ReLU has specific nonnegativity properties regarding its input and output signals. Therefore, it is effective if we can derive IQC-based stability conditions with multipliers taking care of such nonnegativity properties. However, such nonnegativity (linear) properties are hardly captured by the existing multipliers defined on the positive semidefinite cone. To get around this difficulty, we loosen the standard positive semidefinite cone to the copositive cone, and employ copositive multipliers to capture the nonnegativity properties. We show that, within the framework of the IQC, we can employ copositive multipliers (or their inner approximation) together with existing multipliers such as Zames-Falb multipliers and polytopic bounding multipliers, and this directly enables us to ensure that the introduction of the copositive multipliers leads to better (no more conservative) results. We finally illustrate the effectiveness of the IQC-based stability conditions with the copositive multipliers by numerical examples.  ( 2 min )
    Prediction Sensitivity: Continual Audit of Counterfactual Fairness in Deployed Classifiers. (arXiv:2202.04504v1 [cs.LG])
    As AI-based systems increasingly impact many areas of our lives, auditing these systems for fairness is an increasingly high-stakes problem. Traditional group fairness metrics can miss discrimination against individuals and are difficult to apply after deployment. Counterfactual fairness describes an individualized notion of fairness but is even more challenging to evaluate after deployment. We present prediction sensitivity, an approach for continual audit of counterfactual fairness in deployed classifiers. Prediction sensitivity helps answer the question: would this prediction have been different, if this individual had belonged to a different demographic group -- for every prediction made by the deployed model. Prediction sensitivity can leverage correlations between protected status and other features and does not require protected status information at prediction time. Our empirical results demonstrate that prediction sensitivity is effective for detecting violations of counterfactual fairness.  ( 2 min )
    Obtaining Dyadic Fairness by Optimal Transport. (arXiv:2202.04520v1 [cs.LG])
    Fairness has been taken as a critical metric on machine learning models. Many works studying how to obtain fairness for different tasks emerge. This paper considers obtaining fairness for link prediction tasks, which can be measured by dyadic fairness. We aim to propose a pre-processing methodology to obtain dyadic fairness through data repairing and optimal transport. To obtain dyadic fairness with satisfying flexibility and unambiguity requirements, we transform the dyadic repairing to the conditional distribution alignment problem based on optimal transport and obtain theoretical results on the connection between the proposed alignment and dyadic fairness. The optimal transport-based dyadic fairness algorithm is proposed for graph link prediction. Our proposed algorithm shows superior results on obtaining fairness compared with the other pre-processing methods on two benchmark graph datasets.  ( 2 min )
    Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. (arXiv:2202.04538v1 [cs.CL])
    Pretrained language models (PLMs) have demonstrated remarkable performance in various natural language processing tasks: Unidirectional PLMs (e.g., GPT) are well known for their superior text generation capabilities; bidirectional PLMs (e.g., BERT) have been the prominent choice for natural language understanding (NLU) tasks. While both types of models have achieved promising few-shot learning performance, their potential for zero-shot learning has been underexplored. In this paper, we present a simple approach that uses both types of PLMs for fully zero-shot learning of NLU tasks without requiring any task-specific data: A unidirectional PLM generates class-conditioned texts guided by prompts, which are used as the training data for fine-tuning a bidirectional PLM. With quality training data selected based on the generation probability and regularization techniques (label smoothing and temporal ensembling) applied to the fine-tuning stage for better generalization and stability, our approach demonstrates strong performance across seven classification tasks of the GLUE benchmark (e.g., 72.3/73.8 on MNLI-m/mm and 92.8 on SST-2), significantly outperforming zero-shot prompting methods and achieving even comparable results to strong few-shot approaches using 32 training samples per class.  ( 2 min )
    CRAT-Pred: Vehicle Trajectory Prediction with Crystal Graph Convolutional Neural Networks and Multi-Head Self-Attention. (arXiv:2202.04488v1 [cs.CV])
    Predicting the motion of surrounding vehicles is essential for autonomous vehicles, as it governs their own motion plan. Current state-of-the-art vehicle prediction models heavily rely on map information. In reality, however, this information is not always available. We therefore propose CRAT-Pred, a multi-modal and non-rasterization-based trajectory prediction model, specifically designed to effectively model social interactions between vehicles, without relying on map information. CRAT-Pred applies a graph convolution method originating from the field of material science to vehicle prediction, allowing to efficiently leverage edge features, and combines it with multi-head self-attention. Compared to other map-free approaches, the model achieves state-of-the-art performance with a significantly lower number of model parameters. In addition to that, we quantitatively show that the self-attention mechanism is able to learn social interactions between vehicles, with the weights representing a measurable interaction score. The source code is publicly available.  ( 2 min )
    Scenario-Assisted Deep Reinforcement Learning. (arXiv:2202.04337v1 [cs.LG])
    Deep reinforcement learning has proven remarkably useful in training agents from unstructured data. However, the opacity of the produced agents makes it difficult to ensure that they adhere to various requirements posed by human engineers. In this work-in-progress report, we propose a technique for enhancing the reinforcement learning training process (specifically, its reward calculation), in a way that allows human engineers to directly contribute their expert knowledge, making the agent under training more likely to comply with various relevant constraints. Moreover, our proposed approach allows formulating these constraints using advanced model engineering techniques, such as scenario-based modeling. This mix of black-box learning-based tools with classical modeling approaches could produce systems that are effective and efficient, but are also more transparent and maintainable. We evaluated our technique using a case-study from the domain of internet congestion control, obtaining promising results.  ( 2 min )
    Explainable Predictive Modeling for Limited Spectral Data. (arXiv:2202.04527v1 [cs.LG])
    Feature selection of high-dimensional labeled data with limited observations is critical for making powerful predictive modeling accessible, scalable, and interpretable for domain experts. Spectroscopy data, which records the interaction between matter and electromagnetic radiation, particularly holds a lot of information in a single sample. Since acquiring such high-dimensional data is a complex task, it is crucial to exploit the best analytical tools to extract necessary information. In this paper, we investigate the most commonly used feature selection techniques and introduce applying recent explainable AI techniques to interpret the prediction outcomes of high-dimensional and limited spectral data. Interpretation of the prediction outcome is beneficial for the domain experts as it ensures the transparency and faithfulness of the ML models to the domain knowledge. Due to the instrument resolution limitations, pinpointing important regions of the spectroscopy data creates a pathway to optimize the data collection process through the miniaturization of the spectrometer device. Reducing the device size and power and therefore cost is a requirement for the real-world deployment of such a sensor-to-prediction system as a whole. We specifically design three different scenarios to ensure that the evaluation of ML models is robust for the real-time practice of the developed methodologies and to uncover the hidden effect of noise sources on the final outcome.  ( 2 min )
    Lightweight Jet Reconstruction and Identification as an Object Detection Task. (arXiv:2202.04499v1 [hep-ex])
    We apply object detection techniques based on deep convolutional blocks to end-to-end jet identification and reconstruction tasks encountered at the CERN Large Hadron Collider (LHC). Collision events produced at the LHC and represented as an image composed of calorimeter and tracker cells are given as an input to a Single Shot Detection network. The algorithm, named PFJet-SSD performs simultaneous localization, classification and regression tasks to cluster jets and reconstruct their features. This all-in-one single feed-forward pass gives advantages in terms of execution time and an improved accuracy w.r.t. traditional rule-based methods. A further gain is obtained from network slimming, homogeneous quantization, and optimized runtime for meeting memory and latency constraints of a typical real-time processing environment. We experiment with 8-bit and ternary quantization, benchmarking their accuracy and inference latency against a single-precision floating-point. We show that the ternary network closely matches the performance of its full-precision equivalent and outperforms the state-of-the-art rule-based algorithm. Finally, we report the inference latency on different hardware platforms and discuss future applications.  ( 2 min )
    Precision Radiotherapy via Information Integration of Expert Human Knowledge and AI Recommendation to Optimize Clinical Decision Making. (arXiv:2202.04565v1 [stat.ML])
    In the precision medicine era, there is a growing need for precision radiotherapy where the planned radiation dose needs to be optimally determined by considering a myriad of patient-specific information in order to ensure treatment efficacy. Existing artificial-intelligence (AI) methods can recommend radiation dose prescriptions within the scope of this available information. However, treating physicians may not fully entrust the AI's recommended prescriptions due to known limitations or when the AI recommendation may go beyond physicians' current knowledge. This paper lays out a systematic method to integrate expert human knowledge with AI recommendations for optimizing clinical decision making. Towards this goal, Gaussian process (GP) models are integrated with deep neural networks (DNNs) to quantify the uncertainty of the treatment outcomes given by physicians and AI recommendations, respectively, which are further used as a guideline to educate clinical physicians and improve AI models performance. The proposed method is demonstrated in a comprehensive dataset where patient-specific information and treatment outcomes are prospectively collected during radiotherapy of $67$ non-small cell lung cancer patients and retrospectively analyzed.  ( 2 min )
    False Memory Formation in Continual Learners Through Imperceptible Backdoor Trigger. (arXiv:2202.04479v1 [cs.LG])
    In this brief, we show that sequentially learning new information presented to a continual (incremental) learning model introduces new security risks: an intelligent adversary can introduce small amount of misinformation to the model during training to cause deliberate forgetting of a specific task or class at test time, thus creating "false memory" about that task. We demonstrate such an adversary's ability to assume control of the model by injecting "backdoor" attack samples to commonly used generative replay and regularization based continual learning approaches using continual learning benchmark variants of MNIST, as well as the more challenging SVHN and CIFAR 10 datasets. Perhaps most damaging, we show this vulnerability to be very acute and exceptionally effective: the backdoor pattern in our attack model can be imperceptible to human eye, can be provided at any point in time, can be added into the training data of even a single possibly unrelated task and can be achieved with as few as just 1\% of total training dataset of a single task.  ( 2 min )
    Predicting the intended action using internal simulation of perception. (arXiv:2202.04466v1 [cs.AI])
    This article proposes an architecture, which allows the prediction of intention by internally simulating perceptual states represented by action pattern vectors. To this end, associative self-organising neural networks (A-SOM) is utilised to build a hierarchical cognitive architecture for recognition and simulation of the skeleton based human actions. The abilities of the proposed architecture in recognising and predicting actions is evaluated in experiments using three different datasets of 3D actions. Based on the experiments of this article, applying internally simulated perceptual states represented by action pattern vectors improves the performance of the recognition task in all experiments. Furthermore, internal simulation of perception addresses the problem of having limited access to the sensory input, and also the future prediction of the consecutive perceptual sequences. The performance of the system is compared and discussed with similar architecture using self-organizing neural networks (SOM).  ( 2 min )
    Rethinking Goal-conditioned Supervised Learning and Its Connection to Offline RL. (arXiv:2202.04478v1 [cs.LG])
    Solving goal-conditioned tasks with sparse rewards using self-supervised learning is promising because of its simplicity and stability over current reinforcement learning (RL) algorithms. A recent work, called Goal-Conditioned Supervised Learning (GCSL), provides a new learning framework by iteratively relabeling and imitating self-generated experiences. In this paper, we revisit the theoretical property of GCSL -- optimizing a lower bound of the goal reaching objective, and extend GCSL as a novel offline goal-conditioned RL algorithm. The proposed method is named Weighted GCSL (WGCSL), in which we introduce an advanced compound weight consisting of three parts (1) discounted weight for goal relabeling, (2) goal-conditioned exponential advantage weight, and (3) best-advantage weight. Theoretically, WGCSL is proved to optimize an equivalent lower bound of the goal-conditioned RL objective and generates monotonically improved policies via an iterated scheme. The monotonic property holds for any behavior policies, and therefore WGCSL can be applied to both online and offline settings. To evaluate algorithms in the offline goal-conditioned RL setting, we provide a benchmark including a range of point and simulated robot domains. Experiments in the introduced benchmark demonstrate that WGCSL can consistently outperform GCSL and existing state-of-the-art offline methods in the fully offline goal-conditioned setting.  ( 2 min )
    The no-free-lunch theorems of supervised learning. (arXiv:2202.04513v1 [cs.LG])
    The no-free-lunch theorems promote a skeptical conclusion that all possible machine learning algorithms equally lack justification. But how could this leave room for a learning theory, that shows that some algorithms are better than others? Drawing parallels to the philosophy of induction, we point out that the no-free-lunch results presuppose a conception of learning algorithms as purely data-driven. On this conception, every algorithm must have an inherent inductive bias, that wants justification. We argue that many standard learning algorithms should rather be understood as model-dependent: in each application they also require for input a model, representing a bias. Generic algorithms themselves, they can be given a model-relative justification.  ( 2 min )
    Finding Optimal Arms in Non-stochastic Combinatorial Bandits with Semi-bandit Feedback and Finite Budget. (arXiv:2202.04487v1 [cs.LG])
    We consider the combinatorial bandits problem with semi-bandit feedback under finite sampling budget constraints, in which the learner can carry out its action only for a limited number of times specified by an overall budget. The action is to choose a set of arms, whereupon feedback for each arm in the chosen set is received. Unlike existing works, we study this problem in a non-stochastic setting with subset-dependent feedback, i.e., the semi-bandit feedback received could be generated by an oblivious adversary and also might depend on the chosen set of arms. In addition, we consider a general feedback scenario covering both the numerical-based as well as preference-based case and introduce a sound theoretical framework for this setting guaranteeing sensible notions of optimal arms, which a learner seeks to find. We suggest a generic algorithm suitable to cover the full spectrum of conceivable arm elimination strategies from aggressive to conservative. Theoretical questions about the sufficient and necessary budget of the algorithm to find the best arm are answered and complemented by deriving lower bounds for any learning algorithm for this problem scenario.  ( 2 min )
    Empirical Risk Minimization with Relative Entropy Regularization: Optimality and Sensitivity Analysis. (arXiv:2202.04385v1 [cs.LG])
    The optimality and sensitivity of the empirical risk minimization problem with relative entropy regularization (ERM-RER) are investigated for the case in which the reference is a sigma-finite measure instead of a probability measure. This generalization allows for a larger degree of flexibility in the incorporation of prior knowledge over the set of models. In this setting, the interplay of the regularization parameter, the reference measure, the risk function, and the empirical risk induced by the solution of the ERM-RER problem is characterized. This characterization yields necessary and sufficient conditions for the existence of a regularization parameter that achieves an arbitrarily small empirical risk with arbitrarily high probability. The sensitivity of the expected empirical risk to deviations from the solution of the ERM-RER problem is studied. The sensitivity is then used to provide upper and lower bounds on the expected empirical risk. Moreover, it is shown that the expectation of the sensitivity is upper bounded, up to a constant factor, by the square root of the lautum information between the models and the datasets.  ( 2 min )
    House Price Valuation Model Based on Geographically Neural Network Weighted Regression: The Case Study of Shenzhen, China. (arXiv:2202.04358v1 [stat.AP])
    Confronted with the spatial heterogeneity of real estate market, some traditional research utilized Geographically Weighted Regression (GWR) to estimate the house price. However, its kernel function is non-linear, elusive, and complex to opt bandwidth, the predictive power could also be improved. Consequently, a novel technique, Geographical Neural Network Weighted Regression (GNNWR), has been applied to improve the accuracy of real estate appraisal with the help of neural networks. Based on Shenzhen house price dataset, this work conspicuously captures the weight distribution of different variants at Shenzhen real estate market, which GWR is difficult to materialize. Moreover, we focus on the performance of GNNWR, verify its robustness and superiority, refine the experiment process with 10-fold cross-validation, extend its application area from natural to socioeconomic geospatial data. It's a practical and trenchant way to assess house price, and we demonstrate the effectiveness of GNNWR on a complex socioeconomic dataset.  ( 2 min )
    Model Architecture Adaption for Bayesian Neural Networks. (arXiv:2202.04392v1 [cs.LG])
    Bayesian Neural Networks (BNNs) offer a mathematically grounded framework to quantify the uncertainty of model predictions but come with a prohibitive computation cost for both training and inference. In this work, we show a novel network architecture search (NAS) that optimizes BNNs for both accuracy and uncertainty while having a reduced inference latency. Different from canonical NAS that optimizes solely for in-distribution likelihood, the proposed scheme searches for the uncertainty performance using both in- and out-of-distribution data. Our method is able to search for the correct placement of Bayesian layer(s) in a network. In our experiments, the searched models show comparable uncertainty quantification ability and accuracy compared to the state-of-the-art (deep ensemble). In addition, the searched models use only a fraction of the runtime compared to many popular BNN baselines, reducing the inference runtime cost by $2.98 \times$ and $2.92 \times$ respectively on the CIFAR10 dataset when compared to MCDropout and deep ensemble.  ( 2 min )
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v1 [cs.LG])
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data and identifying outliers. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation, supervised learning and outlier identification with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.  ( 2 min )
    On the Implicit Bias of Gradient Descent for Temporal Extrapolation. (arXiv:2202.04302v1 [cs.LG])
    Common practice when using recurrent neural networks (RNNs) is to apply a model to sequences longer than those seen in training. This "extrapolating" usage deviates from the traditional statistical learning setup where guarantees are provided under the assumption that train and test distributions are identical. Here we set out to understand when RNNs can extrapolate, focusing on a simple case where the data generating distribution is memoryless. We first show that even with infinite training data, there exist RNN models that interpolate perfectly (i.e., they fit the training data) yet extrapolate poorly to longer sequences. We then show that if gradient descent is used for training, learning will converge to perfect extrapolation under certain assumption on initialization. Our results complement recent studies on the implicit bias of gradient descent, showing that it plays a key role in extrapolation when learning temporal prediction models.  ( 2 min )
    Imitation Learning by State-Only Distribution Matching. (arXiv:2202.04332v1 [cs.LG])
    Imitation Learning from observation describes policy learning in a similar way to human learning. An agent's policy is trained by observing an expert performing a task. While many state-only imitation learning approaches are based on adversarial imitation learning, one main drawback is that adversarial training is often unstable and lacks a reliable convergence estimator. If the true environment reward is unknown and cannot be used to select the best-performing model, this can result in bad real-world policy performance. We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric. Our training objective minimizes the Kulback-Leibler divergence (KLD) between the policy and expert state transition trajectories which can be optimized in a non-adversarial fashion. Such methods demonstrate improved robustness when learned density models guide the optimization. We further improve the sample efficiency by rewriting the KLD minimization as the Soft Actor Critic objective based on a modified reward using additional density models that estimate the environment's forward and backward dynamics. Finally, we evaluate the effectiveness of our approach on well-known continuous control environments and show state-of-the-art performance while having a reliable performance estimator compared to several recent learning-from-observation methods.  ( 2 min )
    Generalized Strategic Classification and the Case of Aligned Incentives. (arXiv:2202.04357v1 [cs.LG])
    Predicative machine learning models are frequently being used by companies, institutes and organizations to make choices about humans. Strategic classification studies learning in settings where self-interested users can strategically modify their features to obtain favorable predictive outcomes. A key working assumption, however, is that 'favorable' always means 'positive'; this may be appropriate in some applications (e.g., loan approval, university admissions and hiring), but reduces to a fairly narrow view what user interests can be. In this work we argue for a broader perspective on what accounts for strategic user behavior, and propose and study a flexible model of generalized strategic classification. Our generalized model subsumes most current models, but includes other novel settings; among these, we identify and target one intriguing sub-class of problems in which the interests of users and the system are aligned. For this cooperative setting, we provide an in-depth analysis, and propose a practical learning approach that is effective and efficient. We compare our approach to existing learning methods and show its statistical and optimization benefits. Returning to our fully generalized model, we show how our results and approach can extend to the most general case. We conclude with a set of experiments that empirically demonstrate the utility of our approach.  ( 2 min )
    Improving greedy core-set configurations for active learning with uncertainty-scaled distances. (arXiv:2202.04251v1 [cs.LG])
    We scale perceived distances of the core-set algorithm by a factor of uncertainty and search for low-confidence configurations, finding significant improvements in sample efficiency across CIFAR10/100 and SVHN image classification, especially in larger acquisition sizes. We show the necessity of our modifications and explain how the improvement is due to a probabilistic quadratic speed-up in the convergence of core-set loss, under assumptions about the relationship of model uncertainty and misclassification.  ( 2 min )
    Log-based Anomaly Detection with Deep Learning: How Far Are We?. (arXiv:2202.04301v1 [cs.SE])
    Software-intensive systems produce logs for troubleshooting purposes. Recently, many deep learning models have been proposed to automatically detect system anomalies based on log data. These models typically claim very high detection accuracy. For example, most models report an F-measure greater than 0.9 on the commonly-used HDFS dataset. To achieve a profound understanding of how far we are from solving the problem of log-based anomaly detection, in this paper, we conduct an in-depth analysis of five state-of-the-art deep learning-based models for detecting system anomalies on four public log datasets. Our experiments focus on several aspects of model evaluation, including training data selection, data grouping, class distribution, data noise, and early detection ability. Our results point out that all these aspects have significant impact on the evaluation, and that all the studied models do not always work well. The problem of log-based anomaly detection has not been solved yet. Based on our findings, we also suggest possible future work.  ( 2 min )
    A Reinforcement Learning Approach to Domain-Knowledge Inclusion Using Grammar Guided Symbolic Regression. (arXiv:2202.04367v1 [cs.LG])
    In recent years, symbolic regression has been of wide interest to provide an interpretable symbolic representation of potentially large data relationships. Initially circled to genetic algorithms, symbolic regression methods now include a variety of Deep Learning based alternatives. However, these methods still do not generalize well to real-world data, mainly because they hardly include domain knowledge nor consider physical relationships between variables such as known equations and units. Regarding these issues, we propose a Reinforcement-Based Grammar-Guided Symbolic Regression (RBG2-SR) method that constrains the representational space with domain-knowledge using context-free grammar as reinforcement action space. We detail a Partially-Observable Markov Decision Process (POMDP) modeling of the problem and benchmark our approach against state-of-the-art methods. We also analyze the POMDP state definition and propose a physical equation search use case on which we compare our approach to grammar-based and non-grammarbased symbolic regression methods. The experiment results show that our method is competitive against other state-of-the-art methods on the benchmarks and offers the best error-complexity trade-off, highlighting the interest of using a grammar-based method in a real-world scenario.  ( 2 min )
    Evaluating Causal Inference Methods. (arXiv:2202.04208v1 [stat.ME])
    The fundamental challenge of drawing causal inference is that counterfactual outcomes are not fully observed for any unit. Furthermore, in observational studies, treatment assignment is likely to be confounded. Many statistical methods have emerged for causal inference under unconfoundedness conditions given pre-treatment covariates, including propensity score-based methods, prognostic score-based methods, and doubly robust methods. Unfortunately for applied researchers, there is no `one-size-fits-all' causal method that can perform optimally universally. In practice, causal methods are primarily evaluated quantitatively on handcrafted simulated data. Such data-generative procedures can be of limited value because they are typically stylized models of reality. They are simplified for tractability and lack the complexities of real-world data. For applied researchers, it is critical to understand how well a method performs for the data at hand. Our work introduces a deep generative model-based framework, Credence, to validate causal inference methods. The framework's novelty stems from its ability to generate synthetic data anchored at the empirical distribution for the observed sample, and therefore virtually indistinguishable from the latter. The approach allows the user to specify ground truth for the form and magnitude of causal effects and confounding bias as functions of covariates. Thus simulated data sets are used to evaluate the potential performance of various causal estimation methods when applied to data similar to the observed sample. We demonstrate Credence's ability to accurately assess the relative performance of causal estimation techniques in an extensive simulation study and two real-world data applications from Lalonde and Project STAR studies.  ( 2 min )
    Covariate-informed Representation Learning with Samplewise Optimal Identifiable Variational Autoencoders. (arXiv:2202.04206v1 [stat.ML])
    Recently proposed identifiable variational autoencoder (iVAE, Khemakhem et al. (2020)) framework provides a promising approach for learning latent independent components of the data. Although the identifiability is appealing, the objective function of iVAE does not enforce the inverse relation between encoders and decoders. Without the inverse relation, representations from the encoder in iVAE may not reconstruct observations,i.e., representations lose information in observations. To overcome this limitation, we develop a new approach, covariate-informed identifiable VAE (CI-iVAE). Different from previous iVAE implementations, our method critically leverages the posterior distribution of latent variables conditioned only on observations. In doing so, the objective function enforces the inverse relation, and learned representation contains more information of observations. Furthermore, CI-iVAE extends the original iVAE objective function to a larger class and finds the optimal one among them, thus providing a better fit to the data. Theoretically, our method has tighter evidence lower bounds (ELBOs) than the original iVAE. We demonstrate that our approach can more reliably learn features of various synthetic datasets, two benchmark image datasets (EMNIST and Fashion MNIST), and a large-scale brain imaging dataset for adolescent mental health research.  ( 2 min )
    TinyM$^2$Net: A Flexible System Algorithm Co-designed Multimodal Learning Framework for Tiny Devices. (arXiv:2202.04303v1 [cs.LG])
    With the emergence of Artificial Intelligence (AI), new attention has been given to implement AI algorithms on resource constrained tiny devices to expand the application domain of IoT. Multimodal Learning has recently become very popular with the classification task due to its impressive performance for both image and audio event classification. This paper presents TinyM$^2$Net -- a flexible system algorithm co-designed multimodal learning framework for resource constrained tiny devices. The framework was designed to be evaluated on two different case-studies: COVID-19 detection from multimodal audio recordings and battle field object detection from multimodal images and audios. In order to compress the model to implement on tiny devices, substantial network architecture optimization and mixed precision quantization were performed (mixed 8-bit and 4-bit). TinyM$^2$Net shows that even a tiny multimodal learning model can improve the classification performance than that of any unimodal frameworks. The most compressed TinyM$^2$Net achieves 88.4% COVID-19 detection accuracy (14.5% improvement from unimodal base model) and 96.8\% battle field object detection accuracy (3.9% improvement from unimodal base model). Finally, we test our TinyM$^2$Net models on a Raspberry Pi 4 to see how they perform when deployed to a resource constrained tiny device.  ( 2 min )
    MBCT: Tree-Based Feature-Aware Binning for Individual Uncertainty Calibration. (arXiv:2202.04348v1 [cs.LG])
    Most machine learning classifiers only concern classification accuracy, while certain applications (such as medical diagnosis, meteorological forecasting, and computation advertising) require the model to predict the true probability, known as a calibrated estimate. In previous work, researchers have developed several calibration methods to post-process the outputs of a predictor to obtain calibrated values, such as binning and scaling methods. Compared with scaling, binning methods are shown to have distribution-free theoretical guarantees, which motivates us to prefer binning methods for calibration. However, we notice that existing binning methods have several drawbacks: (a) the binning scheme only considers the original prediction values, thus limiting the calibration performance; and (b) the binning approach is non-individual, mapping multiple samples in a bin to the same value, and thus is not suitable for order-sensitive applications. In this paper, we propose a feature-aware binning framework, called Multiple Boosting Calibration Trees (MBCT), along with a multi-view calibration loss to tackle the above issues. Our MBCT optimizes the binning scheme by the tree structures of features, and adopts a linear function in a tree node to achieve individual calibration. Our MBCT is non-monotonic, and has the potential to improve order accuracy, due to its learnable binning scheme and the individual calibration. We conduct comprehensive experiments on three datasets in different fields. Results show that our method outperforms all competing models in terms of both calibration error and order accuracy. We also conduct simulation experiments, justifying that the proposed multi-view calibration loss is a better metric in modeling calibration error.  ( 2 min )
    Cost-effective Framework for Gradual Domain Adaptation with Multifidelity. (arXiv:2202.04359v1 [stat.ML])
    In domain adaptation, when there is a large distance between the source and target domains, the prediction performance will degrade. Gradual domain adaptation is one of the solutions to such an issue, assuming that we have access to intermediate domains, which shift gradually from the source to target domains. In previous works, it was assumed that the number of samples in the intermediate domains is sufficiently large; hence, self-training was possible without the need for labeled data. If access to an intermediate domain is restricted, self-training will fail. Practically, the cost of samples in intermediate domains will vary, and it is natural to consider that the closer an intermediate domain is to the target domain, the higher the cost of obtaining samples from the intermediate domain is. To solve the trade-off between cost and accuracy, we propose a framework that combines multifidelity and active domain adaptation. The effectiveness of the proposed method is evaluated by experiments with both artificial and real-world datasets. Codes are available at https://github.com/ssgw320/gdamf.  ( 2 min )
    Regulatory Instruments for Fair Personalized Pricing. (arXiv:2202.04245v1 [cs.CY])
    Personalized pricing is a business strategy to charge different prices to individual consumers based on their characteristics and behaviors. It has become common practice in many industries nowadays due to the availability of a growing amount of high granular consumer data. The discriminatory nature of personalized pricing has triggered heated debates among policymakers and academics on how to design regulation policies to balance market efficiency and equity. In this paper, we propose two sound policy instruments, i.e., capping the range of the personalized prices or their ratios. We investigate the optimal pricing strategy of a profit-maximizing monopoly under both regulatory constraints and the impact of imposing them on consumer surplus, producer surplus, and social welfare. We theoretically prove that both proposed constraints can help balance consumer surplus and producer surplus at the expense of total surplus for common demand distributions, such as uniform, logistic, and exponential distributions. Experiments on both simulation and real-world datasets demonstrate the correctness of these theoretical results. Our findings and insights shed light on regulatory policy design for the increasingly monopolized business in the digital era.  ( 2 min )
    FMP: Toward Fair Graph Message Passing against Topology Bias. (arXiv:2202.04187v1 [cs.LG])
    Despite recent advances in achieving fair representations and predictions through regularization, adversarial debiasing, and contrastive learning in graph neural networks (GNNs), the working mechanism (i.e., message passing) behind GNNs inducing unfairness issue remains unknown. In this work, we theoretically and experimentally demonstrate that representative aggregation in message-passing schemes accumulates bias in node representation due to topology bias induced by graph topology. Thus, a \textsf{F}air \textsf{M}essage \textsf{P}assing (FMP) scheme is proposed to aggregate useful information from neighbors but minimize the effect of topology bias in a unified framework considering graph smoothness and fairness objectives. The proposed FMP is effective, transparent, and compatible with back-propagation training. An acceleration approach on gradient calculation is also adopted to improve algorithm efficiency. Experiments on node classification tasks demonstrate that the proposed FMP outperforms the state-of-the-art baselines in effectively and efficiently mitigating bias on three real-world datasets.  ( 2 min )
    Optimal Clustering with Bandit Feedback. (arXiv:2202.04294v1 [cs.LG])
    This paper considers the problem of online clustering with bandit feedback. A set of arms (or items) can be partitioned into various groups that are unknown. Within each group, the observations associated to each of the arms follow the same distribution with the same mean vector. At each time step, the agent queries or pulls an arm and obtains an independent observation from the distribution it is associated to. Subsequent pulls depend on previous ones as well as the previously obtained samples. The agent's task is to uncover the underlying partition of the arms with the least number of arm pulls and with a probability of error not exceeding a prescribed constant $\delta$. The problem proposed finds numerous applications from clustering of variants of viruses to online market segmentation. We present an instance-dependent information-theoretic lower bound on the expected sample complexity for this task, and design a computationally efficient and asymptotically optimal algorithm, namely Bandit Online Clustering (BOC). The algorithm includes a novel stopping rule for adaptive sequential testing that circumvents the need to exactly solve any NP-hard weighted clustering problem as its subroutines. We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.  ( 2 min )
    Gradient Methods Provably Converge to Non-Robust Networks. (arXiv:2202.04347v1 [cs.LG])
    Despite a great deal of research, it is still unclear why neural networks are so susceptible to adversarial examples. In this work, we identify natural settings where depth-$2$ ReLU networks trained with gradient flow are provably non-robust (susceptible to small adversarial $\ell_2$-perturbations), even when robust networks that classify the training dataset correctly exist. Perhaps surprisingly, we show that the well-known implicit bias towards margin maximization induces bias towards non-robust networks, by proving that every network which satisfies the KKT conditions of the max-margin problem is non-robust.  ( 2 min )
    Parametric t-Stochastic Neighbor Embedding With Quantum Neural Network. (arXiv:2202.04238v1 [quant-ph])
    t-Stochastic Neighbor Embedding (t-SNE) is a non-parametric data visualization method in classical machine learning. It maps the data from the high-dimensional space into a low-dimensional space, especially a two-dimensional plane, while maintaining the relationship, or similarities, between the surrounding points. In t-SNE, the initial position of the low-dimensional data is randomly determined, and the visualization is achieved by moving the low-dimensional data to minimize a cost function. Its variant called parametric t-SNE uses neural networks for this mapping. In this paper, we propose to use quantum neural networks for parametric t-SNE to reflect the characteristics of high-dimensional quantum data on low-dimensional data. We use fidelity-based metrics instead of Euclidean distance in calculating high-dimensional data similarity. We visualize both classical (Iris dataset) and quantum (time-depending Hamiltonian dynamics) data for classification tasks. Since this method allows us to represent a quantum dataset in a higher dimensional Hilbert space by a quantum dataset in a lower dimension while keeping their similarity, the proposed method can also be used to compress quantum data for further quantum machine learning.  ( 2 min )
    MMLN: Leveraging Domain Knowledge for Multimodal Diagnosis. (arXiv:2202.04266v1 [cs.LG])
    Recent studies show that deep learning models achieve good performance on medical imaging tasks such as diagnosis prediction. Among the models, multimodality has been an emerging trend, integrating different forms of data such as chest X-ray (CXR) images and electronic medical records (EMRs). However, most existing methods incorporate them in a model-free manner, which lacks theoretical support and ignores the intrinsic relations between different data sources. To address this problem, we propose a knowledge-driven and data-driven framework for lung disease diagnosis. By incorporating domain knowledge, machine learning models can reduce the dependence on labeled data and improve interpretability. We formulate diagnosis rules according to authoritative clinical medicine guidelines and learn the weights of rules from text data. Finally, a multimodal fusion consisting of text and image data is designed to infer the marginal probability of lung disease. We conduct experiments on a real-world dataset collected from a hospital. The results show that the proposed method outperforms the state-of-the-art multimodal baselines in terms of accuracy and interpretability.  ( 2 min )
    Parsimonious Learning-Augmented Caching. (arXiv:2202.04262v1 [cs.DS])
    Learning-augmented algorithms -- in which, traditional algorithms are augmented with machine-learned predictions -- have emerged as a framework to go beyond worst-case analysis. The overarching goal is to design algorithms that perform near-optimally when the predictions are accurate yet retain certain worst-case guarantees irrespective of the accuracy of the predictions. This framework has been successfully applied to online problems such as caching where the predictions can be used to alleviate uncertainties. In this paper we introduce and study the setting in which the learning-augmented algorithm can utilize the predictions parsimoniously. We consider the caching problem -- which has been extensively studied in the learning-augmented setting -- and show that one can achieve quantitatively similar results but only using a sublinear number of predictions.  ( 2 min )
    A new perspective on classification: optimally allocating limited resources to uncertain tasks. (arXiv:2202.04369v1 [cs.LG])
    A central problem in business concerns the optimal allocation of limited resources to a set of available tasks, where the payoff of these tasks is inherently uncertain. In credit card fraud detection, for instance, a bank can only assign a small subset of transactions to their fraud investigations team. Typically, such problems are solved using a classification framework, where the focus is on predicting task outcomes given a set of characteristics. Resources are then allocated to the tasks that are predicted to be the most likely to succeed. However, we argue that using classification to address task uncertainty is inherently suboptimal as it does not take into account the available capacity. Therefore, we first frame the problem as a type of assignment problem. Then, we present a novel solution using learning to rank by directly optimizing the assignment's expected profit given limited, stochastic capacity. This is achieved by optimizing a specific instance of the net discounted cumulative gain, a commonly used class of metrics in learning to rank. Empirically, we demonstrate that our new method achieves higher expected profit and expected precision compared to a classification approach for a wide variety of application areas and data sets. This illustrates the benefit of an integrated approach and of explicitly considering the available resources when learning a predictive model.  ( 2 min )
    Amplitude Spectrum Transformation for Open Compound Domain Adaptive Semantic Segmentation. (arXiv:2202.04287v1 [cs.CV])
    Open compound domain adaptation (OCDA) has emerged as a practical adaptation setting which considers a single labeled source domain against a compound of multi-modal unlabeled target data in order to generalize better on novel unseen domains. We hypothesize that an improved disentanglement of domain-related and task-related factors of dense intermediate layer features can greatly aid OCDA. Prior-arts attempt this indirectly by employing adversarial domain discriminators on the spatial CNN output. However, we find that latent features derived from the Fourier-based amplitude spectrum of deep CNN features hold a more tractable mapping with domain discrimination. Motivated by this, we propose a novel feature space Amplitude Spectrum Transformation (AST). During adaptation, we employ the AST auto-encoder for two purposes. First, carefully mined source-target instance pairs undergo a simulation of cross-domain feature stylization (AST-Sim) at a particular layer by altering the AST-latent. Second, AST operating at a later layer is tasked to normalize (AST-Norm) the domain content by fixing its latent to a mean prototype. Our simplified adaptation technique is not only clustering-free but also free from complex adversarial alignment. We achieve leading performance against the prior arts on the OCDA scene segmentation benchmarks.  ( 2 min )
    RECOVER: sequential model optimization platform for combination drug repurposing identifies novel synergistic compounds in vitro. (arXiv:2202.04202v1 [q-bio.QM])
    Selecting optimal drug repurposing combinations for further preclinical development is a challenging technical feat. Due to the toxicity of many therapeutic agents (e.g., chemotherapy), practitioners have favoured selection of synergistic compounds whereby lower doses can be used whilst maintaining high efficacy. For a fixed small molecule library, an exhaustive combinatorial chemical screen becomes infeasible to perform for academic and industry laboratories alike. Deep learning models have achieved state-of-the-art results in silico for the prediction of synergy scores. However, databases of drug combinations are highly biased towards synergistic agents and these results do not necessarily generalise out of distribution. We employ a sequential model optimization search applied to a deep learning model to quickly discover highly synergistic drug combinations active against a cancer cell line, while requiring substantially less screening than an exhaustive evaluation. Through iteratively adapting the model to newly acquired data, after only 3 rounds of ML-guided experimentation (including a calibration round), we find that the set of combinations queried by our model is enriched for highly synergistic combinations. Remarkably, we rediscovered a synergistic drug combination that was later confirmed to be under study within clinical trials.  ( 2 min )
    Vertical Federated Learning: Challenges, Methodologies and Experiments. (arXiv:2202.04309v1 [cs.LG])
    Recently, federated learning (FL) has emerged as a promising distributed machine learning (ML) technology, owing to the advancing computational and sensing capacities of end-user devices, however with the increasing concerns on users' privacy. As a special architecture in FL, vertical FL (VFL) is capable of constructing a hyper ML model by embracing sub-models from different clients. These sub-models are trained locally by vertically partitioned data with distinct attributes. Therefore, the design of VFL is fundamentally different from that of conventional FL, raising new and unique research issues. In this paper, we aim to discuss key challenges in VFL with effective solutions, and conduct experiments on real-life datasets to shed light on these issues. Specifically, we first propose a general framework on VFL, and highlight the key differences between VFL and conventional FL. Then, we discuss research challenges rooted in VFL systems under four aspects, i.e., security and privacy risks, expensive computation and communication costs, possible structural damage caused by model splitting, and system heterogeneity. Afterwards, we develop solutions to addressing the aforementioned challenges, and conduct extensive experiments to showcase the effectiveness of our proposed solutions.  ( 2 min )
    A Data-Driven Approach to Robust Hypothesis Testing Using Sinkhorn Uncertainty Sets. (arXiv:2202.04258v1 [stat.ML])
    This paper is eligible for the Jack Keil Wolf ISIT Student Paper Award. Hypothesis testing for small-sample scenarios is a practically important problem. In this paper, we investigate the robust hypothesis testing problem in a data-driven manner, where we seek the worst-case detector over distributional uncertainty sets centered around the empirical distribution from samples using Sinkhorn distance. Compared with the Wasserstein robust test, the corresponding least favorable distributions are supported beyond the training samples, which provides a more flexible detector. Various numerical experiments are conducted on both synthetic and real datasets to validate the competitive performances of our proposed method.  ( 2 min )
    Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models. (arXiv:2202.04173v1 [cs.CL])
    Pre-trained language models (LMs) are shown to easily generate toxic language. In this work, we systematically explore domain-adaptive training to reduce the toxicity of language models. We conduct this study on three dimensions: training corpus, model size, and parameter efficiency. For the training corpus, we propose to leverage the generative power of LMs and generate nontoxic datasets for domain-adaptive training, which mitigates the exposure bias and is shown to be more data-efficient than using a curated pre-training corpus. We demonstrate that the self-generation method consistently outperforms the existing baselines across various model sizes on both automatic and human evaluations, even when it uses a 1/3 smaller training corpus. We then comprehensively study detoxifying LMs with parameter sizes ranging from 126M up to 530B (3x larger than GPT-3), a scale that has never been studied before. We find that i) large LMs have similar toxicity levels as smaller ones given the same pre-training corpus, and ii) large LMs require more endeavor to detoxify. We also explore parameter-efficient training methods for detoxification. We demonstrate that adding and training adapter-only layers in LMs not only saves a lot of parameters but also achieves a better trade-off between toxicity and perplexity than whole model adaptation for the large-scale models.  ( 2 min )
    On Almost Sure Convergence Rates of Stochastic Gradient Methods. (arXiv:2202.04295v1 [cs.LG])
    The vast majority of convergence rates analysis for stochastic gradient methods in the literature focus on convergence in expectation, whereas trajectory-wise almost sure convergence is clearly important to ensure that any instantiation of the stochastic algorithms would converge with probability one. Here we provide a unified almost sure convergence rates analysis for stochastic gradient descent (SGD), stochastic heavy-ball (SHB), and stochastic Nesterov's accelerated gradient (SNAG) methods. We show, for the first time, that the almost sure convergence rates obtained for these stochastic gradient methods on strongly convex functions, are arbitrarily close to their optimal convergence rates possible. For non-convex objective functions, we not only show that a weighted average of the squared gradient norms converges to zero almost surely, but also the last iterates of the algorithms. We further provide last-iterate almost sure convergence rates analysis for stochastic gradient methods on weakly convex smooth functions, in contrast with most existing results in the literature that only provide convergence in expectation for a weighted average of the iterates.  ( 2 min )
    Federated Learning of Generative Image Priors for MRI Reconstruction. (arXiv:2202.04175v1 [eess.IV])
    Multi-institutional efforts can facilitate training of deep MRI reconstruction models, albeit privacy risks arise during cross-site sharing of imaging data. Federated learning (FL) has recently been introduced to address privacy concerns by enabling distributed training without transfer of imaging data. Existing FL methods for MRI reconstruction employ conditional models to map from undersampled to fully-sampled acquisitions via explicit knowledge of the imaging operator. Since conditional models generalize poorly across different acceleration rates or sampling densities, imaging operators must be fixed between training and testing, and they are typically matched across sites. To improve generalization and flexibility in multi-institutional collaborations, here we introduce a novel method for MRI reconstruction based on Federated learning of Generative IMage Priors (FedGIMP). FedGIMP leverages a two-stage approach: cross-site learning of a generative MRI prior, and subject-specific injection of the imaging operator. The global MRI prior is learned via an unconditional adversarial model that synthesizes high-quality MR images based on latent variables. Specificity in the prior is preserved via a mapper subnetwork that produces site-specific latents. During inference, the prior is combined with subject-specific imaging operators to enable reconstruction, and further adapted to individual test samples by minimizing data-consistency loss. Comprehensive experiments on multi-institutional datasets clearly demonstrate enhanced generalization performance of FedGIMP against site-specific and federated methods based on conditional models, as well as traditional reconstruction methods.  ( 2 min )
    VAEL: Bridging Variational Autoencoders and Probabilistic Logic Programming. (arXiv:2202.04178v1 [cs.PL])
    We present VAEL, a neuro-symbolic generative model integrating variational autoencoders (VAE) with the reasoning capabilities of probabilistic logic (L) programming. Besides standard latent subsymbolic variables, our model exploits a probabilistic logic program to define a further structured representation, which is used for logical reasoning. The entire process is end-to-end differentiable. Once trained, VAEL can solve new unseen generation tasks by (i) leveraging the previously acquired knowledge encoded in the neural component and (ii) exploiting new logical programs on the structured latent space. Our experiments provide support on the benefits of this neuro-symbolic integration both in terms of task generalization and data efficiency. To the best of our knowledge, this work is the first to propose a general-purpose end-to-end framework integrating probabilistic logic programming into a deep generative model.  ( 2 min )
    Simplified Graph Convolution with Heterophily. (arXiv:2202.04139v1 [cs.LG])
    Graph convolutional networks (GCNs) (Kipf & Welling, 2017) attempt to extend the success of deep learning in modeling image and text data to graphs. However, like other deep models, GCNs comprise repeated nonlinear transformations of inputs and are therefore time and memory intensive to train. Recent work has shown that a much simpler and faster model, Simple Graph Convolution (SGC) (Wu et al., 2019), is competitive with GCNs in common graph machine learning benchmarks. The use of graph data in SGC implicitly assumes the common but not universal graph characteristic of homophily, wherein nodes link to nodes which are similar. Here we show that SGC is indeed ineffective for heterophilous (i.e., non-homophilous) graphs via experiments on synthetic and real-world datasets. We propose Adaptive Simple Graph Convolution (ASGC), which we show can adapt to both homophilous and heterophilous graph structure. Like SGC, ASGC is not a deep model, and hence is fast, scalable, and interpretable. We find that our non-deep method often outperforms state-of-the-art deep models at node classification on a benchmark of real-world datasets. The SGC paper questioned whether the complexity of graph neural networks is warranted for common graph problems involving homophilous networks; our results suggest that this question is still open even for more complicated problems involving heterophilous networks.  ( 2 min )
    Learning Robust Convolutional Neural Networks with Relevant Feature Focusing via Explanations. (arXiv:2202.04237v1 [cs.CV])
    Existing image recognition techniques based on convolutional neural networks (CNNs) basically assume that the training and test datasets are sampled from i.i.d distributions. However, this assumption is easily broken in the real world because of the distribution shift that occurs when the co-occurrence relations between objects and backgrounds in input images change. Under this type of distribution shift, CNNs learn to focus on features that are not task-relevant, such as backgrounds from the training data, and degrade their accuracy on the test data. To tackle this problem, we propose relevant feature focusing (ReFF). ReFF detects task-relevant features and regularizes CNNs via explanation outputs (e.g., Grad-CAM). Since ReFF is composed of post-hoc explanation modules, it can be easily applied to off-the-shelf CNNs. Furthermore, ReFF requires no additional inference cost at test time because it is only used for regularization while training. We demonstrate that CNNs trained with ReFF focus on features relevant to the target task and that ReFF improves the test-time accuracy.  ( 2 min )
    A Novel Ontology-guided Attribute Partitioning Ensemble Learning Model for Early Prediction of Cognitive Deficits using Quantitative Structural MRI in Very Preterm Infants. (arXiv:2202.04134v1 [cs.LG])
    Structural magnetic resonance imaging studies have shown that brain anatomical abnormalities are associated with cognitive deficits in preterm infants. Brain maturation and geometric features can be used with machine learning models for predicting later neurodevelopmental deficits. However, traditional machine learning models would suffer from a large feature-to-instance ratio (i.e., a large number of features but a small number of instances/samples). Ensemble learning is a paradigm that strategically generates and integrates a library of machine learning classifiers and has been successfully used on a wide variety of predictive modeling problems to boost model performance. Attribute (i.e., feature) bagging method is the most commonly used feature partitioning scheme, which randomly and repeatedly draws feature subsets from the entire feature set. Although attribute bagging method can effectively reduce feature dimensionality to handle the large feature-to-instance ratio, it lacks consideration of domain knowledge and latent relationship among features. In this study, we proposed a novel Ontology-guided Attribute Partitioning (OAP) method to better draw feature subsets by considering domain-specific relationship among features. With the better partitioned feature subsets, we developed an ensemble learning framework, which is referred to as OAP Ensemble Learning (OAP-EL). We applied the OAP-EL to predict cognitive deficits at 2 year of age using quantitative brain maturation and geometric features obtained at term equivalent age in very preterm infants. We demonstrated that the proposed OAP-EL approach significantly outperformed the peer ensemble learning and traditional machine learning approaches.  ( 2 min )
    Machine Learning in Heterogeneous Porous Materials. (arXiv:2202.04137v1 [cs.LG])
    The "Workshop on Machine learning in heterogeneous porous materials" brought together international scientific communities of applied mathematics, porous media, and material sciences with experts in the areas of heterogeneous materials, machine learning (ML) and applied mathematics to identify how ML can advance materials research. Within the scope of ML and materials research, the goal of the workshop was to discuss the state-of-the-art in each community, promote crosstalk and accelerate multi-disciplinary collaborative research, and identify challenges and opportunities. As the end result, four topic areas were identified: ML in predicting materials properties, and discovery and design of novel materials, ML in porous and fractured media and time-dependent phenomena, Multi-scale modeling in heterogeneous porous materials via ML, and Discovery of materials constitutive laws and new governing equations. This workshop was part of the AmeriMech Symposium series sponsored by the National Academies of Sciences, Engineering and Medicine and the U.S. National Committee on Theoretical and Applied Mechanics.  ( 2 min )
    Financial Vision Based Reinforcement Learning Trading Strategy. (arXiv:2202.04115v1 [cs.AI])
    Recent advances in artificial intelligence (AI) for quantitative trading have led to its general superhuman performance in significant trading performance. However, the potential risk of AI trading is a "black box" decision. Some AI computing mechanisms are complex and challenging to understand. If we use AI without proper supervision, AI may lead to wrong choices and make huge losses. Hence, we need to ask about the AI "black box", including why did AI decide to do this or not? Why can people trust AI or not? How can people fix their mistakes? These problems also highlight the challenges that AI technology can explain in the trading field.  ( 2 min )
    PGMax: Factor Graphs for Discrete Probabilistic Graphical Models and Loopy Belief Propagation in JAX. (arXiv:2202.04110v1 [cs.LG])
    PGMax is an open-source Python package for easy specification of discrete Probabilistic Graphical Models (PGMs) as factor graphs, and automatic derivation of efficient and scalable loopy belief propagation (LBP) implementation in JAX. It supports general factor graphs, and can effectively leverage modern accelerators like GPUs for inference. Compared with existing alternatives, PGMax obtains higher-quality inference results with orders-of-magnitude inference speedups. PGMax additionally interacts seamlessly with the rapidly growing JAX ecosystem, opening up exciting new possibilities. Our source code, examples and documentation are available at https://github.com/vicariousinc/PGMax.  ( 2 min )
    A Lagrangian Duality Approach to Active Learning. (arXiv:2202.04108v1 [cs.LG])
    We consider the batch active learning problem, where only a subset of the training data is labeled, and the goal is to query a batch of unlabeled samples to be labeled so as to maximally improve model performance. We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples. Considering a primal-dual approach, we optimize the primal variables, corresponding to the model parameters, as well as the dual variables, corresponding to the constraints. As each dual variable indicates how significantly the perturbation of the respective constraint affects the optimal value of the objective function, we use it as a proxy of the informativeness of the corresponding training sample. Our approach, which we refer to as Active Learning via Lagrangian dualitY, or ALLY, leverages this fact to select a diverse set of unlabeled samples with the highest estimated dual variables as our query set. We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods in a variety of classification and regression tasks. We also demonstrate how ALLY can be used in a generative mode to create novel, maximally-informative samples. The implementation code for ALLY can be found at https://github.com/juanelenter/ALLY.  ( 2 min )
    Improving Computational Complexity in Statistical Models with Second-Order Information. (arXiv:2202.04219v1 [stat.ML])
    It is known that when the statistical models are singular, i.e., the Fisher information matrix at the true parameter is degenerate, the fixed step-size gradient descent algorithm takes polynomial number of steps in terms of the sample size $n$ to converge to a final statistical radius around the true parameter, which can be unsatisfactory for the application. To further improve that computational complexity, we consider the utilization of the second-order information in the design of optimization algorithms. Specifically, we study the normalized gradient descent (NormGD) algorithm for solving parameter estimation in parametric statistical models, which is a variant of gradient descent algorithm whose step size is scaled by the maximum eigenvalue of the Hessian matrix of the empirical loss function of statistical models. When the population loss function, i.e., the limit of the empirical loss function when $n$ goes to infinity, is homogeneous in all directions, we demonstrate that the NormGD iterates reach a final statistical radius around the true parameter after a logarithmic number of iterations in terms of $n$. Therefore, for fixed dimension $d$, the NormGD algorithm achieves the optimal overall computational complexity $\mathcal{O}(n)$ to reach the final statistical radius. This computational complexity is cheaper than that of the fixed step-size gradient descent algorithm, which is of the order $\mathcal{O}(n^{\tau})$ for some $\tau > 1$, to reach the same statistical radius. We illustrate our general theory under two statistical models: generalized linear models and mixture models, and experimental results support our prediction with general theory.  ( 2 min )
    A multiscale spatiotemporal approach for smallholder irrigation detection. (arXiv:2202.04239v1 [cs.CV])
    In presenting an irrigation detection methodology that leverages multiscale satellite imagery of vegetation abundance, this paper introduces a process to supplement limited ground-collected labels and ensure classifier applicability in an area of interest. Spatiotemporal analysis of MODIS 250m Enhanced Vegetation Index (EVI) timeseries characterizes native vegetation phenologies at regional scale to provide the basis for a continuous phenology map that guides supplementary label collection over irrigated and non-irrigated agriculture. Subsequently, validated dry season greening and senescence cycles observed in 10m Sentinel-2 imagery are used to train a suite of classifiers for automated detection of potential smallholder irrigation. Strategies to improve model robustness are demonstrated, including a method of data augmentation that randomly shifts training samples; and an assessment of classifier types that produce the best performance in withheld target regions. The methodology is applied to detect smallholder irrigation in two states in the Ethiopian highlands, Tigray and Amhara. Results show that a transformer-based neural network architecture allows for the most robust prediction performance in withheld regions, followed closely by a CatBoost random forest model. Over withheld ground-collection survey labels, the transformer-based model achieves 96.7% accuracy over non-irrigated samples and 95.9% accuracy over irrigated samples. Over a larger set of samples independently collected via the introduced method of label supplementation, non-irrigated and irrigated labels are predicted with 98.3% and 95.5% accuracy, respectively. The detection model is then deployed over Tigray and Amhara, revealing crop rotation patterns and year-over-year irrigated area change. Predictions suggest that irrigated area in these two states has decreased by approximately 40% from 2020 to 2021.  ( 2 min )
    Fault Detection and Diagnosis with Imbalanced and Noisy Data: A Hybrid Framework for Rotating Machinery. (arXiv:2202.04212v1 [cs.LG])
    Fault diagnosis plays an essential role in reducing the maintenance costs of rotating machinery manufacturing systems. In many real applications of fault detection and diagnosis, data tend to be imbalanced, meaning that the number of samples for some fault classes is much less than the normal data samples. At the same time, in an industrial condition, accelerometers encounter high levels of disruptive signals and the collected samples turn out to be heavily noisy. As a consequence, many traditional Fault Detection and Diagnosis (FDD) frameworks get poor classification performances when dealing with real-world circumstances. Three main solutions have been proposed in the literature to cope with this problem: (1) the implementation of generative algorithms to increase the amount of under-represented input samples, (2) the employment of a classifier being powerful to learn from imbalanced and noisy data, (3) the development of an efficient data pre-processing including feature extraction and data augmentation. This paper proposes a hybrid framework which uses the three aforementioned components to achieve an effective signal-based FDD system for imbalanced conditions. Specifically, it first extracts the fault features, using Fourier and wavelet transforms to make full use of the signals. Then, it employs Wasserstein Generative Adversarial Networks (WGAN) to generate synthetic samples to populate the rare fault class and enhance the training set. Moreover, to achieve a higher performance a novel combination of Convolutional Long Short-term Memory (CLSTM) and Weighted Extreme Learning Machine (WELM) is proposed. To verify the effectiveness of the developed framework, different datasets settings on different imbalance severities and noise degrees were used. The comparative results demonstrate that in different scenarios GAN-CLSTM-ELM outperforms the other state-of-the-art FDD frameworks.  ( 2 min )
    Detecting and Localizing Copy-Move and Image-Splicing Forgery. (arXiv:2202.04069v1 [cs.CV])
    In the world of fake news and deepfakes, there have been an alarmingly large number of cases of images being tampered with and published in newspapers, used in court, and posted on social media for defamation purposes. Detecting these tampered images is an important task and one we try to tackle. In this paper, we focus on the methods to detect if an image has been tampered with using both Deep Learning and Image transformation methods and comparing the performances and robustness of each method. We then attempt to identify the tampered area of the image and predict the corresponding mask. Based on the results, suggestions and approaches are provided to achieve a more robust framework to detect and identify the forgeries.  ( 2 min )
    Teaching Networks to Solve Optimization Problems. (arXiv:2202.04104v1 [cs.LG])
    Leveraging machine learning to optimize the optimization process is an emerging field which holds the promise to bypass the fundamental computational bottleneck caused by traditional iterative solvers in critical applications requiring near-real-time optimization. The majority of existing approaches focus on learning data-driven optimizers that lead to fewer iterations in solving an optimization. In this paper, we take a different approach and propose to replace the iterative solvers altogether with a trainable parametric set function that outputs the optimal arguments/parameters of an optimization problem in a single feed-forward. We denote our method as, Learning to Optimize the Optimization Process (LOOP). We show the feasibility of learning such parametric (set) functions to solve various classic optimization problems, including linear/nonlinear regression, principal component analysis, transport-based core-set, and quadratic programming in supply management applications. In addition, we propose two alternative approaches for learning such parametric functions, with and without a solver in the-LOOP. Finally, we demonstrate the effectiveness of our proposed approach through various numerical experiments.  ( 2 min )
    The Rate-Distortion-Perception Tradeoff: The Role of Common Randomness. (arXiv:2202.04147v1 [cs.IT])
    A rate-distortion-perception (RDP) tradeoff has recently been proposed by Blau and Michaeli and also Matsumoto. Focusing on the case of perfect realism, which coincides with the problem of distribution-preserving lossy compression studied by Li et al., a coding theorem for the RDP tradeoff that allows for a specified amount of common randomness between the encoder and decoder is provided. The existing RDP tradeoff is recovered by allowing for the amount of common randomness to be infinite. The quadratic Gaussian case is examined in detail.  ( 2 min )
    The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.5M Screening and Diagnostic Mammograms. (arXiv:2202.04073v1 [eess.IV])
    Developing and validating artificial intelligence models in medical imaging requires datasets that are large, granular, and diverse. To date, the majority of publicly available breast imaging datasets lack in one or more of these areas. Models trained on these data may therefore underperform on patient populations or pathologies that have not previously been encountered. The EMory BrEast imaging Dataset (EMBED) addresses these gaps by providing 3650,000 2D and DBT screening and diagnostic mammograms for 116,000 women divided equally between White and African American patients. The dataset also contains 40,000 annotated lesions linked to structured imaging descriptors and 61 ground truth pathologic outcomes grouped into six severity classes. Our goal is to share this dataset with research partners to aid in development and validation of breast AI models that will serve all patients fairly and help decrease bias in medical AI.  ( 2 min )
    Learning Similarity Metrics for Volumetric Simulations with Multiscale CNNs. (arXiv:2202.04109v1 [cs.LG])
    Simulations that produce three-dimensional data are ubiquitous in science, ranging from fluid flows to plasma physics. We propose a similarity model based on entropy, which allows for the creation of physically meaningful ground truth distances for the similarity assessment of scalar and vectorial data, produced from transport and motion-based simulations. Utilizing two data acquisition methods derived from this model, we create collections of fields from numerical PDE solvers and existing simulation data repositories, and highlight the importance of an appropriate data distribution for an effective training process. Furthermore, a multiscale CNN architecture that computes a volumetric similarity metric (VolSiM) is proposed. To the best of our knowledge this is the first learning method inherently designed to address the challenges arising for the similarity assessment of high-dimensional simulation data. Additionally, the tradeoff between a large batch size and an accurate correlation computation for correlation-based loss functions is investigated, and the metric's invariance with respect to rotation and scale operations is analyzed. Finally, the robustness and generalization of VolSiM is evaluated on a large range of test data, as well as a particularly challenging turbulence case study, that is close to potential real-world applications.  ( 2 min )
    Independent Policy Gradient for Large-Scale Markov Potential Games: Sharper Rates, Function Approximation, and Game-Agnostic Convergence. (arXiv:2202.04129v1 [cs.LG])
    We examine global non-asymptotic convergence properties of policy gradient methods for multi-agent reinforcement learning (RL) problems in Markov potential games (MPG). To learn a Nash equilibrium of an MPG in which the size of state space and/or the number of players can be very large, we propose new independent policy gradient algorithms that are run by all players in tandem. When there is no uncertainty in the gradient evaluation, we show that our algorithm finds an $\epsilon$-Nash equilibrium with $O(1/\epsilon^2)$ iteration complexity which does not explicitly depend on the state space size. When the exact gradient is not available, we establish $O(1/\epsilon^5)$ sample complexity bound in a potentially infinitely large state space for a sample-based algorithm that utilizes function approximation. Moreover, we identify a class of independent policy gradient algorithms that enjoys convergence for both zero-sum Markov games and Markov cooperative games with the players that are oblivious to the types of games being played. Finally, we provide computational experiments to corroborate the merits and the effectiveness of our theoretical developments.  ( 2 min )
    Police Text Analysis: Topic Modeling and Spatial Relative Density Estimation. (arXiv:2202.04176v1 [cs.LG])
    We analyze a large corpus of police incident narrative documents in understanding the spatial distribution of the topics. The motivation for doing this is that police narratives in each incident report contains very fine-grained information that is richer than the category that is manually assigned by the police. Our approach is to split the corpus into topics using two different unsupervised machine learning algorithms - Latent Dirichlet Allocation and Non-negative Matrix Factorization. We validate the performance of each learned topic model using model coherence. Then, using a k-nearest neighbors density ratio estimation (kNN-DRE) approach that we propose, we estimate the spatial density ratio per topic and use this for data discovery and analysis of each topic, allowing for insights into the described incidents at scale. We provide a qualitative assessment of each topic and highlight some key benefits for using our kNN-DRE model for estimating spatial trends.  ( 2 min )
    Hierarchical Dependency Constrained Tree Augmented Naive Bayes Classifiers for Hierarchical Feature Spaces. (arXiv:2202.04105v1 [cs.LG])
    The Tree Augmented Naive Bayes (TAN) classifier is a type of probabilistic graphical model that constructs a single-parent dependency tree to estimate the distribution of the data. In this work, we propose two novel Hierarchical dependency-based Tree Augmented Naive Bayes algorithms, i.e. Hie-TAN and Hie-TAN-Lite. Both methods exploit the pre-defined parent-child (generalisation-specialisation) relationships between features as a type of constraint to learn the tree representation of dependencies among features, whilst the latter further eliminates the hierarchical redundancy during the classifier learning stage. The experimental results showed that Hie-TAN successfully obtained better predictive performance than several other hierarchical dependency constrained classification algorithms, and its predictive performance was further improved by eliminating the hierarchical redundancy, as suggested by the higher accuracy obtained by Hie-TAN-Lite.  ( 2 min )
    TransformNet: Self-supervised representation learning through predicting geometric transformations. (arXiv:2202.04181v1 [cs.CV])
    Deep neural networks need a big amount of training data, while in the real world there is a scarcity of data available for training purposes. To resolve this issue unsupervised methods are used for training with limited data. In this report, we describe the unsupervised semantic feature learning approach for recognition of the geometric transformation applied to the input data. The basic concept of our approach is that if someone is unaware of the objects in the images, he/she would not be able to quantitatively predict the geometric transformation that was applied to them. This self supervised scheme is based on pretext task and the downstream task. The pretext classification task to quantify the geometric transformations should force the CNN to learn high-level salient features of objects useful for image classification. In our baseline model, we define image rotations by multiples of 90 degrees. The CNN trained on this pretext task will be used for the classification of images in the CIFAR-10 dataset as a downstream task. we run the baseline method using various models, including ResNet, DenseNet, VGG-16, and NIN with a varied number of rotations in feature extracting and fine-tuning settings. In extension of this baseline model we experiment with transformations other than rotation in pretext task. We compare performance of selected models in various settings with different transformations applied to images,various data augmentation techniques as well as using different optimizers. This series of different type of experiments will help us demonstrate the recognition accuracy of our self-supervised model when applied to a downstream task of classification.  ( 2 min )
    A Mini-Block Natural Gradient Method for Deep Neural Networks. (arXiv:2202.04124v1 [cs.LG])
    The training of deep neural networks (DNNs) is currently predominantly done using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs, by preconditioning the stochastic gradient by layer-wise block-diagonal matrices. Here we propose and analyze the convergence of an approximate natural gradient method, mini-block Fisher (MBF), that lies in between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is also block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF's per-iteration computational cost is only slightly higher than it is for first-order methods. Finally, the performance of our proposed method is compared to that of several baseline methods, on both Auto-encoder and CNN problems, to validate its effectiveness both in terms of time efficiency and generalization power.  ( 2 min )
  • Open

    Graph-Relational Domain Adaptation. (arXiv:2202.03628v1 [cs.LG] CROSS LISTED)
    Existing domain adaptation methods tend to treat every domain equally and align them all perfectly. Such uniform alignment ignores topological structures among different domains; therefore it may be beneficial for nearby domains, but not necessarily for distant domains. In this work, we relax such uniform alignment by using a domain graph to encode domain adjacency, e.g., a graph of states in the US with each state as a domain and each edge indicating adjacency, thereby allowing domains to align flexibly based on the graph structure. We generalize the existing adversarial learning framework with a novel graph discriminator using encoding-conditioned graph embeddings. Theoretical analysis shows that at equilibrium, our method recovers classic domain adaptation when the graph is a clique, and achieves non-trivial alignment for other types of graphs. Empirical results show that our approach successfully generalizes uniform alignment, naturally incorporates domain information represented by graphs, and improves upon existing domain adaptation methods on both synthetic and real-world datasets. Code will soon be available at https://github.com/Wang-ML-Lab/GRDA.  ( 2 min )
    Learning Temporally Causal Latent Processes from General Temporal Data. (arXiv:2110.05428v4 [stat.ML] UPDATED)
    Our goal is to recover time-delayed latent causal variables and identify their relations from measured temporal data. Estimating causally-related latent variables from observations is particularly challenging as the latent variables are not uniquely recoverable in the most general case. In this work, we consider both a nonparametric, nonstationary setting and a parametric setting for the latent processes and propose two provable conditions under which temporally causal latent processes can be identified from their nonlinear mixtures. We propose LEAP, a theoretically-grounded framework that extends Variational AutoEncoders (VAEs) by enforcing our conditions through proper constraints in causal process prior. Experimental results on various datasets demonstrate that temporally causal latent processes are reliably identified from observed variables under different dependency structures and that our approach considerably outperforms baselines that do not properly leverage history or nonstationarity information. This demonstrates that using temporal information to learn latent processes from their invertible nonlinear mixtures in an unsupervised manner, for which we believe our work is one of the first, seems promising even without sparsity or minimality assumptions.  ( 2 min )
    Approximate Bayesian inference and forecasting in huge-dimensional multi-country VARs. (arXiv:2103.04944v2 [econ.EM] UPDATED)
    Panel Vector Autoregressions (PVARs) are a popular tool for analyzing multi-country datasets. However, the number of estimated parameters can be enormous, leading to computational and statistical issues. In this paper, we develop fast Bayesian methods for estimating PVARs using integrated rotated Gaussian approximations. We exploit the fact that domestic information is often more important than international information and group the coefficients accordingly. Fast approximations are used to estimate the latter while the former are estimated with precision using Markov chain Monte Carlo techniques. We illustrate, using a huge model of the world economy, that it produces competitive forecasts quickly.  ( 2 min )
    Unifying Likelihood-free Inference with Black-box Optimization and Beyond. (arXiv:2110.03372v2 [cs.LG] UPDATED)
    Black-box optimization formulations for biological sequence design have drawn recent attention due to their promising potential impact on the pharmaceutical industry. In this work, we propose to unify two seemingly distinct worlds: likelihood-free inference and black-box optimization, under one probabilistic framework. In tandem, we provide a recipe for constructing various sequence design methods based on this framework. We show how previous optimization approaches can be "reinvented" in our framework, and further propose new probabilistic black-box optimization algorithms. Extensive experiments on sequence design application illustrate the benefits of the proposed methodology.  ( 2 min )
    On Margin Maximization in Linear and ReLU Networks. (arXiv:2110.02732v3 [cs.LG] UPDATED)
    The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem. In this paper, we study this question in detail, for several neural network architectures involving linear and ReLU activations. Perhaps surprisingly, we show that in many cases, the KKT point is not even a local optimum of the max margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed.  ( 2 min )
    Are deep learning models superior for missing data imputation in large surveys? Evidence from an empirical comparison. (arXiv:2103.09316v2 [cs.LG] UPDATED)
    Multiple imputation (MI) is a popular approach for dealing with missing data arising from non-response in sample surveys. Multiple imputation by chained equations (MICE) is one of the most widely used MI algorithms for multivariate data, but it lacks theoretical foundation and is computationally intensive. Recently, missing data imputation methods based on deep learning models have been developed with encouraging results in small studies. However, there has been limited research on evaluating their performance in realistic settings compared to MICE, particularly in large-scale surveys. We conduct extensive simulation studies based on American Community Survey data to compare the repeated sampling properties of four machine learning based MI methods: MICE with classification trees, MICE with random forests, generative adversarial imputation networks, and multiple imputation using denoising autoencoders. We find the deep learning based imputation methods is superior to MICE in terms of computational time. However, with the default choice of hyperparameters in the common software packages, MICE with classification trees consistently outperforms, often by a large margin, the deep learning imputation methods in terms of bias, mean squared error, and coverage under a range of realistic settings.  ( 2 min )
    Supervised Multivariate Learning with Simultaneous Feature Auto-grouping and Dimension Reduction. (arXiv:2112.09746v2 [stat.ML] UPDATED)
    Modern high-dimensional methods often adopt the "bet on sparsity" principle, while in supervised multivariate learning statisticians may face "dense" problems with a large number of nonzero coefficients. This paper proposes a novel clustered reduced-rank learning (CRL) framework that imposes two joint matrix regularizations to automatically group the features in constructing predictive factors. CRL is more interpretable than low-rank modeling and relaxes the stringent sparsity assumption in variable selection. In this paper, new information-theoretical limits are presented to reveal the intrinsic cost of seeking for clusters, as well as the blessing from dimensionality in multivariate learning. Moreover, an efficient optimization algorithm is developed, which performs subspace learning and clustering with guaranteed convergence. The obtained fixed-point estimators, though not necessarily globally optimal, enjoy the desired statistical accuracy beyond the standard likelihood setup under some regularity conditions. Moreover, a new kind of information criterion, as well as its scale-free form, is proposed for cluster and rank selection, and has a rigorous theoretical support without assuming an infinite sample size. Extensive simulations and real-data experiments demonstrate the statistical accuracy and interpretability of the proposed method.  ( 2 min )
    When can unlabeled data improve the learning rate?. (arXiv:1905.11866v2 [cs.LG] UPDATED)
    In semi-supervised classification, one is given access both to labeled and unlabeled data. As unlabeled data is typically cheaper to acquire than labeled data, this setup becomes advantageous as soon as one can exploit the unlabeled data in order to produce a better classifier than with labeled data alone. However, the conditions under which such an improvement is possible are not fully understood yet. Our analysis focuses on improvements in the minimax learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones). We argue that for such improvements to be realistic and indisputable, certain specific conditions should be satisfied and previous analyses have failed to meet those conditions. We then demonstrate examples where these conditions can be met, in particular showing rate changes from $1/\sqrt{\ell}$ to $e^{-c\ell}$ and from $1/\sqrt{\ell}$ to $1/\ell$. These results improve our understanding of what is and isn't possible in semi-supervised learning.  ( 2 min )
    Rare event estimation using stochastic spectral embedding. (arXiv:2106.05824v2 [cs.LG] UPDATED)
    Estimating the probability of rare failure events is an essential step in the reliability assessment of engineering systems. Computing this failure probability for complex non-linear systems is challenging, and has recently spurred the development of active-learning reliability methods. These methods approximate the limit-state function (LSF) using surrogate models trained with a sequentially enriched set of model evaluations. A recently proposed method called stochastic spectral embedding (SSE) aims to improve the local approximation accuracy of global, spectral surrogate modelling techniques by sequentially embedding local residual expansions in subdomains of the input space. In this work we apply SSE to the LSF, giving rise to a stochastic spectral embedding-based reliability (SSER) method. The resulting partition of the input space decomposes the failure probability into a set of easy-to-compute \rev{conditional} failure probabilities. We propose a set of modifications that tailor the algorithm to efficiently solve rare event estimation problems. These modifications include specialized refinement domain selection, partitioning and enrichment strategies. We showcase the algorithm performance on four benchmark problems of various dimensionality and complexity in the LSF.  ( 2 min )
    BayesFlow Can Reliably Detect Model Misspecification and Posterior Errors in Amortized Bayesian Inference. (arXiv:2112.08866v2 [stat.ME] UPDATED)
    Neural density estimators have proven remarkably powerful in performing efficient simulation-based Bayesian inference in various research domains. In particular, the BayesFlow framework uses a two-step approach to enable amortized parameter estimation in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations are poor representations of reality? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the latent data space and utilize maximum mean discrepancy (MMD) to detect potentially catastrophic misspecifications during inference undermining the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase as a function of the distance between the true data-generating distribution and the typical set of simulations in the latent summary space. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized Bayesian inference.  ( 2 min )
    Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo. (arXiv:2202.04599v1 [cs.LG])
    Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data and identifying outliers. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations. To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation, supervised learning and outlier identification with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM. Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.  ( 2 min )
    Simple Adaptive Estimation of Quadratic Functionals in Nonparametric IV Models. (arXiv:2101.12282v2 [math.ST] UPDATED)
    This paper considers adaptive, minimax estimation of a quadratic functional in a nonparametric instrumental variables (NPIV) model, which is an important problem in optimal estimation of a nonlinear functional of an ill-posed inverse regression with an unknown operator. We first show that a leave-one-out, sieve NPIV estimator of the quadratic functional can attain a convergence rate that coincides with the lower bound previously derived in Chen and Christensen [2018]. The minimax rate is achieved by the optimal choice of the sieve dimension (a key tuning parameter) that depends on the smoothness of the NPIV function and the degree of ill-posedness, both are unknown in practice. We next propose a Lepski-type data-driven choice of the key sieve dimension adaptive to the unknown NPIV model features. The adaptive estimator of the quadratic functional is shown to attain the minimax optimal rate in the severely ill-posed case and in the regular mildly ill-posed case, but up to a multiplicative $\sqrt{\log n}$ factor in the irregular mildly ill-posed case.  ( 2 min )
    Frank-Wolfe with a Nearest Extreme Point Oracle. (arXiv:2102.02029v2 [math.OC] UPDATED)
    We consider variants of the classical Frank-Wolfe algorithm for constrained smooth convex minimization, that instead of access to the standard oracle for minimizing a linear function over the feasible set, have access to an oracle that can find an extreme point of the feasible set that is closest in Euclidean distance to a given vector. We first show that for many feasible sets of interest, such an oracle can be implemented with the same complexity as the standard linear optimization oracle. We then show that with such an oracle we can design new Frank-Wolfe variants which enjoy significantly improved complexity bounds in case the set of optimal solutions lies in the convex hull of a subset of extreme points with small diameter (e.g., a low-dimensional face of a polytope). In particular, for many $0\text{--}1$ polytopes, under quadratic growth and strict complementarity conditions, we obtain the first linearly convergent variant with rate that depends only on the dimension of the optimal face and not on the ambient dimension.  ( 2 min )
    Classification with Nearest Disjoint Centroids. (arXiv:2109.10436v2 [cs.LG] UPDATED)
    In this paper, we develop a new classification method based on nearest centroid, and it is called the nearest disjoint centroid classifier. Our method differs from the nearest centroid classifier in the following two aspects: (1) the centroids are defined based on disjoint subsets of features instead of all the features, and (2) the distance is induced by the dimensionality-normalized norm instead of the Euclidean norm. We provide a few theoretical results regarding our method. In addition, we propose a simple algorithm based on adapted k-means clustering that can find the disjoint subsets of features used in our method, and extend the algorithm to perform feature selection. We evaluate and compare the performance of our method to other classification methods on both simulated data and real-world gene expression datasets. The results demonstrate that our method is able to outperform other competing classifiers by having smaller misclassification rates and/or using fewer features in various settings and situations.  ( 2 min )
    Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning. (arXiv:2004.10888v5 [cs.LG] UPDATED)
    We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite horizon MDP optimizing the variance of a per-step reward random variable. MVPI enjoys great flexibility in that any policy evaluation method and risk-neutral control method can be dropped in for risk-averse control off the shelf, in both on- and off-policy settings. This flexibility reduces the gap between risk-neutral control and risk-averse control and is achieved by working on a novel augmented MDP directly. We propose risk-averse TD3 as an example instantiating MVPI, which outperforms vanilla TD3 and many previous risk-averse control methods in challenging Mujoco robot simulation tasks under a risk-aware performance metric. This risk-averse TD3 is the first to introduce deterministic policies and off-policy learning into risk-averse reinforcement learning, both of which are key to the performance boost we show in Mujoco domains.  ( 2 min )
    Precision Radiotherapy via Information Integration of Expert Human Knowledge and AI Recommendation to Optimize Clinical Decision Making. (arXiv:2202.04565v1 [stat.ML])
    In the precision medicine era, there is a growing need for precision radiotherapy where the planned radiation dose needs to be optimally determined by considering a myriad of patient-specific information in order to ensure treatment efficacy. Existing artificial-intelligence (AI) methods can recommend radiation dose prescriptions within the scope of this available information. However, treating physicians may not fully entrust the AI's recommended prescriptions due to known limitations or when the AI recommendation may go beyond physicians' current knowledge. This paper lays out a systematic method to integrate expert human knowledge with AI recommendations for optimizing clinical decision making. Towards this goal, Gaussian process (GP) models are integrated with deep neural networks (DNNs) to quantify the uncertainty of the treatment outcomes given by physicians and AI recommendations, respectively, which are further used as a guideline to educate clinical physicians and improve AI models performance. The proposed method is demonstrated in a comprehensive dataset where patient-specific information and treatment outcomes are prospectively collected during radiotherapy of $67$ non-small cell lung cancer patients and retrospectively analyzed.  ( 2 min )
    A Projection-free Algorithm for Constrained Stochastic Multi-level Composition Optimization. (arXiv:2202.04296v1 [math.OC])
    We propose a projection-free conditional gradient-type algorithm for smooth stochastic multi-level composition optimization, where the objective function is a nested composition of $T$ functions and the constraint set is a closed convex set. Our algorithm assumes access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle satisfying certain standard unbiasedness and second moment assumptions. We show that the number of calls to the stochastic first-order oracle and the linear-minimization oracle required by the proposed algorithm, to obtain an $\epsilon$-stationary solution, are of order $\mathcal{O}_T(\epsilon^{-2})$ and $\mathcal{O}_T(\epsilon^{-3})$ respectively, where $\mathcal{O}_T$ hides constants in $T$. Notably, the dependence of these complexity bounds on $\epsilon$ and $T$ are separate in the sense that changing one does not impact the dependence of the bounds on the other. Moreover, our algorithm is parameter-free and does not require any (increasing) order of mini-batches to converge unlike the common practice in the analysis of stochastic conditional gradient-type algorithms.  ( 2 min )
    Adapting to Mixing Time in Stochastic Optimization with Markovian Data. (arXiv:2202.04428v1 [cs.LG])
    We consider stochastic optimization problems where data is drawn from a Markov chain. Existing methods for this setting crucially rely on knowing the mixing time of the chain, which in real-world applications is usually unknown. We propose the first optimization method that does not require the knowledge of the mixing time, yet obtains the optimal asymptotic convergence rate when applied to convex problems. We further show that our approach can be extended to: (i) finding stationary points in non-convex optimization with Markovian data, and (ii) obtaining better dependence on the mixing time in temporal difference (TD) learning; in both cases, our method is completely oblivious to the mixing time. Our method relies on a novel combination of multi-level Monte Carlo (MLMC) gradient estimation together with an adaptive learning method.  ( 2 min )
    Independent mechanism analysis, a new concept?. (arXiv:2106.05200v3 [stat.ML] UPDATED)
    Independent component analysis provides a principled framework for unsupervised representation learning, with solid theory on the identifiability of the latent code that generated the data, given only observations of mixtures thereof. Unfortunately, when the mixing is nonlinear, the model is provably nonidentifiable, since statistical independence alone does not sufficiently constrain the problem. Identifiability can be recovered in settings where additional, typically observed variables are included in the generative process. We investigate an alternative path and consider instead including assumptions reflecting the principle of independent causal mechanisms exploited in the field of causality. Specifically, our approach is motivated by thinking of each source as independently influencing the mixing process. This gives rise to a framework which we term independent mechanism analysis. We provide theoretical and empirical evidence that our approach circumvents a number of nonidentifiability issues arising in nonlinear blind source separation.  ( 2 min )
    Finding Optimal Arms in Non-stochastic Combinatorial Bandits with Semi-bandit Feedback and Finite Budget. (arXiv:2202.04487v1 [cs.LG])
    We consider the combinatorial bandits problem with semi-bandit feedback under finite sampling budget constraints, in which the learner can carry out its action only for a limited number of times specified by an overall budget. The action is to choose a set of arms, whereupon feedback for each arm in the chosen set is received. Unlike existing works, we study this problem in a non-stochastic setting with subset-dependent feedback, i.e., the semi-bandit feedback received could be generated by an oblivious adversary and also might depend on the chosen set of arms. In addition, we consider a general feedback scenario covering both the numerical-based as well as preference-based case and introduce a sound theoretical framework for this setting guaranteeing sensible notions of optimal arms, which a learner seeks to find. We suggest a generic algorithm suitable to cover the full spectrum of conceivable arm elimination strategies from aggressive to conservative. Theoretical questions about the sufficient and necessary budget of the algorithm to find the best arm are answered and complemented by deriving lower bounds for any learning algorithm for this problem scenario.  ( 2 min )
    Cost-effective Framework for Gradual Domain Adaptation with Multifidelity. (arXiv:2202.04359v1 [stat.ML])
    In domain adaptation, when there is a large distance between the source and target domains, the prediction performance will degrade. Gradual domain adaptation is one of the solutions to such an issue, assuming that we have access to intermediate domains, which shift gradually from the source to target domains. In previous works, it was assumed that the number of samples in the intermediate domains is sufficiently large; hence, self-training was possible without the need for labeled data. If access to an intermediate domain is restricted, self-training will fail. Practically, the cost of samples in intermediate domains will vary, and it is natural to consider that the closer an intermediate domain is to the target domain, the higher the cost of obtaining samples from the intermediate domain is. To solve the trade-off between cost and accuracy, we propose a framework that combines multifidelity and active domain adaptation. The effectiveness of the proposed method is evaluated by experiments with both artificial and real-world datasets. Codes are available at https://github.com/ssgw320/gdamf.  ( 2 min )
    Improving Computational Complexity in Statistical Models with Second-Order Information. (arXiv:2202.04219v1 [stat.ML])
    It is known that when the statistical models are singular, i.e., the Fisher information matrix at the true parameter is degenerate, the fixed step-size gradient descent algorithm takes polynomial number of steps in terms of the sample size $n$ to converge to a final statistical radius around the true parameter, which can be unsatisfactory for the application. To further improve that computational complexity, we consider the utilization of the second-order information in the design of optimization algorithms. Specifically, we study the normalized gradient descent (NormGD) algorithm for solving parameter estimation in parametric statistical models, which is a variant of gradient descent algorithm whose step size is scaled by the maximum eigenvalue of the Hessian matrix of the empirical loss function of statistical models. When the population loss function, i.e., the limit of the empirical loss function when $n$ goes to infinity, is homogeneous in all directions, we demonstrate that the NormGD iterates reach a final statistical radius around the true parameter after a logarithmic number of iterations in terms of $n$. Therefore, for fixed dimension $d$, the NormGD algorithm achieves the optimal overall computational complexity $\mathcal{O}(n)$ to reach the final statistical radius. This computational complexity is cheaper than that of the fixed step-size gradient descent algorithm, which is of the order $\mathcal{O}(n^{\tau})$ for some $\tau > 1$, to reach the same statistical radius. We illustrate our general theory under two statistical models: generalized linear models and mixture models, and experimental results support our prediction with general theory.  ( 2 min )
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v1 [math.OC])
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.  ( 2 min )
    A hypothesis-driven method based on machine learning for neuroimaging data analysis. (arXiv:2202.04397v1 [stat.ML])
    There remains an open question about the usefulness and the interpretation of Machine learning (MLE) approaches for discrimination of spatial patterns of brain images between samples or activation states. In the last few decades, these approaches have limited their operation to feature extraction and linear classification tasks for between-group inference. In this context, statistical inference is assessed by randomly permuting image labels or by the use of random effect models that consider between-subject variability. These multivariate MLE-based statistical pipelines, whilst potentially more effective for detecting activations than hypotheses-driven methods, have lost their mathematical elegance, ease of interpretation, and spatial localization of the ubiquitous General linear Model (GLM). Recently, the estimation of the conventional GLM has been demonstrated to be connected to an univariate classification task when the design matrix is expressed as a binary indicator matrix. In this paper we explore the complete connection between the univariate GLM and MLE \emph{regressions}. To this purpose we derive a refined statistical test with the GLM based on the parameters obtained by a linear Support Vector Regression (SVR) in the \emph{inverse} problem (SVR-iGLM). Subsequently, random field theory (RFT) is employed for assessing statistical significance following a conventional GLM benchmark. Experimental results demonstrate how parameter estimations derived from each model (mainly GLM and SVR) result in different experimental design estimates that are significantly related to the predefined functional task. Moreover, using real data from a multisite initiative the proposed MLE-based inference demonstrates statistical power and the control of false positives, outperforming the regular GLM.  ( 2 min )
    The Lifecycle of a Statistical Model: Model Failure Detection, Identification, and Refitting. (arXiv:2202.04166v1 [stat.ME])
    The statistical machine learning community has demonstrated considerable resourcefulness over the years in developing highly expressive tools for estimation, prediction, and inference. The bedrock assumptions underlying these developments are that the data comes from a fixed population and displays little heterogeneity. But reality is significantly more complex: statistical models now routinely fail when released into real-world systems and scientific applications, where such assumptions rarely hold. Consequently, we pursue a different path in this paper vis-a-vis the well-worn trail of developing new methodology for estimation and prediction. In this paper, we develop tools and theory for detecting and identifying regions of the covariate space (subpopulations) where model performance has begun to degrade, and study intervening to fix these failures through refitting. We present empirical results with three real-world data sets -- including a time series involving forecasting the incidence of COVID-19 -- showing that our methodology generates interpretable results, is useful for tracking model performance, and can boost model performance through refitting. We complement these empirical results with theory proving that our methodology is minimax optimal for recovering anomalous subpopulations as well as refitting to improve accuracy in a structured normal means setting.  ( 2 min )
    A new perspective on classification: optimally allocating limited resources to uncertain tasks. (arXiv:2202.04369v1 [cs.LG])
    A central problem in business concerns the optimal allocation of limited resources to a set of available tasks, where the payoff of these tasks is inherently uncertain. In credit card fraud detection, for instance, a bank can only assign a small subset of transactions to their fraud investigations team. Typically, such problems are solved using a classification framework, where the focus is on predicting task outcomes given a set of characteristics. Resources are then allocated to the tasks that are predicted to be the most likely to succeed. However, we argue that using classification to address task uncertainty is inherently suboptimal as it does not take into account the available capacity. Therefore, we first frame the problem as a type of assignment problem. Then, we present a novel solution using learning to rank by directly optimizing the assignment's expected profit given limited, stochastic capacity. This is achieved by optimizing a specific instance of the net discounted cumulative gain, a commonly used class of metrics in learning to rank. Empirically, we demonstrate that our new method achieves higher expected profit and expected precision compared to a classification approach for a wide variety of application areas and data sets. This illustrates the benefit of an integrated approach and of explicitly considering the available resources when learning a predictive model.  ( 2 min )
    Multi-model Ensemble Analysis with Neural Network Gaussian Processes. (arXiv:2202.04152v1 [stat.AP])
    Multi-model ensemble analysis integrates information from multiple climate models into a unified projection. However, existing integration approaches based on model averaging can dilute fine-scale spatial information and incur bias from rescaling low-resolution climate models. We propose a statistical approach, called NN-GPR, using Gaussian process regression (GPR) with an infinitely wide deep neural network based covariance function. NN-GPR requires no assumptions about the relationships between models, no interpolation to a common grid, no stationarity assumptions, and automatically downscales as part of its prediction algorithm. Model experiments show that NN-GPR can be highly skillful at surface temperature and precipitation forecasting by preserving geospatial signals at multiple scales and capturing inter-annual variability. Our projections particularly show improved accuracy and uncertainty quantification skill in regions of high variability, which allows us to cheaply assess tail behavior at a 0.44$^\circ$/50 km spatial resolution without a regional climate model (RCM). Evaluations on reanalysis data and SSP245 forced climate models show that NN-GPR produces similar, overall climatologies to the model ensemble while better capturing fine scale spatial patterns. Finally, we compare NN-GPR's regional predictions against two RCMs and show that NN-GPR can rival the performance of RCMs using only global model data as input.  ( 2 min )
    A Probabilistic Generative Grammar for Semantic Parsing. (arXiv:1606.06361v2 [cs.CL] UPDATED)
    Domain-general semantic parsing is a long-standing goal in natural language processing, where the semantic parser is capable of robustly parsing sentences from domains outside of which it was trained. Current approaches largely rely on additional supervision from new domains in order to generalize to those domains. We present a generative model of natural language utterances and logical forms and demonstrate its application to semantic parsing. Our approach relies on domain-independent supervision to generalize to new domains. We derive and implement efficient algorithms for training, parsing, and sentence generation. The work relies on a novel application of hierarchical Dirichlet processes (HDPs) for structured prediction, which we also present in this manuscript. This manuscript is an excerpt of chapter 4 from the Ph.D. thesis of Saparov (2022), where the model plays a central role in a larger natural language understanding system. This manuscript provides a new simplified and more complete presentation of the work first introduced in Saparov, Saraswat, and Mitchell (2017). The description and proofs of correctness of the training algorithm, parsing algorithm, and sentence generation algorithm are much simplified in this new presentation. We also describe the novel application of hierarchical Dirichlet processes for structured prediction. In addition, we extend the earlier work with a new model of word morphology, which utilizes the comprehensive morphological data from Wiktionary.  ( 2 min )
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v1 [stat.ML])
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, after using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and by testing on synthetic data, show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.  ( 2 min )
    Policy Optimization with Stochastic Mirror Descent. (arXiv:1906.10462v5 [cs.LG] UPDATED)
    Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes $\mathtt{VRMPO}$ algorithm: a sample efficient policy gradient method with stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best sample complexity for policy optimization. The extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms the state-of-the-art policy gradient methods in various settings.  ( 2 min )
    Understanding the bias-variance tradeoff of Bregman divergences. (arXiv:2202.04167v1 [stat.ML])
    This paper builds upon the work of Pfau (2013), which generalized the bias variance tradeoff to any Bregman divergence loss function. Pfau (2013) showed that for Bregman divergences, the bias and variances are defined with respect to a central label, defined as the expected mean of the label, and a central prediction, of a more complex form. We show that, similarly to the label, the central prediction can be interpreted as the mean of a random variable, where the mean operates in a dual space defined by the loss function itself. Viewing the bias-variance tradeoff through operations taken in dual space, we subsequently derive several results of interest. In particular, (a) the variance terms satisfy a generalized law of total variance; (b) if a source of randomness cannot be controlled, its contribution to the bias and variance has a closed form; (c) there exist natural ensembling operations in the label and prediction spaces which reduce the variance and do not affect the bias.  ( 2 min )
    Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models. (arXiv:2202.04593v1 [cs.LG])
    We consider the regret minimization task in a dueling bandits problem with context information. In every round of the sequential decision problem, the learner makes a context-dependent selection of two choice alternatives (arms) to be compared with each other and receives feedback in the form of noisy preference information. We assume that the feedback process is determined by a linear stochastic transitivity model with contextualized utilities (CoLST), and the learner's task is to include the best arm (with highest latent context-dependent utility) in the duel. We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process using perturbed context-dependent utility estimates of the underlying CoLST model. If each arm is associated with a $d$-dimensional feature vector, we show that $\texttt{CoLSTIM}$ achieves a regret of order $\tilde O( \sqrt{dT})$ after $T$ learning rounds. Additionally, we also establish the optimality of $\texttt{CoLSTIM}$ by showing a lower bound for the weak regret that refines the existing average regret analysis. Our experiments demonstrate its superiority over state-of-art algorithms for special cases of CoLST models.  ( 2 min )
    A Data-Driven Approach to Robust Hypothesis Testing Using Sinkhorn Uncertainty Sets. (arXiv:2202.04258v1 [stat.ML])
    This paper is eligible for the Jack Keil Wolf ISIT Student Paper Award. Hypothesis testing for small-sample scenarios is a practically important problem. In this paper, we investigate the robust hypothesis testing problem in a data-driven manner, where we seek the worst-case detector over distributional uncertainty sets centered around the empirical distribution from samples using Sinkhorn distance. Compared with the Wasserstein robust test, the corresponding least favorable distributions are supported beyond the training samples, which provides a more flexible detector. Various numerical experiments are conducted on both synthetic and real datasets to validate the competitive performances of our proposed method.  ( 2 min )
    Covariate-informed Representation Learning with Samplewise Optimal Identifiable Variational Autoencoders. (arXiv:2202.04206v1 [stat.ML])
    Recently proposed identifiable variational autoencoder (iVAE, Khemakhem et al. (2020)) framework provides a promising approach for learning latent independent components of the data. Although the identifiability is appealing, the objective function of iVAE does not enforce the inverse relation between encoders and decoders. Without the inverse relation, representations from the encoder in iVAE may not reconstruct observations,i.e., representations lose information in observations. To overcome this limitation, we develop a new approach, covariate-informed identifiable VAE (CI-iVAE). Different from previous iVAE implementations, our method critically leverages the posterior distribution of latent variables conditioned only on observations. In doing so, the objective function enforces the inverse relation, and learned representation contains more information of observations. Furthermore, CI-iVAE extends the original iVAE objective function to a larger class and finds the optimal one among them, thus providing a better fit to the data. Theoretically, our method has tighter evidence lower bounds (ELBOs) than the original iVAE. We demonstrate that our approach can more reliably learn features of various synthetic datasets, two benchmark image datasets (EMNIST and Fashion MNIST), and a large-scale brain imaging dataset for adolescent mental health research.  ( 2 min )
    A Neural Phillips Curve and a Deep Output Gap. (arXiv:2202.04146v1 [econ.EM])
    Many problems plague the estimation of Phillips curves. Among them is the hurdle that the two key components, inflation expectations and the output gap, are both unobserved. Traditional remedies include creating reasonable proxies for the notable absentees or extracting them via some form of assumptions-heavy filtering procedure. I propose an alternative route: a Hemisphere Neural Network (HNN) whose peculiar architecture yields a final layer where components can be interpreted as latent states within a Neural Phillips Curve. There are benefits. First, HNN conducts the supervised estimation of nonlinearities that arise when translating a high-dimensional set of observed regressors into latent states. Second, computations are fast. Third, forecasts are economically interpretable. Fourth, inflation volatility can also be predicted by merely adding a hemisphere to the model. Among other findings, the contribution of real activity to inflation appears severely underestimated in traditional econometric specifications. Also, HNN captures out-of-sample the 2021 upswing in inflation and attributes it first to an abrupt and sizable disanchoring of the expectations component, followed by a wildly positive gap starting from late 2020. HNN's gap unique path comes from dispensing with unemployment and GDP in favor of an amalgam of nonlinearly processed alternative tightness indicators -- some of which are skyrocketing as of early 2022.  ( 2 min )
    Optimal learning rate schedules in high-dimensional non-convex optimization problems. (arXiv:2202.04509v1 [cs.LG])
    Learning rate schedules are ubiquitously used to speed up and improve optimisation. Many different policies have been introduced on an empirical basis, and theoretical analyses have been developed for convex settings. However, in many realistic problems the loss-landscape is high-dimensional and non convex -- a case for which results are scarce. In this paper we present a first analytical study of the role of learning rate scheduling in this setting, focusing on Langevin optimization with a learning rate decaying as $\eta(t)=t^{-\beta}$. We begin by considering models where the loss is a Gaussian random function on the $N$-dimensional sphere ($N\rightarrow \infty$), featuring an extensive number of critical points. We find that to speed up optimization without getting stuck in saddles, one must choose a decay rate $\beta<1$, contrary to convex setups where $\beta=1$ is generally optimal. We then add to the problem a signal to be recovered. In this setting, the dynamics decompose into two phases: an \emph{exploration} phase where the dynamics navigates through rough parts of the landscape, followed by a \emph{convergence} phase where the signal is detected and the dynamics enter a convex basin. In this case, it is optimal to keep a large learning rate during the exploration phase to escape the non-convex region as quickly as possible, then use the convex criterion $\beta=1$ to converge rapidly to the solution. Finally, we demonstrate that our conclusions hold in a common regression task involving neural networks.  ( 2 min )
    Offline Reinforcement Learning with Realizability and Single-policy Concentrability. (arXiv:2202.04634v1 [cs.LG])
    Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.  ( 2 min )
    Hierarchical Dependency Constrained Tree Augmented Naive Bayes Classifiers for Hierarchical Feature Spaces. (arXiv:2202.04105v1 [cs.LG])
    The Tree Augmented Naive Bayes (TAN) classifier is a type of probabilistic graphical model that constructs a single-parent dependency tree to estimate the distribution of the data. In this work, we propose two novel Hierarchical dependency-based Tree Augmented Naive Bayes algorithms, i.e. Hie-TAN and Hie-TAN-Lite. Both methods exploit the pre-defined parent-child (generalisation-specialisation) relationships between features as a type of constraint to learn the tree representation of dependencies among features, whilst the latter further eliminates the hierarchical redundancy during the classifier learning stage. The experimental results showed that Hie-TAN successfully obtained better predictive performance than several other hierarchical dependency constrained classification algorithms, and its predictive performance was further improved by eliminating the hierarchical redundancy, as suggested by the higher accuracy obtained by Hie-TAN-Lite.  ( 2 min )
    Generative multitask learning mitigates target-causing confounding. (arXiv:2202.04136v1 [cs.LG])
    We propose a simple and scalable approach to causal representation learning for multitask learning. Our approach requires minimal modification to existing ML systems, and improves robustness to prior probability shift. The improvement comes from mitigating unobserved confounders that cause the targets, but not the input. We refer to them as target-causing confounders. These confounders induce spurious dependencies between the input and targets. This poses a problem for the conventional approach to multitask learning, due to its assumption that the targets are conditionally independent given the input. Our proposed approach takes into account the dependency between the targets in order to alleviate target-causing confounding. All that is required in addition to usual practice is to estimate the joint distribution of the targets to switch from discriminative to generative classification, and to predict all targets jointly. Our results on the Attributes of People and Taskonomy datasets reflect the conceptual improvement in robustness to prior probability shift.  ( 2 min )
    Turnpike in optimal control of PDEs, ResNets, and beyond. (arXiv:2202.04097v1 [math.OC])
    The \emph{turnpike property} in contemporary macroeconomics asserts that if an economic planner seeks to move an economy from one level of capital to another, then the most efficient path, as long as the planner has enough time, is to rapidly move stock to a level close to the optimal stationary or constant path, then allow for capital to develop along that path until the desired term is nearly reached, at which point the stock ought to be moved to the final target. Motivated in part by its nature as a resource allocation strategy, over the past decade, the turnpike property has also been shown to hold for several classes of partial differential equations arising in mechanics. When formalized mathematically, the turnpike theory corroborates the insights from economics: for an optimal control problem set in a finite-time horizon, optimal controls and corresponding states, are close (often exponentially), during most of the time, except near the initial and final time, to the optimal control and corresponding state for the associated stationary optimal control problem. In particular, the former are mostly constant over time. This fact provides a rigorous meaning to the asymptotic simplification that some optimal control problems appear to enjoy over long time intervals, allowing the consideration of the corresponding stationary problem for computing and applications. We review a slice of the theory developed over the past decade --the controllability of the underlying system is an important ingredient, and can even be used to devise simple turnpike-like strategies which are nearly optimal--, and present several novel applications, including, among many others, the characterization of Hamilton-Jacobi-Bellman asymptotics, and stability estimates in deep learning via residual neural networks.  ( 2 min )
    PGMax: Factor Graphs for Discrete Probabilistic Graphical Models and Loopy Belief Propagation in JAX. (arXiv:2202.04110v1 [cs.LG])
    PGMax is an open-source Python package for easy specification of discrete Probabilistic Graphical Models (PGMs) as factor graphs, and automatic derivation of efficient and scalable loopy belief propagation (LBP) implementation in JAX. It supports general factor graphs, and can effectively leverage modern accelerators like GPUs for inference. Compared with existing alternatives, PGMax obtains higher-quality inference results with orders-of-magnitude inference speedups. PGMax additionally interacts seamlessly with the rapidly growing JAX ecosystem, opening up exciting new possibilities. Our source code, examples and documentation are available at https://github.com/vicariousinc/PGMax.  ( 2 min )

  • Open

    Action Saturation - Pixel Observations [D]
    I'm working on a new project that requires me to use pixels as an observation (never used pixel observations before-so please feel free to suggest even things that seem obvious). It's a custom environment, where I am trying to control the size of a specific feature in my image in a highly non-linear environment. My initial implementation wasn't working so I've taken the non-linearity out of it and made a similar toy problem. At each time step, I'm now initializing a black image with a single white stripe with the vertical length set equal to the previous action (scaled-in pixels). The reward is something like -( (length_observed - length_desired)**2 / max_length ), where length_desired is just a random intermediate length. So, basically, I'm trying to train it so that it can consistently choose actions equal to length_desired. I'm seeing my actions saturate basically immediately. My action space is continuous (-1,+1), which I'm scaling to a range of pixels from 25% to 90% of the image height (so a=-1 gives a white bar spanning 25% of the image, a=+1 gives a white bar spanning 90% of the image). I've checked my observations and rewards, and they all make sense in terms of what I expect. I also scale my observations to -1/+1, and it's a mono8 image so only 1 channel. I was using an implementation of TD3 which I've used on several custom environments in the past so I trust it. When those actions were saturating, I read that this may happen with TD3 and CNNs, so I switched to SAC but it hasn't really helped. I'm wondering if it has something to do with how sparse the images are? Like I said, this is my first time using pixels for any RL application, so I may just be missing something basic. tl;dr: TD3 and SAC both seem to be showing action saturation in basic pixel environment with majority of pixels set equal to zero. Any suggestions appreciated. submitted by /u/snky_pte [link] [comments]  ( 1 min )
    MADDPG with continuous and discrete actions
    How to implement MADDPG with combined actions, more specifically I dont know how to output discrete actions? In my case I have 3 continous actions and 1 discrete action. If I represent 1 discrete action with Softmax, I will have to sample it but I dont think policy update will work. submitted by /u/TheGuy839 [link] [comments]  ( 1 min )
    How is stable-baslines3 dqn so fast?
    I used sb3 dqn on a simple environment and it ran 100,000 steps in under 2 minutes. I made a vanilla dqn and it would take days to get through 100,000 steps. How is sb3 so fast? How can I optimize my own dqn? submitted by /u/GTHTMeadow [link] [comments]  ( 1 min )
    Altering maximum episode length for custom environment
    Hi, I'm working with the LunarLander environment, and I tried making my own env since I needed to alter the reward system, and I did so by just overriding the 'step' function. But is there any way I can change the maximum episode length as well? I tried looking around, and this is the closest I got: https://www.reddit.com/r/reinforcementlearning/comments/agd6j4/how_to_change_episode_length_and_reward/ But this only works for environments created with gym and not custom environments. I feel like it's impossible submitted by /u/PeterBergmann69420 [link] [comments]  ( 2 min )
    The best resource for learning deep q learning
    Hi Folks, I already know abour RL, I want to know is there any good resource for learning deep q learning? Thanks. submitted by /u/mohfaz [link] [comments]  ( 1 min )
  • Open

    Guiding Frozen Language Models with Learned Soft Prompts
    Posted by Brian Lester, AI Resident and Noah Constant, Senior Staff Software Engineer, Google Research Large pre-trained language models, which are continuing to grow in size, achieve state-of-art results on many natural language processing (NLP) benchmarks. Since the development of GPT and BERT, standard practice has been to fine-tune models on downstream tasks, which involves adjusting every weight in the network (i.e., model tuning). However, as models become larger, storing and serving a tuned copy of the model for each downstream task becomes impractical. An appealing alternative is to share across all downstream tasks a single frozen pre-trained language model, in which all weights are fixed. In an exciting development, GPT-3 showed convincingly that a frozen model can be conditio…  ( 7 min )
  • Open

    Where to apply masters in this field
    I’m doing my B.Tech in ECE from a reputed college in India. I have taken courses of ML and neural networks and I am deeply interested in this field. Could anyone suggest which unis to aim for masters and what would be the requirements to get into the top universities? submitted by /u/Actual-Performer-832 [link] [comments]  ( 1 min )
    Computer Scientists Prove Why Bigger Neural Networks Do Better
    submitted by /u/nickb [link] [comments]  ( 1 min )
  • Open

    Reduce costs and complexity of ML preprocessing with Amazon S3 Object Lambda
    Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Often, customers have objects in S3 buckets that need further processing to be used effectively by consuming applications. Data engineers must support these application-specific data views with trade-offs between persisting derived copies or transforming data […]  ( 6 min )
  • Open

    [P] Training synthetic models on highly complex datasets.
    submitted by /u/Repeat-or [link] [comments]
    Allen Institute For AI Researchers Propose GPV-2: A Webly Supervised Concept Expansion For General Purpose Vision Models
    While much of the work in computer vision has been focused on developing task-specific models, there has recently been a drive to develop vision systems that are more generic in nature. GPVs (General purpose vision) or general-purpose vision systems, unlike specialized models, attempt to facilitate learning a wide range of tasks natively, generalize learned skills and concepts to novel skill-concept combinations and learn new tasks quickly. GPVs are currently trained and evaluated on datasets that are heavily supervised, such as COCO and VISUALGENOME, and expose models to a variety of skill-concept combinations. Learning localization from COCO, for example, exposes the models to 80 ideas in that skill. Expanding the model’s conceptual vocabulary to include new ideas in this paradigm necessitates collecting fully-supervised task data for each of those concepts. Continue Reading The code to download the Web10K dataset is on Github. Project: https://prior.allenai.org/projects/gpv2 Paper: https://arxiv.org/pdf/2202.02317.pdf Github: https://github.com/allenai/gpv2 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Diabetes management technology and Artificial Intelligence
    Globally, diabetes is becoming one of the most prominent diseases in healthcare. We are entering a crisis where the number of individuals developing diabetes far exceeds the number of expert care available. Now, with the pandemic, hospitals are also overwhelmed by the intake of COVID patients. Doctors and nurses are being overworked and must make sacrifices in which patients to treat first. The importance of improving our clinical settings to better treat patients is now at the top of everyone's mind. Diabetes can be a difficult disease to manage for patients, and with the shortage of physicians a platform to provide assistance to both physicians and patients is in the works. Abbott is a large medical device company that has been in the market for developing diabetes tech for many years …  ( 2 min )
    How Does A.I. Intersect with the Future of the Brain-Computer Interface?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    AES Chooses Fluence Market Optimization Software to Maximize Solar Ventures
    submitted by /u/RomondaVargo [link] [comments]
    Hedge Portfolio Evaluation with Monte Carlo
    submitted by /u/promach [link] [comments]
    Increasing use of Emotion AI/ Affective computing in various industries
    Video calling has become an important aspect of communication and sometimes we tend to miss how the other person is responding. Do you know that it is possible to interpret a person’s expressions? Yes, with EmotionAI, you can find out how the individual is reacting. I have written some blogs covering the industry use cases where AI is making this possible and the benefits businesses are reaping. https://www.enablex.io/insights/affective-computing-how-does-this-work/ submitted by /u/JasonWills343 [link] [comments]  ( 1 min )
    Instagram chatbots. What you can do there with automation and NLP
    As you probably heard, Facebook has launched Messenger API for Instagram. That means different possibilities for automation, like Instagram #chatbots. I asked several developers from different companies on what can be possible done with Instagram chatbots. Here's one of the cool features. ⚡️You can create automated replies to your brand's story mentions. So let's say someone posted a Story and tagged u/your_company. Now, you can send an automated direct message like: "Hey u/user_name, thanks for mentioning us in your Story! Do you need any assistance?" Or you can do some cool marketing tricks to increase brand loyalty by offering a discount for a story mention of your brand, saying, "Thanks for the mention u/user_name! We want to offer you a 10% discount on our products! Use code INSTADISCOUNT10 to save 10% on your next order!" Pretty cool, right? 😈 Here is the link to the whole guide on Instagram possibilities ⬇️ https://botscrew.com/blog/instagram-chatbot/?utm\_source=Reddit&utm\_medium=Atificial Intelligence&utm_campaign=Instagran chatbot&utm_term=&utm_content= submitted by /u/Avandegraund [link] [comments]  ( 1 min )
    Traditional Leadership Style Versus Modern Day Business Approach
    ​ https://preview.redd.it/gxbr0mfljzg81.jpg?width=860&format=pjpg&auto=webp&s=668ef41ad16fa439ca3268e28002db5b68da25d0 One of the excellent contributors in the field of leadership, Steve Jobs had a unique style of bringing out the best in people. He used to challenge his team by letting them try their best to convince him about their great ideas that eventually led them to give positive results. This signifies him as a visionary as well as a great business leader. But what makes Jobs an example of an ideal leader. Let’s take a tour of the history to understand the traditional leadership style versus the modern-day business approach. Where do Traditional Leadership and new Leadership Qualities collide? For the last 100 years, humanity has adapted technological, social, and economic refo…  ( 1 min )
    I made a video where i used Deep Fake techniques to make the world a better place, what do you think?
    submitted by /u/nonchelange [link] [comments]  ( 1 min )
    [P] What we learned by accelerating by 5X Hugging Face generative language models
    submitted by /u/pommedeterresautee [link] [comments]  ( 3 min )
    Best resources to learn data science books, courses
    submitted by /u/sivasiriyapureddy [link] [comments]
    Google Introduces ‘PipelineDP’: A New Differential Privacy Framework For Python Developers To Process Data
    Google unveiled a new milestone. a differential privacy framework, along with OpenMined that lets any Python developer handle data with differential privacy. The two have been working on the project for a year, and according to Google, the open-source privacy infrastructure will help millions of people around the world “develop and launch innovative, differentiated privacy applications that may deliver beneficial insights and services without disclosing any personal data.” Continue Reading Github: https://github.com/OpenMined/PipelineDP https://preview.redd.it/qrj8f66smxg81.png?width=356&format=png&auto=webp&s=2e5f0714405273d9bff13a3012620ddc0343a38f submitted by /u/ai-lover [link] [comments]  ( 1 min )
  • Open

    [D] IMLR ICLR withdraw/reject rates
    I am on openreview and I see some faculty at my department have many many paper withdrawals from these conferences. Maybe half the submissions were withdrawals. What’s going on, what does this mean? Is this person still able to list it as a work submitted to the conference? Also is this common, even among top researchers in the field? Edit: I’m still a student, this is all unfamiliar to me submitted by /u/A_N_Kolmogorov [link] [comments]  ( 1 min )
    FAST-RIR: Fast neural diffuse room impulse response generator
    submitted by /u/Snoo63916 [link] [comments]
    [Discussion] Is RETRO inference faster than GPT-3?
    The claim in the RETRO paper is that it has 25x fewer parameters. GPT-3: 175B parameters RETRO: 7.5B parameters Does this have inference-time benefits? The paper doesn't talk about any inference-time benefits of this approach, so I suspect neighbors are still looked up at inference time. I'd be interested to know what the runtime penalty is of the nearest-neighbors look up, and how many inferences per second RETRO can do. submitted by /u/ndronen [link] [comments]  ( 1 min )
    [P] Aim 3.5 - use the super-powers of Aim on your TensorBoard logs
    Hi r/MachineLearning, Gev here, author of Aim. blog: https://bit.ly/3GFmDCk code: https://github.com/aimhubio/aim Aim 3.5 is now available! Here are the latest additions: Tensorboard logs support Use Aim's explorers to quickly explore your TensorBoard logs! https://preview.redd.it/r58spljuh1h81.jpg?width=1600&format=pjpg&auto=webp&s=d78f88c035d9fb36263ad4da916d28776d6bfe39 Matplotlib figure tracking Track your Matplotlib figures and observe on the Run Details page. You can also track them as images and compare, search on the Images Explorer page. Track system params, CLI, Env, Git, Installed packages Enable Aim to automatically track your environment variables as params and use them to query, compare your experiments. Huge reproducibility booster! ​ Looking forward to hearing what you are building with Aim. We are shipping bi-weekly. Aim is rapidly growing. If you find Aim useful, drop us a star.🌟 Join the Aim community to connect with the authors and maintainers! submitted by /u/sgevorg [link] [comments]  ( 1 min )
    [P] Mercury: Publish Jupyter Notebook as web app by adding YAML header (similar to R Markdown)
    I would like to share with you an open-source project that I was working on for the last three months. It is an open-source framework for converting Jupyter Notebook to web app by adding YAML header (similar to R Markdown). Mercury is a perfect tool to share your Python notebooks with non-programmers. You can turn your notebook into web app. Sharing is as easy as sending them the URL to your server. You can add interactive input to your notebook by defining the YAML header. Your users can change the input and execute the notebook. You can hide your code to not scare your (non-coding) collaborators. Users can interact with notebook and save they results. You can share notebook as a web app with multiple users - they don't overwrite original notebook. The Mercury is open-source with code on GitHub https://github.com/mljar/mercury submitted by /u/pp314159 [link] [comments]  ( 1 min )
    [P]Missing Values in Binary Multi-factor Classifier
    Despite sleuthing Google for some advice, I've come up pretty dry for methods outside of imputation and what this source paper did (deepNN for prediction of Toxicity). The data has ~10k rows with ~2000 complete variables but 12 incomplete binary classifiers. Only 20% of rows have values for all classifiers and around 30% have a value for only 1 classifier. As I understand it, the source paper handled the missing classifier values by essentially learning "generating error" on only known values. Here is the error function and their snippit from a different version of the same paper (Section 2.2.3). ​ https://preview.redd.it/ku6fcuuyx0h81.png?width=1097&format=png&auto=webp&s=76c25e88088428d8cab9cec72d16efc91f349e81 Would imputation be a simpler and viable option to try here? Could using the mean of each classifier work for most ML models? Unless I'm mistaken, I would need to write a custom scoring function for "off the shelf" models to mimic the source paper (not sure I need a DNN). Any advise would be appreciated. submitted by /u/Zabadoo222 [link] [comments]  ( 1 min )
    [D] Injecting domain knowledge in textual form for trajectory prediction
    Injection of domain knowledge can be a good substitute for cases where data is not abundant or unavailable. I am looking for some pointers where the domain knowledge of surrounding in 'Textual written form' is/can be integrated in trajectory prediction. Idea is to use this domain external knowledge(text form ) over the image scene for better trajectory prediction, as based on rules some trajectory can be prioritized than others. One way I can think is to use knowledge graph but they are very restricted, as construction of knowledge graph over the scene at inference time is difficult. Other idea I can think of is to use Large language models(LLM) over text rules, but not sure what to do after generation of embeddings of textual rules using LLM in trajectory prediction. Suggestions and SOTA papers related to text rules as a domain knowledge in trajectory prediction are appreciated. submitted by /u/projekt_treadstone [link] [comments]  ( 1 min )
    [P] ML over Encrypted Data
    Hi everyone, we have developed a library that applies numpy functions over encrypted data (using homomorphic encryption). The repo is available in open source at https://github.com/zama-ai/concrete-numpy We are applying this to many popular machine learning algorithms/libraries such as sklearn, statsmodel, xgboost, lightgbm or pytorch and plan to release this as a new library (you can find some early examples here). Any feedback/question is much welcome ! submitted by /u/strojax [link] [comments]  ( 1 min )
    [D] how does everyone evaluate their clusters?
    Looking for some ways everyone evaluate their clustering. I did some research and came across a barrage of statistical techniques and methods but I’m not sure if there is some industry standard way? Research papers seem to vary a lot as well in measuring them. Would love to know how others are approaching this! Particularly in the NLP embedding space but also in general. submitted by /u/ConfectionSafe954 [link] [comments]  ( 1 min )
    [D] When a model over fits, how do you know if it’s because you need more data, or if the network you have is too “powerful” (too many layers, not enough dropout etc).
    Basically in my case, I’m training a network, and for the sake of argument, it starts to overfit on the train data, once the train loss hits around 0.05 Regardless of what changes I add to my base architecture, such as introducing more dropout, or reducing the number of layers / embedding sizes, the network consistently over fits once the the training loss arrives in the 0.05 range. The validation loss also drops consistently with the train loss until this point of 0.05, upon reaching that point, it hits a lowest loss value, and then starts rising until it levels off at around 0.07. While the validation loss levels, the training loss continues to drop and overfits. I’m not sure how to approach this problem. Other than introduce more data in the training set, I can’t think of anything to do to improve it further. submitted by /u/DaBeastGeek [link] [comments]  ( 2 min )
    [R] [D] Where do you guys go to get the latest and greatest papers in your sub fields
    With so many conferences in AI + so many papers accepted in each conference, what do you guys use to keep up? submitted by /u/abhasatin [link] [comments]  ( 1 min )
    [R] EleutherAI releases weights for GPT-NeoX 20B and a tech report
    Tech report: http://eaidata.bmk.sh/data/GPT_NeoX_20B.pdf GitHub Repo: https://github.com/EleutherAI/gpt-neox Slim Weights: https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ Full Weights: https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/full_weights/ Twitter announcement: https://twitter.com/BlancheMinerva/status/1491621024676392960?s=20&t=FlRGryrT34NJUz_WpCB4DQ edit: When I posted this thread, I did not have performance testing numbers on anything smaller than 48 A100s 😂. After speaking to some people who have deployed the model on more reasonable hardware, it appears that the most cost-effective approach is to use an A6000. On an A6000, with a prompt of 1395 tokens, generating a further 653 tokens takes just under 60 seconds. VRAM usage tops out just over 43GiB. With a pair of 3090s you can get better throughput, but a pair of 3090s is more expensive both as a piece of hardware and in terms of dollars per token generated on most cloud services. submitted by /u/StellaAthena [link] [comments]  ( 2 min )
    [D] Informal "Test of Time" papers discussion (5+ years old)
    Hey all! I was scanning through my library recently and stumbled on some papers from way back. Leafing through, I found some papers where, at the time, I thought, "Wow, this is going to change everything," and some did while more didn't. What papers, in your opinion, did stand that test? Of course, your answer might depend on your specific area of interest, so do include that as well. Some of my picks: - Dropout: A Simple Way to Prevent Neural Networks from Overfitting Continuous control with deep reinforcement learning Deep Residual Learning for Image Recognition Some PPML papers (maybe not as test-of-time-y as the above): CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy Privacy Preserving Multi-party Machine Learning with Homomorphic Encryption Communication-Efficient Learning of Deep Networks from Decentralized Data submitted by /u/iamquah [link] [comments]  ( 1 min )
    [D] ML in videogame use cases
    Hello everyone! I got a question about ML and videogames: in what ways is ML used In the gaming industry? For example, EA sports released a paper for a DL model to simulate cloth deformations. I also know that game testing may be done with AI. What are other popular use of ML in videogames? Other than analysing data at user level, is possible to do something for the “artistic” component? Creating a complex behaviour for a NPG? Let me know! submitted by /u/Financial-Option9496 [link] [comments]  ( 1 min )
  • Open

    DSC Weekly Digest 08 Feb 2022: All That eGlitters Is Not Gold
    It ended with a sock puppet. The year was 1999, and the dot-com era was in full swing. Every company was not just an Internet company, but an e-Commerce company, and venture capitalists and software developers alike retired to their bed chambers at night thinking about how, in the next morning, or maybe the one… Read More »DSC Weekly Digest 08 Feb 2022: All That eGlitters Is Not Gold The post DSC Weekly Digest 08 Feb 2022: All That eGlitters Is Not Gold appeared first on Data Science Central.  ( 8 min )
  • Open

    Play PC Games on Your Phone With GeForce NOW This GFN Thursday
    Who says you have to put your play on pause just because you’re not at your PC? This GFN Thursday takes a look at how GeForce NOW makes PC gaming possible on Android and iOS mobile devices to support gamers on the go. This week also comes with sweet in-game rewards for members playing Eternal Read article > The post Play PC Games on Your Phone With GeForce NOW This GFN Thursday appeared first on The Official NVIDIA Blog.  ( 5 min )
  • Open

    Graph-Relational Domain Adaptation. (arXiv:2202.03628v1 [cs.LG])
    Existing domain adaptation methods tend to treat every domain equally and align them all perfectly. Such uniform alignment ignores topological structures among different domains; therefore it may be beneficial for nearby domains, but not necessarily for distant domains. In this work, we relax such uniform alignment by using a domain graph to encode domain adjacency, e.g., a graph of states in the US with each state as a domain and each edge indicating adjacency, thereby allowing domains to align flexibly based on the graph structure. We generalize the existing adversarial learning framework with a novel graph discriminator using encoding-conditioned graph embeddings. Theoretical analysis shows that at equilibrium, our method recovers classic domain adaptation when the graph is a clique, and achieves non-trivial alignment for other types of graphs. Empirical results show that our approach successfully generalizes uniform alignment, naturally incorporates domain information represented by graphs, and improves upon existing domain adaptation methods on both synthetic and real-world datasets. Code will soon be available at https://github.com/Wang-ML-Lab/GRDA.  ( 2 min )
    Data driven design of optical resonators. (arXiv:2202.03578v1 [cs.LG])
    Optical devices lie at the heart of most of the technology we see around us. When one actually wants to make such an optical device, one can predict its optical behavior using computational simulations of Maxwell's equations. If one then asks what the optimal design would be in order to obtain a certain optical behavior, the only way to go further would be to try out all of the possible designs and compute the electromagnetic spectrum they produce. When there are many design parameters, this brute force approach quickly becomes too computationally expensive. We therefore need other methods to create optimal optical devices. An alternative to the brute force approach is inverse design. In this paradigm, one starts from the desired optical response of a material and then determines the design parameters that are needed to obtain this optical response. There are many algorithms known in the literature that implement this inverse design. Some of the best performing, recent approaches are based on Deep Learning. The central idea is to train a neural network to predict the optical response for given design parameters. Since neural networks are completely differentiable, we can compute gradients of the response with respect to the design parameters. We can use these gradients to update the design parameters and get an optical response closer to the one we want. This allows us to obtain an optimal design much faster compared to the brute force approach. In my thesis, I use Deep Learning for the inverse design of the Fabry-P\'erot resonator. This system can be described fully analytically and is therefore ideal to study.  ( 2 min )
    Forecasting: theory and practice. (arXiv:2012.03854v4 [stat.AP] CROSS LISTED)
    Forecasting has always been at the forefront of decision making and planning. The uncertainty that surrounds the future is both exciting and challenging, with individuals and organisations seeking to minimise risks and maximise utilities. The large number of forecasting applications calls for a diverse set of forecasting methods to tackle real-life challenges. This article provides a non-systematic review of the theory and the practice of forecasting. We provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organise, and evaluate forecasts. We then demonstrate how such theoretical concepts are applied in a variety of real-life contexts. We do not claim that this review is an exhaustive list of methods and applications. However, we wish that our encyclopedic presentation will offer a point of reference for the rich work that has been undertaken over the last decades, with some key insights for the future of forecasting theory and practice. Given its encyclopedic nature, the intended mode of reading is non-linear. We offer cross-references to allow the readers to navigate through the various topics. We complement the theoretical concepts and applications covered by large lists of free or open-source software implementations and publicly-available databases.  ( 3 min )
    Context-Aware Discrimination Detection in Job Vacancies using Computational Language Models. (arXiv:2202.03907v1 [cs.LG])
    Discriminatory job vacancies are disapproved worldwide, but remain persistent. Discrimination in job vacancies can be explicit by directly referring to demographic memberships of candidates. More implicit forms of discrimination are also present that may not always be illegal but still influence the diversity of applicants. Explicit written discrimination is still present in numerous job vacancies, as was recently observed in the Netherlands. Current efforts for the detection of explicit discrimination concern the identification of job vacancies containing potentially discriminating terms such as "young" or "male". However, automatic detection is inefficient due to low precision: e.g. "we are a young company" or "working with mostly male patients" are phrases that contain explicit terms, while the context shows that these do not reflect discriminatory content. In this paper, we show how machine learning based computational language models can raise precision in the detection of explicit discrimination by identifying when the potentially discriminating terms are used in a discriminatory context. We focus on gender discrimination, which indeed suffers from low precision when filtering explicit terms. First, we created a data set for gender discrimination in job vacancies. Second, we investigated a variety of computational language models for discriminatory context detection. Third, we evaluated the capability of these models to detect unforeseen discriminating terms in context. The results show that machine learning based methods can detect explicit gender discrimination with high precision and help in finding new forms of discrimination. Accordingly, the proposed methods can substantially increase the effectiveness of detecting job vacancies which are highly suspected to be discriminatory. In turn, this may lower the discrimination experienced at the start of the recruitment process.  ( 2 min )
    Robust Hybrid Learning With Expert Augmentation. (arXiv:2202.03881v1 [cs.LG])
    Hybrid modelling reduces the misspecification of expert models by combining them with machine learning (ML) components learned from data. Like for many ML algorithms, hybrid model performance guarantees are limited to the training distribution. Leveraging the insight that the expert model is usually valid even outside the training domain, we overcome this limitation by introducing a hybrid data augmentation strategy termed \textit{expert augmentation}. Based on a probabilistic formalization of hybrid modelling, we show why expert augmentation improves generalization. Finally, we validate the practical benefits of augmented hybrid models on a set of controlled experiments, modelling dynamical systems described by ordinary and partial differential equations.  ( 2 min )
    Sparse-RS: a versatile framework for query-efficient sparse black-box adversarial attacks. (arXiv:2006.12834v3 [cs.LG] UPDATED)
    We propose a versatile framework based on random search, Sparse-RS, for score-based sparse targeted and untargeted attacks in the black-box setting. Sparse-RS does not rely on substitute models and achieves state-of-the-art success rate and query efficiency for multiple sparse attack models: $l_0$-bounded perturbations, adversarial patches, and adversarial frames. The $l_0$-version of untargeted Sparse-RS outperforms all black-box and even all white-box attacks for different models on MNIST, CIFAR-10, and ImageNet. Moreover, our untargeted Sparse-RS achieves very high success rates even for the challenging settings of $20\times20$ adversarial patches and $2$-pixel wide adversarial frames for $224\times224$ images. Finally, we show that Sparse-RS can be applied to generate targeted universal adversarial patches where it significantly outperforms the existing approaches. The code of our framework is available at https://github.com/fra31/sparse-rs.  ( 2 min )
    Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics. (arXiv:2106.00734v2 [cs.LG] UPDATED)
    To understand better good generalization performance in state-of-the-art neural network (NN) models, and in particular the success of the ALPHAHAT metric based on Heavy-Tailed Self-Regularization (HT-SR) theory, we analyze of a corpus of models that was made publicly-available for a contest to predict the generalization accuracy of NNs. These models include a wide range of qualities and were trained with a range of architectures and regularization hyperparameters. We break ALPHAHAT into its two subcomponent metrics: a scale-based metric; and a shape-based metric. We identify what amounts to a Simpson's paradox: where "scale" metrics (from traditional statistical learning theory) perform well in aggregate, but can perform poorly on subpartitions of the data of a given depth, when regularization hyperparameters are varied; and where "shape" metrics (from HT-SR theory) perform well on each subpartition of the data, when hyperparameters are varied for models of a given depth, but can perform poorly overall when models with varying depths are aggregated. Our results highlight the subtlety of comparing models when both architectures and hyperparameters are varied; the complementary role of implicit scale versus implicit shape parameters in understanding NN model quality; and the need to go beyond one-size-fits-all metrics based on upper bounds from generalization theory to describe the performance of NN models. Our results also clarify further why the ALPHAHAT metric from HT-SR theory works so well at predicting generalization across a broad range of CV and NLP models.  ( 2 min )
    Practical Challenges in Differentially-Private Federated Survival Analysis of Medical Data. (arXiv:2202.03758v1 [cs.LG])
    Survival analysis or time-to-event analysis aims to model and predict the time it takes for an event of interest to happen in a population or an individual. In the medical context this event might be the time of dying, metastasis, recurrence of cancer, etc. Recently, the use of neural networks that are specifically designed for survival analysis has become more popular and an attractive alternative to more traditional methods. In this paper, we take advantage of the inherent properties of neural networks to federate the process of training of these models. This is crucial in the medical domain since data is scarce and collaboration of multiple health centers is essential to make a conclusive decision about the properties of a treatment or a disease. To ensure the privacy of the datasets, it is common to utilize differential privacy on top of federated learning. Differential privacy acts by introducing random noise to different stages of training, thus making it harder for an adversary to extract details about the data. However, in the realistic setting of small medical datasets and only a few data centers, this noise makes it harder for the models to converge. To address this problem, we propose DPFed-post which adds a post-processing stage to the private federated learning scheme. This extra step helps to regulate the magnitude of the noisy average parameter update and easier convergence of the model. For our experiments, we choose 3 real-world datasets in the realistic setting when each health center has only a few hundred records, and we show that DPFed-post successfully increases the performance of the models by an average of up to $17\%$ compared to the standard differentially private federated learning scheme.  ( 2 min )
    A Variational Edge Partition Model for Supervised Graph Representation Learning. (arXiv:2202.03233v2 [stat.ML] UPDATED)
    Graph neural networks (GNNs), which propagate the node features through the edges and learn how to transform the aggregated features under label supervision, have achieved great success in supervised feature extraction for both node-level and graph-level classification tasks. However, GNNs typically treat the graph structure as given and ignore how the edges are formed. This paper introduces a graph generative process to model how the observed edges are generated by aggregating the node interactions over a set of overlapping node communities, each of which contributes to the edges via a logical OR mechanism. Based on this generative model, we partition each edge into the summation of multiple community-specific weighted edges and use them to define community-specific GNNs. A variational inference framework is proposed to jointly learn a GNN based inference network that partitions the edges into different communities, these community-specific GNNs, and a GNN based predictor that combines community-specific GNNs for the end classification task. Extensive evaluations on real-world graph datasets have verified the effectiveness of the proposed method in learning discriminative representations for both node-level and graph-level classification tasks.  ( 2 min )
    Online Optimization with Untrusted Predictions. (arXiv:2202.03519v1 [cs.LG])
    We examine the problem of online optimization, where a decision maker must sequentially choose points in a general metric space to minimize the sum of per-round, non-convex hitting costs and the costs of switching decisions between rounds. The decision maker has access to a black-box oracle, such as a machine learning model, that provides untrusted and potentially inaccurate predictions of the optimal decision in each round. The goal of the decision maker is to exploit the predictions if they are accurate, while guaranteeing performance that is not much worse than the hindsight optimal sequence of decisions, even when predictions are inaccurate. We impose the standard assumption that hitting costs are globally $\alpha$-polyhedral. We propose a novel algorithm, Adaptive Online Switching (AOS), and prove that, for any desired $\delta > 0$, it is $(1+2\delta)$-competitive if predictions are perfect, while also maintaining a uniformly bounded competitive ratio of $2^{\tilde{\mathcal{O}}(1/(\alpha \delta))}$ even when predictions are adversarial. Further, we prove that this trade-off is necessary and nearly optimal in the sense that any deterministic algorithm which is $(1+\delta)$-competitive if predictions are perfect must be at least $2^{\tilde{\Omega}(1/(\alpha \delta))}$-competitive when predictions are inaccurate.  ( 2 min )
    Backdoor Detection in Reinforcement Learning. (arXiv:2202.03609v1 [cs.LG])
    While the real world application of reinforcement learning (RL) is becoming popular, the safety concern and the robustness of an RL system require more attention. A recent work reveals that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. We propose the problem of RL Backdoor Detection, aiming to address this safety vulnerability. An interesting observation we drew from extensive empirical studies is a trigger smoothness property where normal actions similar to the backdoor trigger actions can also trigger low performance of the trojan agent. Inspired by this observation, we propose a reinforcement learning solution TrojanSeeker to find approximate trigger actions for the trojan agents, and further propose an efficient approach to mitigate the trojan agents based on machine unlearning. Experiments show that our approach can correctly distinguish and mitigate all the trojan agents across various types of agents and environments.  ( 2 min )
    Towards More Generalizable One-shot Visual Imitation Learning. (arXiv:2110.13423v2 [cs.RO] UPDATED)
    A general-purpose robot should be able to master a wide range of tasks and quickly learn a novel one by leveraging past experiences. One-shot imitation learning (OSIL) approaches this goal by training an agent with (pairs of) expert demonstrations, such that at test time, it can directly execute a new task from just one demonstration. However, so far this framework has been limited to training on many variations of one task, and testing on other unseen but similar variations of the same task. In this work, we push for a higher level of generalization ability by investigating a more ambitious multi-task setup. We introduce a diverse suite of vision-based robot manipulation tasks, consisting of 7 tasks, a total of 61 variations, and a continuum of instances within each variation. For consistency and comparison purposes, we first train and evaluate single-task agents (as done in prior few-shot imitation work). We then study the multi-task setting, where multi-task training is followed by (i) one-shot imitation on variations within the training tasks, (ii) one-shot imitation on new tasks, and (iii) fine-tuning on new tasks. Prior state-of-the-art, while performing well within some single tasks, struggles in these harder multi-task settings. To address these limitations, we propose MOSAIC (Multi-task One-Shot Imitation with self-Attention and Contrastive learning), which integrates a self-attention model architecture and a temporal contrastive module to enable better task disambiguation and more robust representation learning. Our experiments show that MOSAIC outperforms prior state of the art in learning efficiency, final performance, and learns a multi-task policy with promising generalization ability via fine-tuning on novel tasks.  ( 2 min )
    OpenFWI: Benchmark Seismic Datasets for Machine Learning-Based Full Waveform Inversion. (arXiv:2111.02926v2 [cs.LG] UPDATED)
    We present OpenFWI, a collection of large-scale open-source benchmark datasets for seismic full waveform inversion (FWI). OpenFWI is the first-of-its-kind in the geoscience and machine learning community to facilitate diversified, rigorous, and reproducible research on machine learning-based FWI. OpenFWI includes datasets of multiple scales, encompasses diverse domains, and covers various levels of model complexity. Along with the dataset, we also perform an empirical study on each dataset with a fully-convolutional deep learning model. OpenFWI has been meticulously maintained and will be regularly updated with new data and experimental results. We appreciate the inputs from the community to help us further improve OpenFWI. At the current version, we publish seven datasets in OpenFWI, of which one is specified for 3D FWI and the rest are for 2D scenarios. All datasets and related information can be accessed through our website at https://openfwi-lanl.github.io/.  ( 2 min )
    Rainbow Differential Privacy. (arXiv:2202.03974v1 [cs.CR])
    We extend a previous framework for designing differentially private (DP) mechanisms via randomized graph colorings that was restricted to binary functions, corresponding to colorings in a graph, to multi-valued functions. As before, datasets are nodes in the graph and any two neighboring datasets are connected by an edge. In our setting, we assume each dataset has a preferential ordering for the possible outputs of the mechanism, which we refer to as a rainbow. Different rainbows partition the graph of datasets into different regions. We show that when the DP mechanism is pre-specified at the boundary of such regions, at most one optimal mechanism can exist. Moreover, if the mechanism is to behave identically for all same-rainbow boundary datasets, the problem can be greatly simplified and solved by means of a morphism to a line graph. We then show closed form expressions for the line graph in the case of ternary functions. Treatment of ternary queries in this paper displays enough richness to be extended to higher-dimensional query spaces with preferential query ordering, but the optimality proof does not seem to follow directly from the ternary proof.  ( 2 min )
    SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization. (arXiv:2109.09831v2 [cs.LG] UPDATED)
    Algorithm parameters, in particular hyperparameters of machine learning algorithms, can substantially impact their performance. To support users in determining well-performing hyperparameter configurations for their algorithms, datasets and applications at hand, SMAC3 offers a robust and flexible framework for Bayesian Optimization, which can improve performance within a few evaluations. It offers several facades and pre-sets for typical use cases, such as optimizing hyperparameters, solving low dimensional continuous (artificial) global optimization problems and configuring algorithms to perform well across multiple problem instances. The SMAC3 package is available under a permissive BSD-license at https://github.com/automl/SMAC3.  ( 2 min )
    PrivFair: a Library for Privacy-Preserving Fairness Auditing. (arXiv:2202.04058v1 [cs.LG])
    Machine learning (ML) has become prominent in applications that directly affect people's quality of life, including in healthcare, justice, and finance. ML models have been found to exhibit discrimination based on sensitive attributes such as gender, race, or disability. Assessing if an ML model is free of bias remains challenging to date, and by definition has to be done with sensitive user characteristics that are subject of anti-discrimination and data protection law. Existing libraries for fairness auditing of ML models offer no mechanism to protect the privacy of the audit data. We present PrivFair, a library for privacy-preserving fairness audits of ML models. Through the use of Secure Multiparty Computation (MPC), \textsc{PrivFair} protects the confidentiality of the model under audit and the sensitive data used for the audit, hence it supports scenarios in which a proprietary classifier owned by a company is audited using sensitive audit data from an external investigator. We demonstrate the use of PrivFair for group fairness auditing with tabular data or image data, without requiring the investigator to disclose their data to anyone in an unencrypted manner, or the model owner to reveal their model parameters to anyone in plaintext.  ( 2 min )
    Transferable Student Performance Modeling for Intelligent Tutoring Systems. (arXiv:2202.03980v1 [cs.LG])
    Millions of learners worldwide are now using intelligent tutoring systems (ITSs). At their core, ITSs rely on machine learning algorithms to track each user's changing performance level over time to provide personalized instruction. Crucially, student performance models are trained using interaction sequence data of previous learners to analyse data generated by future learners. This induces a cold-start problem when a new course is introduced for which no training data is available. Here, we consider transfer learning techniques as a way to provide accurate performance predictions for new courses by leveraging log data from existing courses. We study two settings: (i) In the naive transfer setting, we propose course-agnostic performance models that can be applied to any course. (ii) In the inductive transfer setting, we tune pre-trained course-agnostic performance models to new courses using small-scale target course data (e.g., collected during a pilot study). We evaluate the proposed techniques using student interaction sequence data from 5 different mathematics courses containing data from over 47,000 students in a real world large-scale ITS. The course-agnostic models that use additional features provided by human domain experts (e.g, difficulty ratings for questions in the new course) but no student interaction training data for the new course, achieve prediction accuracy on par with standard BKT and PFA models that use training data from thousands of students in the new course. In the inductive setting our transfer learning approach yields more accurate predictions than conventional performance models when only limited student interaction training data (<100 students) is available to both.  ( 2 min )
    Multi-Agent Path Finding with Prioritized Communication Learning. (arXiv:2202.03634v1 [cs.RO])
    Multi-agent path finding (MAPF) has been widely used to solve large-scale real-world problems, e.g. automation warehouse. The learning-based fully decentralized framework has been introduced to simultaneously alleviate real-time problem and pursuit the optimal planning policy. However, existing methods might generate significantly more vertex conflicts (called collision), which lead to low success rate or more makespan. In this paper, we propose a PrIoritized COmmunication learning method (PICO), which incorporates the implicit planning priorities into the communication topology within the decentralized multi-agent reinforcement learning framework. Assembling with the classic coupled planners, the implicit priority learning module can be utilized to form the dynamic communication topology, which also build an effective collision-avoiding mechanism. PICO performs significantly better in large-scale multi-agent path finding tasks in both success rates and collision rates than state-of-the-art learning-based planners.  ( 2 min )
    Speech Emotion Recognition using Self-Supervised Features. (arXiv:2202.03896v1 [cs.SD])
    Self-supervised pre-trained features have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to- End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features and back-end classification networks. The proposed monomodal speechonly based system not only achieves SOTA results, but also brings light to the possibility of powerful and well finetuned self-supervised acoustic features that reach results similar to the results achieved by SOTA multimodal systems using both Speech and Text modalities.  ( 2 min )
    Robust Neural Network Classification via Double Regularization. (arXiv:2112.08102v2 [stat.ML] UPDATED)
    The presence of mislabeled observations in data is a notoriously challenging problem in statistics and machine learning, associated with poor generalization properties for both traditional classifiers and, perhaps even more so, flexible classifiers like neural networks. Here we propose a novel double regularization of the neural network training loss that combines a penalty on the complexity of the classification model and an optimal reweighting of training observations. The combined penalties result in improved generalization properties and strong robustness against overfitting in different settings of mislabeled training data and also against variation in initial parameter values when training. We provide a theoretical justification for our proposed method derived for a simple case of logistic regression. We demonstrate the double regularization model, here denoted by DRFit, for neural net classification of (i) MNIST and (ii) CIFAR-10, in both cases with simulated mislabeling. We also illustrate that DRFit identifies mislabeled data points with very good precision. This provides strong support for DRFit as a practical of-the-shelf classifier, since, without any sacrifice in performance, we get a classifier that simultaneously reduces overfitting against mislabeling and gives an accurate measure of the trustworthiness of the labels.
    G-Elo: Generalization of the Elo algorithm by modelling the discretized margin of victory. (arXiv:2010.11187v3 [stat.ME] UPDATED)
    In this work we develop a new algorithm for rating of teams (or players) in one-on-one games by exploiting the observed difference of the game-points (such as goals), also known as a margin of victory (MOV). Our objective is to obtain the Elo-style algorithm whose operation is simple to implement and to understand intuitively. This is done in three steps: first, we define the probabilistic model between the teams' skills and the discretized MOV variable: this generalizes the model underpinning the Elo algorithm, where the MOV variable is discretized into three categories (win/loss/draw). Second, with the formal probabilistic model at hand, the optimization required by the maximum likelihood rule is implemented via stochastic gradient; this yields simple on-line equations for the rating updates which are identical in their general form to those characteristic of the Elo algorithm: the main difference lies in the way the scores and the expected scores are defined. Third, we propose a simple method to estimate the coefficients of the model, and thus define the operation of the algorithm; it is done in a closed form using the historical data so the algorithm is tailored to the sport of interest and the coefficients defining its operation are determined in entirely transparent manner. The alternative, optimization-based strategy to find the coefficients is also presented. We show numerical examples based on the results of the association football of the English Premier League and the American football of the National Football League.
    Asymptotics of representation learning in finite Bayesian neural networks. (arXiv:2106.00651v5 [cs.LG] UPDATED)
    Recent works have suggested that finite Bayesian neural networks may sometimes outperform their infinite cousins because finite networks can flexibly adapt their internal representations. However, our theoretical understanding of how the learned hidden layer representations of finite networks differ from the fixed representations of infinite networks remains incomplete. Perturbative finite-width corrections to the network prior and posterior have been studied, but the asymptotics of learned features have not been fully characterized. Here, we argue that the leading finite-width corrections to the average feature kernels for any Bayesian network with linear readout and Gaussian likelihood have a largely universal form. We illustrate this explicitly for three tractable network architectures: deep linear fully-connected and convolutional networks, and networks with a single nonlinear hidden layer. Our results begin to elucidate how task-relevant learning signals shape the hidden layer representations of wide Bayesian neural networks.
    Budgeted Combinatorial Multi-Armed Bandits. (arXiv:2202.03704v1 [cs.LG])
    We consider a budgeted combinatorial multi-armed bandit setting where, in every round, the algorithm selects a super-arm consisting of one or more arms. The goal is to minimize the total expected regret after all rounds within a limited budget. Existing techniques in this literature either fix the budget per round or fix the number of arms pulled in each round. Our setting is more general where based on the remaining budget and remaining number of rounds, the algorithm can decide how many arms to be pulled in each round. First, we propose CBwK-Greedy-UCB algorithm, which uses a greedy technique, CBwK-Greedy, to allocate the arms to the rounds. Next, we propose a reduction of this problem to Bandits with Knapsacks (BwK) with a single pull. With this reduction, we propose CBwK-LPUCB that uses PrimalDualBwK ingeniously. We rigorously prove regret bounds for CBwK-LP-UCB. We experimentally compare the two algorithms and observe that CBwK-Greedy-UCB performs incrementally better than CBwK-LP-UCB. We also show that for very high budgets, the regret goes to zero.  ( 2 min )
    What Would Jiminy Cricket Do? Towards Agents That Behave Morally. (arXiv:2110.13136v2 [cs.LG] UPDATED)
    When making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. By contrast, artificial agents are currently not endowed with a moral sense. As a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. With the advent of generally capable agents that pretrain on many environments, it will become necessary to mitigate inherited biases from environments that teach immoral behavior. To facilitate the development of agents that avoid causing wanton harm, we introduce Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of diverse, morally salient scenarios. By annotating every possible game state, the Jiminy Cricket environments robustly evaluate whether agents can act morally while maximizing reward. Using models with commonsense moral knowledge, we create an elementary artificial conscience that assesses and guides agents. In extensive experiments, we find that the artificial conscience approach can steer agents towards moral behavior without sacrificing performance.  ( 2 min )
    A Survey of Breast Cancer Screening Techniques: Thermography and Electrical Impedance Tomography. (arXiv:2202.03737v1 [eess.IV])
    Breast cancer is a disease that threatens many women's life, thus, early and accurate detection plays a key role in reducing the mortality rate. Mammography stands as the reference technique for breast cancer screening; nevertheless, many countries still lack access to mammograms due to economic, social, and cultural issues. Last advances in computational tools, infrared cameras, and devices for bio-impedance quantification allowed the development of parallel techniques like thermography, infrared imaging, and electrical impedance tomography, these being faster, reliable and cheaper. In the last decades, these have been considered as complement procedures for breast cancer diagnosis, where many studies concluded that false positive and false negative rates are greatly reduced. This work aims to review the last breakthroughs about the three above-mentioned techniques describing the benefits of mixing several computational skills to obtain a better global performance. In addition, we provide a comparison between several machine learning techniques applied to breast cancer diagnosis going from logistic regression, decision trees, and random forest to artificial, deep, and convolutional neural networks. Finally, it is mentioned several recommendations for 3D breast simulations, pre-processing techniques, biomedical devices in the research field, prediction of tumor location and size.  ( 2 min )
    SCR: Smooth Contour Regression with Geometric Priors. (arXiv:2202.03784v1 [cs.CV])
    While object detection methods traditionally make use of pixel-level masks or bounding boxes, alternative representations such as polygons or active contours have recently emerged. Among them, methods based on the regression of Fourier or Chebyshev coefficients have shown high potential on freeform objects. By defining object shapes as polar functions, they are however limited to star-shaped domains. We address this issue with SCR: a method that captures resolution-free object contours as complex periodic functions. The method offers a good compromise between accuracy and compactness thanks to the design of efficient geometric shape priors. We benchmark SCR on the popular COCO 2017 instance segmentation dataset, and show its competitiveness against existing algorithms in the field. In addition, we design a compact version of our network, which we benchmark on embedded hardware with a wide range of power targets, achieving up to real-time performance.  ( 2 min )
    Multi-Label Classification of Thoracic Diseases using Dense Convolutional Network on Chest Radiographs. (arXiv:2202.03583v1 [eess.IV])
    Chest X-ray images are one of the most common medical diagnosis techniques to identify different thoracic diseases. However, identification of pathologies in X-ray images requires skilled manpower and are often cited as a time-consuming task with varied level of interpretation, particularly in cases where the identification of disease only by images is difficult for human eyes. With recent achievements of deep learning in image classification, its application in disease diagnosis has been widely explored. This research project presents a multi-label disease diagnosis model of chest x-rays. Using Dense Convolutional Neural Network (DenseNet), the diagnosis system was able to obtain high classification predictions. The model obtained the highest AUC score of 0.896 for condition Cardiomegaly and the lowest AUC score for Nodule, 0.655. The model also localized the parts of the chest radiograph that indicated the presence of each pathology using GRADCAM, thus contributing to the model interpretability of a deep learning algorithm.  ( 2 min )
    Mapping DNN Embedding Manifolds for Network Generalization Prediction. (arXiv:2202.03868v1 [cs.CV])
    Understanding Deep Neural Network (DNN) performance in changing conditions is essential for deploying DNNs in safety critical applications with unconstrained environments, e.g., perception for self-driving vehicles or medical image analysis. Recently, the task of Network Generalization Prediction (NGP) has been proposed to predict how a DNN will generalize in a new operating domain. Previous NGP approaches have relied on labeled metadata and known distributions for the new operating domains. In this study, we propose the first NGP approach that predicts DNN performance based solely on how unlabeled images from an external operating domain map in the DNN embedding space. We demonstrate this technique for pedestrian, melanoma, and animal classification tasks and show state of the art NGP in 13 of 15 NGP tasks without requiring domain knowledge. Additionally, we show that our NGP embedding maps can be used to identify misclassified images when the DNN performance is poor.  ( 2 min )
    Maximum Likelihood Uncertainty Estimation: Robustness to Outliers. (arXiv:2202.03870v1 [cs.LG])
    We benchmark the robustness of maximum likelihood based uncertainty estimation methods to outliers in training data for regression tasks. Outliers or noisy labels in training data results in degraded performances as well as incorrect estimation of uncertainty. We propose the use of a heavy-tailed distribution (Laplace distribution) to improve the robustness to outliers. This property is evaluated using standard regression benchmarks and on a high-dimensional regression task of monocular depth estimation, both containing outliers. In particular, heavy-tailed distribution based maximum likelihood provides better uncertainty estimates, better separation in uncertainty for out-of-distribution data, as well as better detection of adversarial attacks in the presence of outliers.  ( 2 min )
    Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning. (arXiv:2202.04005v1 [cs.LG])
    Kernel-based models such as kernel ridge regression and Gaussian processes are ubiquitous in machine learning applications for regression and optimization. It is well known that a serious downside for kernel-based models is the high computational cost; given a dataset of $n$ samples, the cost grows as $\mathcal{O}(n^3)$. Existing sparse approximation methods can yield a significant reduction in the computational cost, effectively reducing the real world cost down to as low as $\mathcal{O}(n)$ in certain cases. Despite this remarkable empirical success, significant gaps remain in the existing results for the analytical confidence bounds on the error due to approximation. In this work, we provide novel confidence intervals for the Nystr\"om method and the sparse variational Gaussian processes approximation method. Our confidence intervals lead to improved error bounds in both regression and optimization. We establish these confidence intervals using novel interpretations of the approximate (surrogate) posterior variance of the models.  ( 2 min )
    Optimizing Warfarin Dosing using Deep Reinforcement Learning. (arXiv:2202.03486v1 [cs.LG])
    Warfarin is a widely used anticoagulant, and has a narrow therapeutic range. Dosing of warfarin should be individualized, since slight overdosing or underdosing can have catastrophic or even fatal consequences. Despite much research on warfarin dosing, current dosing protocols do not live up to expectations, especially for patients sensitive to warfarin. We propose a deep reinforcement learning-based dosing model for warfarin. To overcome the issue of relatively small sample sizes in dosing trials, we use a Pharmacokinetic/ Pharmacodynamic (PK/PD) model of warfarin to simulate dose-responses of virtual patients. Applying the proposed algorithm on virtual test patients shows that this model outperforms a set of clinically accepted dosing protocols by a wide margin.  ( 2 min )
    Reinforcement learning for multi-item retrieval in the puzzle-based storage system. (arXiv:2202.03424v1 [cs.LG])
    Nowadays, fast delivery services have created the need for high-density warehouses. The puzzle-based storage system is a practical way to enhance the storage density, however, facing difficulties in the retrieval process. In this work, a deep reinforcement learning algorithm, specifically the Double&Dueling Deep Q Network, is developed to solve the multi-item retrieval problem in the system with general settings, where multiple desired items, escorts, and I/O points are placed randomly. Additionally, we propose a general compact integer programming model to evaluate the solution quality. Extensive numerical experiments demonstrate that the reinforcement learning approach can yield high-quality solutions and outperforms three related state-of-the-art heuristic algorithms. Furthermore, a conversion algorithm and a decomposition framework are proposed to handle simultaneous movement and large-scale instances respectively, thus improving the applicability of the PBS system.  ( 2 min )
    Distribution Regression with Sliced Wasserstein Kernels. (arXiv:2202.03926v1 [stat.ML])
    The problem of learning functions over spaces of probabilities - or distribution regression - is gaining significant interest in the machine learning community. A key challenge behind this problem is to identify a suitable representation capturing all relevant properties of the underlying functional mapping. A principled approach to distribution regression is provided by kernel mean embeddings, which lifts kernel-induced similarity on the input domain at the probability level. This strategy effectively tackles the two-stage sampling nature of the problem, enabling one to derive estimators with strong statistical guarantees, such as universal consistency and excess risk bounds. However, kernel mean embeddings implicitly hinge on the maximum mean discrepancy (MMD), a metric on probabilities, which may fail to capture key geometrical relations between distributions. In contrast, optimal transport (OT) metrics, are potentially more appealing, as documented by the recent literature on the topic. In this work, we propose the first OT-based estimator for distribution regression. We build on the Sliced Wasserstein distance to obtain an OT-based representation. We study the theoretical properties of a kernel ridge regression estimator based on such representation, for which we prove universal consistency and excess risk bounds. Preliminary experiments complement our theoretical findings by showing the effectiveness of the proposed approach and compare it with MMD-based estimators.
    Robustness Verification for Attention Networks using Mixed Integer Programming. (arXiv:2202.03932v1 [cs.LG])
    Attention networks such as transformers have been shown powerful in many applications ranging from natural language processing to object recognition. This paper further considers their robustness properties from both theoretical and empirical perspectives. Theoretically, we formulate a variant of attention networks containing linearized layer normalization and sparsemax activation, and reduce its robustness verification to a Mixed Integer Programming problem. Apart from a na\"ive encoding, we derive tight intervals from admissible perturbation regions and examine several heuristics to speed up the verification process. More specifically, we find a novel bounding technique for sparsemax activation, which is also applicable to softmax activation in general neural networks. Empirically, we evaluate our proposed techniques with a case study on lane departure warning and demonstrate a performance gain of approximately an order of magnitude. Furthermore, although attention networks typically deliver higher accuracy than general neural networks, contrasting its robustness against a similar-sized multi-layer perceptron surprisingly shows that they are not necessarily more robust.
    Uncertainty Modeling for Out-of-Distribution Generalization. (arXiv:2202.03958v1 [cs.CV])
    Though remarkable progress has been achieved in various vision tasks, deep neural networks still suffer obvious performance degradation when tested in out-of-distribution scenarios. We argue that the feature statistics (mean and standard deviation), which carry the domain characteristics of the training data, can be properly manipulated to improve the generalization ability of deep learning models. Common methods often consider the feature statistics as deterministic values measured from the learned features and do not explicitly consider the uncertain statistics discrepancy caused by potential domain shifts during testing. In this paper, we improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training. Specifically, we hypothesize that the feature statistic, after considering the potential uncertainties, follows a multivariate Gaussian distribution. Hence, each feature statistic is no longer a deterministic value, but a probabilistic point with diverse distribution possibilities. With the uncertain feature statistics, the models can be trained to alleviate the domain perturbations and achieve better robustness against potential domain shifts. Our method can be readily integrated into networks without additional parameters. Extensive experiments demonstrate that our proposed method consistently improves the network generalization ability on multiple vision tasks, including image classification, semantic segmentation, and instance retrieval. The code will be released soon at https://github.com/lixiaotong97/DSU.
    Best of Both Worlds: Practical and Theoretically Optimal Submodular Maximization in Parallel. (arXiv:2111.07917v2 [cs.DS] UPDATED)
    For the problem of maximizing a monotone, submodular function with respect to a cardinality constraint $k$ on a ground set of size $n$, we provide an algorithm that achieves the state-of-the-art in both its empirical performance and its theoretical properties, in terms of adaptive complexity, query complexity, and approximation ratio; that is, it obtains, with high probability, query complexity of $O(n)$ in expectation, adaptivity of $O(\log(n))$, and approximation ratio of nearly $1-1/e$. The main algorithm is assembled from two components which may be of independent interest. The first component of our algorithm, LINEARSEQ, is useful as a preprocessing algorithm to improve the query complexity of many algorithms. Moreover, a variant of LINEARSEQ is shown to have adaptive complexity of $O( \log (n / k) )$ which is smaller than that of any previous algorithm in the literature. The second component is a parallelizable thresholding procedure THRESHOLDSEQ for adding elements with gain above a constant threshold. Finally, we demonstrate that our main algorithm empirically outperforms, in terms of runtime, adaptive rounds, total queries, and objective values, the previous state-of-the-art algorithm FAST in a comprehensive evaluation with six submodular objective functions.
    Deep Residual Shrinkage Networks for EMG-based Gesture Identification. (arXiv:2202.02984v2 [eess.SP] UPDATED)
    This work introduces a method for high-accuracy EMG based gesture identification. A newly developed deep learning method, namely, deep residual shrinkage network is applied to perform gesture identification. Based on the feature of EMG signal resulting from gestures, optimizations are made to improve the identification accuracy. Finally, three different algorithms are applied to compare the accuracy of EMG signal recognition with that of DRSN. The result shows that DRSN excel traditional neural networks in terms of EMG recognition accuracy. This paper provides a reliable way to classify EMG signals, as well as exploring possible applications of DRSN.
    Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning. (arXiv:2202.03599v1 [cs.LG])
    How to train deep neural networks (DNNs) to generalize well is a central concern in deep learning, especially for severely overparameterized networks nowadays. In this paper, we propose an effective method to improve the model generalization by additionally penalizing the gradient norm of loss function during optimization. We demonstrate that confining the gradient norm of loss function could help lead the optimizers towards finding flat minima. We leverage the first-order approximation to efficiently implement the corresponding gradient to fit well in the gradient descent framework. In our experiments, we confirm that when using our methods, generalization performance of various models could be improved on different datasets. Also, we show that the recent sharpness-aware minimization method \cite{DBLP:conf/iclr/ForetKMN21} is a special, but not the best, case of our method, where the best case of our method could give new state-of-art performance on these tasks.
    APPFL: Open-Source Software Framework for Privacy-Preserving Federated Learning. (arXiv:2202.03672v1 [cs.LG])
    Federated learning (FL) enables training models at different sites and updating the weights from the training instead of transferring data to a central location and training as in classical machine learning. The FL capability is especially important to domains such as biomedicine and smart grid, where data may not be shared freely or stored at a central location because of policy challenges. Thanks to the capability of learning from decentralized datasets, FL is now a rapidly growing research field, and numerous FL frameworks have been developed. In this work, we introduce APPFL, the Argonne Privacy-Preserving Federated Learning framework. APPFL allows users to leverage implemented privacy-preserving algorithms, implement new algorithms, and simulate and deploy various FL algorithms with privacy-preserving techniques. The modular framework enables users to customize the components for algorithms, privacy, communication protocols, neural network models, and user data. We also present a new communication-efficient algorithm based on an inexact alternating direction method of multipliers. The algorithm requires significantly less communication between the server and the clients than does the current state of the art. We demonstrate the computational capabilities of APPFL, including differentially private FL on various test datasets and its scalability, by using multiple algorithms and datasets on different computing environments.
    Unsupervised Source Separation via Self-Supervised Training. (arXiv:2202.03875v1 [eess.AS])
    We introduce two novel unsupervised (blind) source separation methods, which involve self-supervised training from single-channel two-source speech mixtures without any access to the ground truth source signals. Our first method employs permutation invariant training (PIT) to separate artificially-generated mixtures of the original mixtures back into the original mixtures, which we named mixture permutation invariant training (MixPIT). We found this challenging objective to be a valid proxy task for learning to separate the underlying sources. We improve upon this first method by creating mixtures of source estimates and employing PIT to separate these new mixtures in a cyclic fashion. We named this second method cyclic mixture permutation invariant training (MixCycle), where cyclic refers to the fact that we use the same model to produce artificial mixtures and to learn from them continuously. We show that MixPIT outperforms a common baseline (MixIT) on our small dataset (SC09Mix), and they have comparable performance on a standard dataset (LibriMix). Strikingly, we also show that MixCycle surpasses the performance of supervised PIT by being data-efficient, thanks to its inherent data augmentation mechanism. To the best of our knowledge, no other purely unsupervised method is able to match or exceed the performance of supervised training.
    A Deep Neural Network for SSVEP-based Brain-Computer Interfaces. (arXiv:2011.08562v3 [cs.LG] UPDATED)
    Objective: Target identification in brain-computer interface (BCI) spellers refers to the electroencephalogram (EEG) classification for predicting the target character that the subject intends to spell. When the visual stimulus of each character is tagged with a distinct frequency, the EEG records steady-state visually evoked potentials (SSVEP) whose spectrum is dominated by the harmonics of the target frequency. In this setting, we address the target identification and propose a novel deep neural network (DNN) architecture. Method: The proposed DNN processes the multi-channel SSVEP with convolutions across the sub-bands of harmonics, channels, time, and classifies at the fully connected layer. We test with two publicly available large scale (the benchmark and BETA) datasets consisting of in total 105 subjects with 40 characters. Our first stage training learns a global model by exploiting the statistical commonalities among all subjects, and the second stage fine tunes to each subject separately by exploiting the individualities. Results: Our DNN achieves impressive information transfer rates (ITRs) on both datasets, 265.23 bits/min and 196.59 bits/min, respectively, with only 0.4 seconds of stimulation. The code is available for reproducibility at https://github.com/osmanberke/Deep-SSVEP-BCI. Conclusion: The presented DNN strongly outperforms the state-of-the-art techniques as our accuracy and ITR rates are the highest ever reported performance results on these datasets. Significance: Due to its unprecedentedly high speller ITRs and flawless applicability to general SSVEP systems, our technique has great potential in various biomedical engineering settings of BCIs such as communication, rehabilitation and control.
    Thoughts on the Consistency between Ricci Flow and Neural Network Behavior. (arXiv:2111.08410v2 [cs.LG] UPDATED)
    The Ricci flow is a partial differential equation for evolving the metric in a Riemannian manifold to make it more regular. On the other hand, neural networks seem to have similar geometric behavior for specific tasks. In this paper, we construct the linearly nearly Euclidean manifold as a background to observe the evolution of Ricci flow and the training of neural networks. Under the Ricci-DeTurck flow, we prove the dynamical stability and convergence of the linearly nearly Euclidean metric for an $L^2$-Norm perturbation. In practice, from the information geometry and mirror descent points of view, we give the steepest descent gradient flow for neural networks on the linearly nearly Euclidean manifold. During the training process of the neural network, we observe that its metric will also regularly converge to the linearly nearly Euclidean metric, which is consistent with the convergent behavior of linearly nearly Euclidean metrics under the Ricci-DeTurck flow.
    Conditional Gradients for the Approximately Vanishing Ideal. (arXiv:2202.03349v2 [cs.LG] UPDATED)
    The vanishing ideal of a set of points $X\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite set of polynomials called generators. To accommodate the noise in the data set, we introduce the Conditional Gradients Approximately Vanishing Ideal algorithm (CGAVI) for the construction of the set of generators of the approximately vanishing ideal. The constructed set of generators captures polynomial structures in data and gives rise to a feature map that can, for example, be used in combination with a linear classifier for supervised learning. In CGAVI, we construct the set of generators by solving specific instances of (constrained) convex optimization problems with the Pairwise Frank-Wolfe algorithm (PFW). Among other things, the constructed generators inherit the LASSO generalization bound and not only vanish on the training but also on out-sample data. Moreover, CGAVI admits a compact representation of the approximately vanishing ideal by constructing few generators with sparse coefficient vectors.
    Class Density and Dataset Quality in High-Dimensional, Unstructured Data. (arXiv:2202.03856v1 [cs.LG])
    We provide a definition for class density that can be used to measure the aggregate similarity of the samples within each of the classes in a high-dimensional, unstructured dataset. We then put forth several candidate methods for calculating class density and analyze the correlation between the values each method produces with the corresponding individual class test accuracies achieved on a trained model. Additionally, we propose a definition for dataset quality for high-dimensional, unstructured data and show that those datasets that met a certain quality threshold (experimentally demonstrated to be > 10 for the datasets studied) were candidates for eliding redundant data based on the individual class densities.
    Transfer Learning under High-dimensional Generalized Linear Models. (arXiv:2105.14328v3 [stat.ML] UPDATED)
    In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aim to improve the fit on target data by borrowing information from useful source data. Given which sources to transfer, we propose a transfer learning algorithm on GLM, and derive its $\ell_1/\ell_2$-estimation error bounds as well as a bound for a prediction error measure. The theoretical analysis shows that when the target and source are sufficiently close to each other, these bounds could be improved over those of the classical penalized estimator using only target data under mild conditions. When we don't know which sources to transfer, an algorithm-free transferable source detection approach is introduced to detect informative sources. The detection consistency is proved under the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals of each coefficient component, and the corresponding theories are provided. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package glmtrans, which is available on CRAN.
    Assessing the Performance of Online Students -- New Data, New Approaches, Improved Accuracy. (arXiv:2109.01753v2 [cs.LG] UPDATED)
    We consider the problem of assessing the changing performance levels of individual students as they go through online courses. This student performance (SP) modeling problem is a critical step for building adaptive online teaching systems. Specifically, we conduct a study of how to utilize various types and large amounts of student log data to train accurate machine learning (ML) models that predict the performance of future students. This study is the first to use four very large sets of student data made available recently from four distinct intelligent tutoring systems. Our results include a new ML approach that defines a new state of the art for logistic regression based SP modeling, improving over earlier methods in several ways: First, we achieve improved accuracy by introducing new features that can be easily computed from conventional question-response logs (e.g., the pattern in the student 's most recent answers). Second, we take advantage of features of the student history that go beyond question-response pairs (e.g., features such as which video segments the student watched, or skipped) as well as information about prerequisite structure in the curriculum. Third, we train multiple specialized SP models for different aspects of the curriculum (e.g., specializing in early versus later segments of the student history), then combine these specialized models to create a group prediction of the SP. Taken together, these innovations yield an average AUC score across these four datasets of 0.808 compared to the previous best logistic regression approach score of 0.767, and also outperforming state-of-the-art deep neural net approaches. Importantly, we observe consistent improvements from each of our three methodological innovations, in each dataset, suggesting that our methods are of general utility and likely to produce improvements for other online tutoring systems as well.
    Systematically improving existing k-means initialization algorithms at nearly no cost, by pairwise-nearest-neighbor smoothing. (arXiv:2202.03949v1 [cs.LG])
    We present a meta-method for initializing (seeding) the $k$-means clustering algorithm called PNN-smoothing. It consists in splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that when clustering the individual subsets any seeding algorithm can be used. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, PNN-smoothing is also almost linear with an appropriate choice of $J$, and in fact only at most a few percent slower in most cases in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. It can even be applied recursively, and easily parallelized. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl
    Soft Actor-Critic with Inhibitory Networks for Faster Retraining. (arXiv:2202.02918v2 [cs.LG] UPDATED)
    Reusing previously trained models is critical in deep reinforcement learning to speed up training of new agents. However, it is unclear how to acquire new skills when objectives and constraints are in conflict with previously learned skills. Moreover, when retraining, there is an intrinsic conflict between exploiting what has already been learned and exploring new skills. In soft actor-critic (SAC) methods, a temperature parameter can be dynamically adjusted to weight the action entropy and balance the explore $\times$ exploit trade-off. However, controlling a single coefficient can be challenging within the context of retraining, even more so when goals are contradictory. In this work, inspired by neuroscience research, we propose a novel approach using inhibitory networks to allow separate and adaptive state value evaluations, as well as distinct automatic entropy tuning. Ultimately, our approach allows for controlling inhibition to handle conflict between exploiting less risky, acquired behaviors and exploring novel ones to overcome more challenging tasks. We validate our method through experiments in OpenAI Gym environments.
    Development of a deep learning platform for optimising sheet stamping geometries subject to manufacturing constraints. (arXiv:2202.03422v1 [cs.LG])
    The latest sheet stamping processes enable efficient manufacturing of complex shape structural components that have high stiffness to weight ratios, but these processes can introduce defects. To assist component design for stamping processes, this paper presents a novel deep-learning-based platform for optimising 3D component geometries. The platform adopts a non-parametric modelling approach that is capable of optimising arbitrary geometries from multiple geometric parameterisation schema. This approach features the interaction of two neural networks: 1) a geometry generator and 2) a manufacturing performance evaluator. The generator predicts continuous 3D signed distance fields (SDFs) for geometries of different classes, and each SDF is conditioned on a latent vector. The zero-level-set of each SDF implicitly represents a generated geometry. Novel training strategies for the generator are introduced and include a new loss function which is tailored for sheet stamping applications. These strategies enable the differentiable generation of high quality, large scale component geometries with tight local features for the first time. The evaluator maps a 2D projection of these generated geometries to their post-stamping physical (e.g., strain) distributions. Manufacturing constraints are imposed based on these distributions and are used to formulate a novel objective function for optimisation. A new gradient-based optimisation technique is employed to iteratively update the latent vectors, and therefore geometries, to minimise this objective function and thus meet the manufacturing constraints. Case studies based on optimising box geometries subject to a sheet thinning constraint for a hot stamping process are presented and discussed. The results show that expressive geometric changes are achievable, and that these changes are driven by stamping performance.  ( 2 min )
    Deletion Inference, Reconstruction, and Compliance in Machine (Un)Learning. (arXiv:2202.03460v1 [cs.LG])
    Privacy attacks on machine learning models aim to identify the data that is used to train such models. Such attacks, traditionally, are studied on static models that are trained once and are accessible by the adversary. Motivated to meet new legal requirements, many machine learning methods are recently extended to support machine unlearning, i.e., updating models as if certain examples are removed from their training sets, and meet new legal requirements. However, privacy attacks could potentially become more devastating in this new setting, since an attacker could now access both the original model before deletion and the new model after the deletion. In fact, the very act of deletion might make the deleted record more vulnerable to privacy attacks. Inspired by cryptographic definitions and the differential privacy framework, we formally study privacy implications of machine unlearning. We formalize (various forms of) deletion inference and deletion reconstruction attacks, in which the adversary aims to either identify which record is deleted or to reconstruct (perhaps part of) the deleted records. We then present successful deletion inference and reconstruction attacks for a variety of machine learning models and tasks such as classification, regression, and language models. Finally, we show that our attacks would provably be precluded if the schemes satisfy (variants of) Deletion Compliance (Garg, Goldwasser, and Vasudevan, Eurocrypt' 20).  ( 2 min )
    A two-step approach to leverage contextual data: speech recognition in air-traffic communications. (arXiv:2202.03725v1 [cs.CL])
    Automatic Speech Recognition (ASR), as the assistance of speech communication between pilots and air-traffic controllers, can significantly reduce the complexity of the task and increase the reliability of transmitted information. ASR application can lead to a lower number of incidents caused by misunderstanding and improve air traffic management (ATM) efficiency. Evidently, high accuracy predictions, especially, of key information, i.e., callsigns and commands, are required to minimize the risk of errors. We prove that combining the benefits of ASR and Natural Language Processing (NLP) methods to make use of surveillance data (i.e. additional modality) helps to considerably improve the recognition of callsigns (named entity). In this paper, we investigate a two-step callsign boosting approach: (1) at the 1 step (ASR), weights of probable callsign n-grams are reduced in G.fst and/or in the decoding FST (lattices), (2) at the 2 step (NLP), callsigns extracted from the improved recognition outputs with Named Entity Recognition (NER) are correlated with the surveillance data to select the most suitable one. Boosting callsign n-grams with the combination of ASR and NLP methods eventually leads up to 53.7% of an absolute, or 60.4% of a relative, improvement in callsign recognition.  ( 2 min )
    Backtrack Tie-Breaking for Decision Trees: A Note on Deodata Predictors. (arXiv:2202.03865v1 [cs.LG])
    A tie-breaking method is proposed for choosing the predicted class, or outcome, in a decision tree. The method is an adaptation of a similar technique used for deodata predictors.  ( 1 min )
    On Continuous Integration / Continuous Delivery for Automated Deployment of Machine Learning Models using MLOps. (arXiv:2202.03541v1 [cs.SE])
    Model deployment in machine learning has emerged as an intriguing field of research in recent years. It is comparable to the procedure defined for conventional software development. Continuous Integration and Continuous Delivery (CI/CD) have been shown to smooth down software advancement and speed up businesses when used in conjunction with development and operations (DevOps). Using CI/CD pipelines in an application that includes Machine Learning Operations (MLOps) components, on the other hand, has difficult difficulties, and pioneers in the area solve them by using unique tools, which is typically provided by cloud providers. This research provides a more in-depth look at the machine learning lifecycle and the key distinctions between DevOps and MLOps. In the MLOps approach, we discuss tools and approaches for executing the CI/CD pipeline of machine learning frameworks. Following that, we take a deep look into push and pull-based deployments in Github Operations (GitOps). Open exploration issues are also identified and added, which may guide future study.  ( 2 min )
    Conformal prediction for the design problem. (arXiv:2202.03613v1 [cs.LG])
    In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next. For example, in the protein design problem, we have a regression model that predicts some real-valued property of a protein sequence, which we use to propose new sequences believed to exhibit higher property values than observed in the training data. Since validating designed sequences in the wet lab is typically costly, it is important to know how much we can trust the model's predictions. In such settings, however, there is a distinct type of distribution shift between the training and test data: one where the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data -- that is, the designed sequences -- has some non-trivial relationship with its error on the training data. Herein, we introduce a method to quantify predictive uncertainty in such settings. We do so by constructing confidence sets for predictions that account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any prediction algorithm, even when a trained model chooses the test-time input distribution. As a motivating use case, we demonstrate how our method quantifies uncertainty for the predicted fitness of designed protein using several real data sets.  ( 2 min )
    An Improved Analysis of Gradient Tracking for Decentralized Machine Learning. (arXiv:2202.03836v1 [cs.DC])
    We consider decentralized machine learning over a network where the training data is distributed across $n$ agents, each of which can compute stochastic model updates on their local data. The agent's common goal is to find a model that minimizes the average of all local loss functions. While gradient tracking (GT) algorithms can overcome a key challenge, namely accounting for differences between workers' local data distributions, the known convergence rates for GT algorithms are not optimal with respect to their dependence on the mixing parameter $p$ (related to the spectral gap of the connectivity matrix). We provide a tighter analysis of the GT method in the stochastic strongly convex, convex and non-convex settings. We improve the dependency on $p$ from $\mathcal{O}(p^{-2})$ to $\mathcal{O}(p^{-1}c^{-1})$ in the noiseless case and from $\mathcal{O}(p^{-3/2})$ to $\mathcal{O}(p^{-1/2}c^{-1})$ in the general stochastic case, where $c \geq p$ is related to the negative eigenvalues of the connectivity matrix (and is a constant in most practical applications). This improvement was possible due to a new proof technique which could be of independent interest.  ( 2 min )
    GraphDCA -- a Framework for Node Distribution Comparison in Real and Synthetic Graphs. (arXiv:2202.03884v1 [cs.LG])
    We argue that when comparing two graphs, the distribution of node structural features is more informative than global graph statistics which are often used in practice, especially to evaluate graph generative models. Thus, we present GraphDCA - a framework for evaluating similarity between graphs based on the alignment of their respective node representation sets. The sets are compared using a recently proposed method for comparing representation spaces, called Delaunay Component Analysis (DCA), which we extend to graph data. To evaluate our framework, we generate a benchmark dataset of graphs exhibiting different structural patterns and show, using three node structure feature extractors, that GraphDCA recognizes graphs with both similar and dissimilar local structure. We then apply our framework to evaluate three publicly available real-world graph datasets and demonstrate, using gradual edge perturbations, that GraphDCA satisfyingly captures gradually decreasing similarity, unlike global statistics. Finally, we use GraphDCA to evaluate two state-of-the-art graph generative models, NetGAN and CELL, and conclude that further improvements are needed for these models to adequately reproduce local structural features.  ( 2 min )
    Calibrated Learning to Defer with One-vs-All Classifiers. (arXiv:2202.03673v1 [cs.LG])
    The learning to defer (L2D) framework has the potential to make AI systems safer. For a given input, the system can defer the decision to a human if the human is more likely than the model to take the correct action. We study the calibration of L2D systems, investigating if the probabilities they output are sound. We find that Mozannar & Sontag's (2020) multiclass framework is not calibrated with respect to expert correctness. Moreover, it is not even guaranteed to produce valid probabilities due to its parameterization being degenerate for this purpose. We propose an L2D system based on one-vs-all classifiers that is able to produce calibrated probabilities of expert correctness. Furthermore, our loss function is also a consistent surrogate for multiclass L2D, like Mozannar & Sontag's (2020). Our experiments verify that not only is our system calibrated, but this benefit comes at no cost to accuracy. Our model's accuracy is always comparable (and often superior) to Mozannar & Sontag's (2020) model's in tasks ranging from hate speech detection to galaxy classification to diagnosis of skin lesions.
    Locally Random P-adic Alloy Codes with ChannelCoding Theorems for Distributed Coded Tensors. (arXiv:2202.03469v1 [cs.IT])
    Tensors, i.e., multi-linear functions, are a fundamental building block of machine learning algorithms. In order to train on large data-sets, it is common practice to distribute the computation amongst workers. However, stragglers and other faults can severely impact the performance and overall training time. A novel strategy to mitigate these failures is the use of coded computation. We introduce a new metric for analysis called the typical recovery threshold, which focuses on the most likely event and provide a novel construction of distributed coded tensor operations which are optimal with this measure. We show that our general framework encompasses many other computational schemes and metrics as a special case. In particular, we prove that the recovery threshold and the tensor rank can be recovered as a special case of the typical recovery threshold when the probability of noise, i.e., a fault, is equal to zero, thereby providing a noisy generalization of noiseless computation as a serendipitous result. Far from being a purely theoretical construction, these definitions lead us to practical random code constructions, i.e., locally random p-adic alloy codes, which are optimal with respect to the measures. We analyze experiments conducted on Amazon EC2 and establish that they are faster and more numerically stable than many other benchmark computation schemes in practice, as is predicted by theory.
    Particle Transformer for Jet Tagging. (arXiv:2202.03772v1 [hep-ph])
    Jet tagging is a critical yet challenging classification task in particle physics. While deep learning has transformed jet tagging and significantly improved performance, the lack of a large-scale public dataset impedes further enhancement. In this work, we present JetClass, a new comprehensive dataset for jet tagging. The JetClass dataset consists of 100 M jets, about two orders of magnitude larger than existing public datasets. A total of 10 types of jets are simulated, including several types unexplored for tagging so far. Based on the large dataset, we propose a new Transformer-based architecture for jet tagging, called Particle Transformer (ParT). By incorporating pairwise particle interactions in the attention mechanism, ParT achieves higher tagging performance than a plain Transformer and surpasses the previous state-of-the-art, ParticleNet, by a large margin. The pre-trained ParT models, once fine-tuned, also substantially enhance the performance on two widely adopted jet tagging benchmarks.
    COVID-19 Hospitalizations Forecasts Using Internet Search Data. (arXiv:2202.03869v1 [cs.LG])
    As the COVID-19 spread over the globe and new variants of COVID-19 keep occurring, reliable real-time forecasts of COVID-19 hospitalizations are critical for public health decision on medical resources allocations such as ICU beds, ventilators, and personnel to prepare for the surge of COVID-19 pandemics. Inspired by the strong association between public search behavior and hospitalization admission, we extended previously-proposed influenza tracking model, ARGO (AutoRegression with GOogle search data), to predict future 2-week national and state-level COVID-19 new hospital admissions. Leveraging the COVID-19 related time series information and Google search data, our method is able to robustly capture new COVID-19 variants' surges, and self-correct at both national and state level. Based on our retrospective out-of-sample evaluation over 12-month comparison period, our method achieves on average 15\% error reduction over the best alternative models collected from COVID-19 forecast hub. Overall, we showed that our method is flexible, self-correcting, robust, accurate, and interpretable, making it a potentially powerful tool to assist health-care officials and decision making for the current and future infectious disease outbreak.
    A Unified Prediction Framework for Signal Maps. (arXiv:2202.03679v1 [cs.LG])
    Signal maps are essential for the planning and operation of cellular networks. However, the measurements needed to create such maps are expensive, often biased, not always reflecting the metrics of interest, and posing privacy risks. In this paper, we develop a unified framework for predicting cellular signal maps from limited measurements. We propose and combine three mechanisms that deal with the fact that not all measurements are equally important for a particular prediction task. First, we design \emph{quality-of-service functions ($Q$)}, including signal strength (RSRP) but also other metrics of interest, such as coverage (improving recall by 76\%-92\%) and call drop probability (reducing error by as much as 32\%). By implicitly altering the training loss function, quality functions can also improve prediction for RSRP itself where it matters (e.g. MSE reduction up to 27\% in the low signal strength regime, where errors are critical). Second, we introduce \emph{weight functions} ($W$) to specify the relative importance of prediction at different parts of the feature space. We propose re-weighting based on importance sampling to obtain unbiased estimators when the sampling and target distributions mismatch(yielding 20\% improvement for targets on spatially uniform loss or on user population density). Third, we apply the {\em Data Shapley} framework for the first time in this context: to assign values ($\phi$) to individual measurement points, which capture the importance of their contribution to the prediction task. This can improve prediction (e.g. from 64\% to 94\% in recall for coverage loss) by removing points with negative values, and can also enable data minimization (i.e. we show that we can remove 70\% of data w/o loss in performance). We evaluate our methods and demonstrate significant improvement in prediction performance, using several real-world datasets.
    Scaling Out-of-Distribution Detection for Real-World Settings. (arXiv:1911.11132v3 [cs.CV] UPDATED)
    Detecting out-of-distribution examples is important for safety-critical machine learning applications such as detecting novel biological phenomena and self-driving cars. However, existing research mainly focuses on simple small-scale settings. To set the stage for more realistic out-of-distribution detection, we depart from small-scale settings and explore large-scale multiclass and multi-label settings with high-resolution images and thousands of classes. To make future work in real-world settings possible, we create new benchmarks for three large-scale settings. To test ImageNet multiclass anomaly detectors, we introduce the Species dataset containing over 700,000 images and over a thousand anomalous species. We leverage ImageNet-21K to evaluate PASCAL VOC and COCO multilabel anomaly detectors. Third, we introduce a new benchmark for anomaly segmentation by introducing a segmentation benchmark with road anomalies. We conduct extensive experiments in these more realistic settings for out-of-distribution detection and find that a surprisingly simple detector based on the maximum logit outperforms prior methods in all the large-scale multi-class, multi-label, and segmentation tasks, establishing a simple new baseline for future work.
    Quantifying Local Specialization in Deep Neural Networks. (arXiv:2110.08058v2 [cs.LG] UPDATED)
    A neural network is locally specialized to the extent that parts of its computational graph (i.e. structure) can be abstractly represented as performing some comprehensible sub-task relevant to the overall task (i.e. functionality). Are modern deep neural networks locally specialized? How can this be quantified? In this paper, we consider the problem of taking a neural network whose neurons are partitioned into clusters, and quantifying how functionally specialized the clusters are. We propose two proxies for this: importance, which reflects how crucial sets of neurons are to network performance; and coherence, which reflects how consistently their neurons associate with features of the inputs. To measure these proxies, we develop a set of statistical methods based on techniques conventionally used to interpret individual neurons. We apply the proxies to partitionings generated by spectrally clustering a graph representation of the network's neurons with edges determined either by network weights or correlations of activations. We show that these partitionings, even ones based only on weights (i.e. strictly from non-runtime analysis), reveal groups of neurons that are important and coherent. These results suggest that graph-based partitioning can reveal local specialization and that statistical methods can be used to automatedly screen for sets of neurons that can be understood abstractly.
    Universality of parametric Coupling Flows over parametric diffeomorphisms. (arXiv:2202.02906v2 [cs.LG] UPDATED)
    Invertible neural networks based on Coupling Flows CFlows) have various applications such as image synthesis and data compression. The approximation universality for CFlows is of paramount importance to ensure the model expressiveness. In this paper, we prove that CFlows can approximate any diffeomorphism in C^k-norm if its layers can approximate certain single-coordinate transforms. Specifically, we derive that a composition of affine coupling layers and invertible linear transforms achieves this universality. Furthermore, in parametric cases where the diffeomorphism depends on some extra parameters, we prove the corresponding approximation theorems for our proposed parametric coupling flows named Para-CFlows. In practice, we apply Para-CFlows as a neural surrogate model in contextual Bayesian optimization tasks, to demonstrate its superiority over other neural surrogate models in terms of optimization performance.
    Linear Time Kernel Matrix Approximation via Hyperspherical Harmonics. (arXiv:2202.03655v1 [cs.LG])
    We propose a new technique for constructing low-rank approximations of matrices that arise in kernel methods for machine learning. Our approach pairs a novel automatically constructed analytic expansion of the underlying kernel function with a data-dependent compression step to further optimize the approximation. This procedure works in linear time and is applicable to any isotropic kernel. Moreover, our method accepts the desired error tolerance as input, in contrast to prevalent methods which accept the rank as input. Experimental results show our approach compares favorably to the commonly used Nystrom method with respect to both accuracy for a given rank and computational time for a given accuracy across a variety of kernels, dimensions, and datasets. Notably, in many of these problem settings our approach produces near-optimal low-rank approximations. We provide an efficient open-source implementation of our new technique to complement our theoretical developments and experimental results.  ( 2 min )
    Fracture Detection in Wrist X-ray Images Using Deep Learning-Based Object Detection Models. (arXiv:2111.07355v3 [eess.IV] UPDATED)
    Hospitals, especially their emergency services, receive a high number of wrist fracture cases. For correct diagnosis and proper treatment of these, images obtained from various medical equipment must be viewed by physicians, along with the patients medical records and physical examination. The aim of this study is to perform fracture detection by use of deep learning on wrist Xray images to support physicians in the diagnosis of these fractures, particularly in the emergency services. Using SABL, RegNet, RetinaNet, PAA, Libra R_CNN, FSAF, Faster R_CNN, Dynamic R_CNN and DCN deep learning based object detection models with various backbones, 20 different fracture detection procedures were performed on Gazi University Hospitals dataset of wrist Xray images. To further improve these procedures, five different ensemble models were developed and then used to reform an ensemble model to develop a unique detection model, wrist fracture detection_combo (WFD_C). From 26 different models for fracture detection, the highest detection result obtained was 0.8639 average precision (AP50) in the WFD-C model. Huawei Turkey R&D Center supports this study within the scope of the ongoing cooperation project coded 071813 between Gazi University, Huawei and Medskor. Code is available at https://github.com/fatihuysal88/wrist-d
    Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent. (arXiv:2112.07110v7 [stat.ML] UPDATED)
    The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties (e.g. escaping low potential points and regularization). Past research has indicated that the covariance of the SGD error done via minibatching plays a critical role in determining its regularization and escape from low potential points. It is however not much explored how much the distribution of the error influences the behavior of the algorithm. Motivated by some new research in this area, we prove universality results by showing that noise classes that have the same mean and covariance structure of SGD via minibatching have similar properties. We mainly consider the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm as introduced by Wu et al., which has a much more general noise class than the SGD algorithm done via minibatching. We establish nonasymptotic bounds for the M-SGD algorithm mainly with respect to the Stochastic Differential Equation corresponding to SGD via minibatching. We also show that the M-SGD error is approximately a scaled Gaussian distribution with mean $0$ at any fixed point of the M-SGD algorithm. We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
    FisrEbp: Enterprise Bankruptcy Prediction via Fusing its Intra-risk and Spillover-Risk. (arXiv:2202.03874v1 [q-fin.RM])
    In this paper, we propose to model enterprise bankruptcy risk by fusing its intra-risk and spillover-risk. Under this framework, we propose a novel method that is equipped with an LSTM-based intra-risk encoder and GNNs-based spillover-risk encoder. Specifically, the intra-risk encoder is able to capture enterprise intra-risk using the statistic correlated indicators from the basic business information and litigation information. The spillover-risk encoder consists of hypergraph neural networks and heterogeneous graph neural networks, which aim to model spillover risk through two aspects, i.e. hyperedge and multiplex heterogeneous relations among enterprise knowledge graph, respectively. To evaluate the proposed model, we collect multi-sources SMEs data and build a new dataset SMEsD, on which the experimental results demonstrate the superiority of the proposed method. The dataset is expected to become a significant benchmark dataset for SMEs bankruptcy prediction and promote the development of financial risk study further.  ( 2 min )
    Machine learning method for light field refocusing. (arXiv:2103.16020v2 [eess.IV] UPDATED)
    Light field imaging introduced the capability to refocus an image after capturing. Currently there are two popular methods for refocusing, shift-and-sum and Fourier slice methods. Neither of these two methods can refocus the light field in real-time without any pre-processing. In this paper we introduce a machine learning based refocusing technique that is capable of extracting 16 refocused images with refocusing parameters of \alpha=0.125,0.250,0.375,...,2.0 in real-time. We have trained our network, which is called RefNet, in two experiments. Once using the Fourier slice method as the training -- i.e., "ground truth" -- data and another using the shift-and-sum method as the training data. We showed that in both cases, not only is the RefNet method at least 134x faster than previous approaches, but also the color prediction of RefNet is superior to both Fourier slice and shift-and-sum methods while having similar depth of field and focus distance performance.
    Low Precision Decentralized Distributed Training over IID and non-IID Data. (arXiv:2111.09389v2 [cs.LG] UPDATED)
    Decentralized distributed learning is the key to enabling large-scale machine learning (training) on the edge devices utilizing private user-generated local data, without relying on the cloud. However, the practical realization of such on-device training is limited by the communication and compute bottleneck. In this paper, we propose and show the convergence of low precision decentralized training that aims to reduce the computational complexity and communication cost of decentralized training. Many feedback-based compression techniques have been proposed in the literature to reduce communication costs. To the best of our knowledge, there is no work that applies and shows compute efficient training techniques such quantization, pruning, etc., for peer-to-peer decentralized learning setups. Since real-world applications have a significant skew in the data distribution, we design "Range-EvoNorm" as the normalization activation layer which is better suited for low precision training over non-IID data. Moreover, we show that the proposed low precision training can be used in synergy with other communication compression methods decreasing the communication cost further. Our experiments indicate that 8-bit decentralized training has minimal accuracy loss compared to its full precision counterpart even with non-IID data. However, when low precision training is accompanied by communication compression through sparsification we observe a 1-2% drop in accuracy. The proposed low precision decentralized training decreases computational complexity, memory usage, and communication cost by 4x and compute energy by a factor of ~20x, while trading off less than a $1\%$ accuracy for both IID and non-IID data. In particular, with higher skew values, we observe an increase in accuracy (by ~ 0.5%) with low precision training, indicating the regularization effect of the quantization.
    Decision boundaries and convex hulls in the feature space that deep learning functions learn from images. (arXiv:2202.04052v1 [cs.CV])
    The success of deep neural networks in image classification and learning can be partly attributed to the features they extract from images. It is often speculated about the properties of a low-dimensional manifold that models extract and learn from images. However, there is not sufficient understanding about this low-dimensional space based on theory or empirical evidence. For image classification models, their last hidden layer is the one where images of each class is separated from other classes and it also has the least number of features. Here, we develop methods and formulations to study that feature space for any model. We study the partitioning of the domain in feature space, identify regions guaranteed to have certain classifications, and investigate its implications for the pixel space. We observe that geometric arrangements of decision boundaries in feature space is significantly different compared to pixel space, providing insights about adversarial vulnerabilities, image morphing, extrapolation, ambiguity in classification, and the mathematical understanding of image classification models.
    TACTiS: Transformer-Attentional Copulas for Time Series. (arXiv:2202.03528v1 [cs.LG])
    The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance. However, the practical utility of such estimates is limited by how accurately they quantify predictive uncertainty. In this work, we address the problem of estimating the joint predictive distribution of high-dimensional multivariate time series. We propose a versatile method, based on the transformer architecture, that estimates joint distributions using an attention-based decoder that provably learns to mimic the properties of non-parametric copulas. The resulting model has several desirable properties: it can scale to hundreds of time series, supports both forecasting and interpolation, can handle unaligned and non-uniformly sampled data, and can seamlessly adapt to missing data during training. We demonstrate these properties empirically and show that our model produces state-of-the-art predictions on several real-world datasets.  ( 2 min )
    Evaluating Robustness of Cooperative MARL: A Model-based Approach. (arXiv:2202.03558v1 [cs.LG])
    In recent years, a proliferation of methods were developed for cooperative multi-agent reinforcement learning (c-MARL). However, the robustness of c-MARL agents against adversarial attacks has been rarely explored. In this paper, we propose to evaluate the robustness of c-MARL agents via a model-based approach. Our proposed formulation can craft stronger adversarial state perturbations of c-MARL agents(s) to lower total team rewards more than existing model-free approaches. In addition, we propose the first victim-agent selection strategy which allows us to develop even stronger adversarial attack. Numerical experiments on multi-agent MuJoCo benchmarks illustrate the advantage of our approach over other baselines. The proposed model-based attack consistently outperforms other baselines in all tested environments.
    Self-supervised Speaker Recognition Training Using Human-Machine Dialogues. (arXiv:2202.03484v1 [cs.LG])
    Speaker recognition, recognizing speaker identities based on voice alone, enables important downstream applications, such as personalization and authentication. Learning speaker representations, in the context of supervised learning, heavily depends on both clean and sufficient labeled data, which is always difficult to acquire. Noisy unlabeled data, on the other hand, also provides valuable information that can be exploited using self-supervised training methods. In this work, we investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices. However, the supervisory information in such dialogues is inherently noisy, as multiple speakers may speak to a device in the course of the same dialogue. To address this issue, we propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity. Both reconstruction-based and contrastive-learning-based self-supervised methods are compared. Experiments demonstrate that the proposed method provides significant performance improvements, superior to earlier work. Dialogue pretraining when combined with the rejection mechanism yields 27.10% equal error rate (EER) reduction in speaker recognition, compared to a model without self-supervised pretraining.
    Comparative Study Between Distance Measures On Supervised Optimum-Path Forest Classification. (arXiv:2202.03854v1 [cs.LG])
    Machine Learning has attracted considerable attention throughout the past decade due to its potential to solve far-reaching tasks, such as image classification, object recognition, anomaly detection, and data forecasting. A standard approach to tackle such applications is based on supervised learning, which is assisted by large sets of labeled data and is conducted by the so-called classifiers, such as Logistic Regression, Decision Trees, Random Forests, and Support Vector Machines, among others. An alternative to traditional classifiers is the parameterless Optimum-Path Forest (OPF), which uses a graph-based methodology and a distance measure to create arcs between nodes and hence sets of trees, responsible for conquering the nodes, defining their labels, and shaping the forests. Nevertheless, its performance is strongly associated with an appropriate distance measure, which may vary according to the dataset's nature. Therefore, this work proposes a comparative study over a wide range of distance measures applied to the supervised Optimum-Path Forest classification. The experimental results are conducted using well-known literature datasets and compared across benchmarking classifiers, illustrating OPF's ability to adapt to distinct domains.
    Learnability Lock: Authorized Learnability Control Through Adversarial Invertible Transformations. (arXiv:2202.03576v1 [cs.LG])
    Owing much to the revolution of information technology, the recent progress of deep learning benefits incredibly from the vastly enhanced access to data available in various digital formats. However, in certain scenarios, people may not want their data being used for training commercial models and thus studied how to attack the learnability of deep learning models. Previous works on learnability attack only consider the goal of preventing unauthorized exploitation on the specific dataset but not the process of restoring the learnability for authorized cases. To tackle this issue, this paper introduces and investigates a new concept called "learnability lock" for controlling the model's learnability on a specific dataset with a special key. In particular, we propose adversarial invertible transformation, that can be viewed as a mapping from image to image, to slightly modify data samples so that they become "unlearnable" by machine learning models with negligible loss of visual features. Meanwhile, one can unlock the learnability of the dataset and train models normally using the corresponding key. The proposed learnability lock leverages class-wise perturbation that applies a universal transformation function on data samples of the same label. This ensures that the learnability can be easily restored with a simple inverse transformation while remaining difficult to be detected or reverse-engineered. We empirically demonstrate the success and practicability of our method on visual classification tasks.
    From Generalisation Error to Transportation-cost Inequalities and Back. (arXiv:2202.03956v1 [cs.IT])
    In this work, we connect the problem of bounding the expected generalisation error with transportation-cost inequalities. Exposing the underlying pattern behind both approaches we are able to generalise them and go beyond Kullback-Leibler Divergences/Mutual Information and sub-Gaussian measures. In particular, we are able to provide a result showing the equivalence between two families of inequalities: one involving functionals and one involving measures. This result generalises the one proposed by Bobkov and G\"otze that connects transportation-cost inequalities with concentration of measure. Moreover, it allows us to recover all standard generalisation error bounds involving mutual information and to introduce new, more general bounds, that involve arbitrary divergence measures.
    Learning Sinkhorn divergences for supervised change point detection. (arXiv:2202.04000v1 [cs.LG])
    Many modern applications require detecting change points in complex sequential data. Most existing methods for change point detection are unsupervised and, as a consequence, lack any information regarding what kind of changes we want to detect or if some kinds of changes are safe to ignore. This often results in poor change detection performance. We present a novel change point detection framework that uses true change point instances as supervision for learning a ground metric such that Sinkhorn divergences can be then used in two-sample tests on sliding windows to detect change points in an online manner. Our method can be used to learn a sparse metric which can be useful for both feature selection and interpretation in high-dimensional change point detection settings. Experiments on simulated as well as real world sequences show that our proposed method can substantially improve change point detection performance over existing unsupervised change point detection methods using only few labeled change point instances.
    HistBERT: A Pre-trained Language Model for Diachronic Lexical Semantic Analysis. (arXiv:2202.03612v1 [cs.CL])
    Contextualized word embeddings have demonstrated state-of-the-art performance in various natural language processing tasks including those that concern historical semantic change. However, language models such as BERT was trained primarily on contemporary corpus data. To investigate whether training on historical corpus data improves diachronic semantic analysis, we present a pre-trained BERT-based language model, HistBERT, trained on the balanced Corpus of Historical American English. We examine the effectiveness of our approach by comparing the performance of the original BERT and that of HistBERT, and we report promising results in word similarity and semantic shift analysis. Our work suggests that the effectiveness of contextual embeddings in diachronic semantic analysis is dependent on the temporal profile of the input text and care should be taken in applying this methodology to study historical semantic change.  ( 2 min )
    Results and findings of the 2021 Image Similarity Challenge. (arXiv:2202.04007v1 [cs.CV])
    The 2021 Image Similarity Challenge introduced a dataset to serve as a new benchmark to evaluate recent image copy detection methods. There were 200 participants to the competition. This paper presents a quantitative and qualitative analysis of the top submissions. It appears that the most difficult image transformations involve either severe image crops or hiding into unrelated images, combined with local pixel perturbations. The key algorithmic elements in the winning submissions are: training on strong augmentations, self-supervised learning, score normalization, explicit overlay detection, and global descriptor matching followed by pairwise image comparison.
    Improving the Sample-Complexity of Deep Classification Networks with Invariant Integration. (arXiv:2202.03967v1 [cs.LG])
    Leveraging prior knowledge on intraclass variance due to transformations is a powerful method to improve the sample complexity of deep neural networks. This makes them applicable to practically important use-cases where training data is scarce. Rather than being learned, this knowledge can be embedded by enforcing invariance to those transformations. Invariance can be imposed using group-equivariant convolutions followed by a pooling operation. For rotation-invariance, previous work investigated replacing the spatial pooling operation with invariant integration which explicitly constructs invariant representations. Invariant integration uses monomials which are selected using an iterative approach requiring expensive pre-training. We propose a novel monomial selection algorithm based on pruning methods to allow an application to more complex problems. Additionally, we replace monomials with different functions such as weighted sums, multi-layer perceptrons and self-attention, thereby streamlining the training of invariant-integration-based architectures. We demonstrate the improved sample complexity on the Rotated-MNIST, SVHN and CIFAR-10 datasets where rotation-invariant-integration-based Wide-ResNet architectures using monomials and weighted sums outperform the respective baselines in the limited sample regime. We achieve state-of-the-art results using full data on Rotated-MNIST and SVHN where rotation is a main source of intraclass variation. On STL-10 we outperform a standard and a rotation-equivariant convolutional neural network using pooling.
    Data Consistency for Weakly Supervised Learning. (arXiv:2202.03987v1 [cs.LG])
    In many applications, training machine learning models involves using large amounts of human-annotated data. Obtaining precise labels for the data is expensive. Instead, training with weak supervision provides a low-cost alternative. We propose a novel weak supervision algorithm that processes noisy labels, i.e., weak signals, while also considering features of the training data to produce accurate labels for training. Our method searches over classifiers of the data representation to find plausible labelings. We call this paradigm data consistent weak supervision. A key facet of our framework is that we are able to estimate labels for data examples low or no coverage from the weak supervision. In addition, we make no assumptions about the joint distribution of the weak signals and true labels of the data. Instead, we use weak signals and the data features to solve a constrained optimization that enforces data consistency among the labels we generate. Empirical evaluation of our method on different datasets shows that it significantly outperforms state-of-the-art weak supervision methods on both text and image classification tasks.
    Divergent representations of ethological visual inputs emerge from supervised, unsupervised, and reinforcement learning. (arXiv:2112.02027v2 [q-bio.NC] UPDATED)
    Artificial neural systems trained using reinforcement, supervised, and unsupervised learning all acquire internal representations of high dimensional input. To what extent these representations depend on the different learning objectives is largely unknown. Here we compare the representations learned by eight different convolutional neural networks, each with identical ResNet architectures and trained on the same family of egocentric images, but embedded within different learning systems. Specifically, the representations are trained to guide action in a compound reinforcement learning task; to predict one or a combination of three task-related targets with supervision; or using one of three different unsupervised objectives. Using representational similarity analysis, we find that the network trained with reinforcement learning differs most from the other networks. Using metrics inspired by the neuroscience literature, we find that the model trained with reinforcement learning has a sparse and high-dimensional representation wherein individual images are represented with very different patterns of neural activity. Further analysis suggests these representations may arise in order to guide long-term behavior and goal-seeking in the RL agent. Finally, we compare the representations learned by the RL agent to neural activity from mouse visual cortex and find it to perform as well or better than other models. Our results provide insights into how the properties of neural representations are influenced by objective functions and can inform transfer learning approaches.
    PatClArC: Using Pattern Concept Activation Vectors for Noise-Robust Model Debugging. (arXiv:2202.03482v1 [cs.CV])
    State-of-the-art machine learning models are commonly (pre-)trained on large benchmark datasets. These often contain biases, artifacts, or errors that have remained unnoticed in the data collection process and therefore fail in representing the real world truthfully. This can cause models trained on these datasets to learn undesired behavior based upon spurious correlations, e.g., the existence of a copyright tag in an image. Concept Activation Vectors (CAV) have been proposed as a tool to model known concepts in latent space and have been used for concept sensitivity testing and model correction. Specifically, class artifact compensation (ClArC) corrects models using CAVs to represent data artifacts in feature space linearly. Modeling CAVs with filters of linear models, however, causes a significant influence of the noise portion within the data, as recent work proposes the unsuitability of linear model filters to find the signal direction in the input, which can be avoided by instead using patterns. In this paper we propose Pattern Concept Activation Vectors (PCAV) for noise-robust concept representations in latent space. We demonstrate that pattern-based artifact modeling has beneficial effects on the application of CAVs as a means to remove influence of confounding features from models via the ClArC framework.
    Efficiently Escaping Saddle Points in Bilevel Optimization. (arXiv:2202.03684v1 [cs.LG])
    Bilevel optimization is one of the fundamental problems in machine learning and optimization. Recent theoretical developments in bilevel optimization focus on finding the first-order stationary points for nonconvex-strongly-convex cases. In this paper, we analyze algorithms that can escape saddle points in nonconvex-strongly-convex bilevel optimization. Specifically, we show that the perturbed approximate implicit differentiation (AID) with a warm start strategy finds $\epsilon$-approximate local minimum of bilevel optimization in $\tilde{O}(\epsilon^{-2})$ iterations with high probability. Moreover, we propose an inexact NEgative-curvature-Originated-from-Noise Algorithm (iNEON), a pure first-order algorithm that can escape saddle point and find local minimum of stochastic bilevel optimization. As a by-product, we provide the first nonasymptotic analysis of perturbed multi-step gradient descent ascent (GDmax) algorithm that converges to local minimax point for minimax problems.
    Locally Sparse Neural Networks for Tabular Biomedical Data. (arXiv:2106.06468v2 [cs.LG] UPDATED)
    Tabular datasets with low-sample-size or many variables are prevalent in biomedicine. Practitioners in this domain prefer linear or tree-based models over neural networks since the latter are harder to interpret and tend to overfit when applied to tabular datasets. To address these neural networks' shortcomings, we propose an intrinsically interpretable network for heterogeneous biomedical data. We design a locally sparse neural network where the local sparsity is learned to identify the subset of most relevant features for each sample. This sample-specific sparsity is predicted via a \textit{gating} network, which is trained in tandem with the \textit{prediction} network. By forcing the model to select a subset of the most informative features for each sample, we reduce model overfitting in low-sample-size data and obtain an interpretable model. We demonstrate that our method outperforms state-of-the-art models when applied to synthetic or real-world biomedical datasets using extensive experiments. Furthermore, the proposed framework dramatically outperforms existing schemes when evaluating its interpretability capabilities. Finally, we demonstrate the applicability of our model to two important biomedical tasks: survival analysis and marker gene identification.
    Kohn-Sham regularizer for spin density functional theory and weakly correlated systems. (arXiv:2110.14846v3 [physics.chem-ph] UPDATED)
    Kohn-Sham regularizer (KSR) is a machine learning approach that optimizes a physics-informed exchange-correlation functional within a differentiable Kohn-Sham density functional theory (DFT) framework. We generalize KSR to spin DFT and create local, semilocal, and nonlocal approximations for the exchange-correlation functional. We explore KSR for weakly correlated systems, by training on atoms and testing on molecules at equilibrium. The generalization error from our semilocal approximation is comparable to other differentiable approaches. Our nonlocal functional outperforms any existing machine learning functionals by predicting the ground-state energies of the test systems with a mean absolute error of 2.7 milli-Hartrees.
    CAUSPref: Causal Preference Learning for Out-of-Distribution Recommendation. (arXiv:2202.03984v1 [cs.LG])
    In spite of the tremendous development of recommender system owing to the progressive capability of machine learning recently, the current recommender system is still vulnerable to the distribution shift of users and items in realistic scenarios, leading to the sharp decline of performance in testing environments. It is even more severe in many common applications where only the implicit feedback from sparse data is available. Hence, it is crucial to promote the performance stability of recommendation method in different environments. In this work, we first make a thorough analysis of implicit recommendation problem from the viewpoint of out-of-distribution (OOD) generalization. Then under the guidance of our theoretical analysis, we propose to incorporate the recommendation-specific DAG learner into a novel causal preference-based recommendation framework named CAUSPref, mainly consisting of causal learning of invariant user preference and anti-preference negative sampling to deal with implicit feedback. Extensive experimental results from real-world datasets clearly demonstrate that our approach surpasses the benchmark models significantly under types of out-of-distribution settings, and show its impressive interpretability.
    Efficient Algorithms for High-Dimensional Convex Subspace Optimization via Strict Complementarity. (arXiv:2202.04020v1 [math.OC])
    We consider optimization problems in which the goal is find a $k$-dimensional subspace of $\reals^n$, $k<<n$, which minimizes a convex and smooth loss. Such problemsgeneralize the fundamental task of principal component analysis (PCA) to include robust and sparse counterparts, and logistic PCA for binary data, among others. While this problem is not convex it admits natural algorithms with very efficient iterations and memory requirements, which is highly desired in high-dimensional regimes however, arguing about their fast convergence to a global optimal solution is difficult. On the other hand, there exists a simple convex relaxation for which convergence to the global optimum is straightforward, however corresponding algorithms are not efficient when the dimension is very large. In this work we present a natural deterministic sufficient condition so that the optimal solution to the convex relaxation is unique and is also the optimal solution to the original nonconvex problem. Mainly, we prove that under this condition, a natural highly-efficient nonconvex gradient method, which we refer to as \textit{gradient orthogonal iteration}, when initialized with a "warm-start", converges linearly for the nonconvex problem. We also establish similar results for the nonconvex projected gradient method, and the Frank-Wolfe method when applied to the convex relaxation. We conclude with empirical evidence on synthetic data which demonstrate the appeal of our approach.
    LCDNet: Deep Loop Closure Detection and Point Cloud Registration for LiDAR SLAM. (arXiv:2103.05056v4 [cs.RO] UPDATED)
    Loop closure detection is an essential component of Simultaneous Localization and Mapping (SLAM) systems, which reduces the drift accumulated over time. Over the years, several deep learning approaches have been proposed to address this task, however their performance has been subpar compared to handcrafted techniques, especially while dealing with reverse loops. In this paper, we introduce the novel LCDNet that effectively detects loop closures in LiDAR point clouds by simultaneously identifying previously visited places and estimating the 6-DoF relative transformation between the current scan and the map. LCDNet is composed of a shared encoder, a place recognition head that extracts global descriptors, and a relative pose head that estimates the transformation between two point clouds. We introduce a novel relative pose head based on the unbalanced optimal transport theory that we implement in a differentiable manner to allow for end-to-end training. Extensive evaluations of LCDNet on multiple real-world autonomous driving datasets show that our approach outperforms state-of-the-art loop closure detection and point cloud registration techniques by a large margin, especially while dealing with reverse loops. Moreover, we integrate our proposed loop closure detection approach into a LiDAR SLAM library to provide a complete mapping system and demonstrate the generalization ability using different sensor setup in an unseen city.
    Diversified Sampling for Batched Bayesian Optimization with Determinantal Point Processes. (arXiv:2110.11665v2 [cs.LG] UPDATED)
    In Bayesian Optimization (BO) we study black-box function optimization with noisy point evaluations and Bayesian priors. Convergence of BO can be greatly sped up by batching, where multiple evaluations of the black-box function are performed in a single round. The main difficulty in this setting is to propose at the same time diverse and informative batches of evaluation points. In this work, we introduce DPP-Batch Bayesian Optimization (DPP-BBO), a universal framework for inducing batch diversity in sampling based BO by leveraging the repulsive properties of Determinantal Point Processes (DPP) to naturally diversify the batch sampling procedure. We illustrate this framework by formulating DPP-Thompson Sampling (DPP-TS) as a variant of the popular Thompson Sampling (TS) algorithm and introducing a Markov Chain Monte Carlo procedure to sample from it. We then prove novel Bayesian simple regret bounds for both classical batched TS as well as our counterpart DPP-TS, with the latter bound being tighter. Our real-world, as well as synthetic, experiments demonstrate improved performance of DPP-BBO over classical batching methods with Gaussian process and Cox process models.
    Deep Probability Estimation. (arXiv:2111.10734v2 [cs.LG] UPDATED)
    Reliable probability estimation is of crucial importance in many real-world applications where there is inherent uncertainty, such as weather forecasting, medical prognosis, or collision avoidance in autonomous vehicles. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous to binary classification, with the important difference that the objective is to estimate probabilities rather than predicting the specific outcome. The goal of this work is to investigate probability estimation from high-dimensional data using deep neural networks. There exist several methods to improve the probabilities generated by these models but they mostly focus on classification problems where the probabilities are related to model uncertainty. In the case of problems with inherent uncertainty, it is challenging to evaluate performance without access to ground-truth probabilities. To address this, we build a synthetic dataset to study and compare different computable metrics. We evaluate existing methods on the synthetic data as well as on three real-world probability estimation tasks, all of which involve inherent uncertainty. We also give a theoretical analysis of a model for high-dimensional probability estimation which reproduces several of the phenomena evinced in our experiments. Finally, we propose a new method for probability estimation using neural networks, which modifies the training process to promote output probabilities that are consistent with empirical probabilities computed from the data. The method outperforms existing approaches on most metrics on the simulated as well as real-world data.
    Semantic features of object concepts generated with GPT-3. (arXiv:2202.03753v1 [cs.CL])
    Semantic features have been playing a central role in investigating the nature of our conceptual representations. Yet the enormous time and effort required to empirically sample and norm features from human raters has restricted their use to a limited set of manually curated concepts. Given recent promising developments with transformer-based language models, here we asked whether it was possible to use such models to automatically generate meaningful lists of properties for arbitrary object concepts and whether these models would produce features similar to those found in humans. To this end, we probed a GPT-3 model to generate semantic features for 1,854 objects and compared automatically-generated features to existing human feature norms. GPT-3 generated many more features than humans, yet showed a similar distribution in the types of generated features. Generated feature norms rivaled human norms in predicting similarity, relatedness, and category membership, while variance partitioning demonstrated that these predictions were driven by similar variance in humans and GPT-3. Together, these results highlight the potential of large language models to capture important facets of human knowledge and yield a new approach for automatically generating interpretable feature sets, thus drastically expanding the potential use of semantic features in psychological and linguistic studies.
    Offline Reinforcement Learning for Mobile Notifications. (arXiv:2202.03867v1 [cs.LG])
    Mobile notification systems have taken a major role in driving and maintaining user engagement for online platforms. They are interesting recommender systems to machine learning practitioners with more sequential and long-term feedback considerations. Most machine learning applications in notification systems are built around response-prediction models, trying to attribute both short-term impact and long-term impact to a notification decision. However, a user's experience depends on a sequence of notifications and attributing impact to a single notification is not always accurate, if not impossible. In this paper, we argue that reinforcement learning is a better framework for notification systems in terms of performance and iteration speed. We propose an offline reinforcement learning framework to optimize sequential notification decisions for driving user engagement. We describe a state-marginalized importance sampling policy evaluation approach, which can be used to evaluate the policy offline and tune learning hyperparameters. Through simulations that approximate the notifications ecosystem, we demonstrate the performance and benefits of the offline evaluation approach as a part of the reinforcement learning modeling approach. Finally, we collect data through online exploration in the production system, train an offline Double Deep Q-Network and launch a successful policy online. We also discuss the practical considerations and results obtained by deploying these policies for a large-scale recommendation system use-case.
    Equivariance versus Augmentation for Spherical Images. (arXiv:2202.03990v1 [cs.LG])
    We analyze the role of rotational equivariance in convolutional neural networks (CNNs) applied to spherical images. We compare the performance of the group equivariant networks known as S2CNNs and standard non-equivariant CNNs trained with an increasing amount of data augmentation. The chosen architectures can be considered baseline references for the respective design paradigms. Our models are trained and evaluated on single or multiple items from the MNIST or FashionMNIST dataset projected onto the sphere. For the task of image classification, which is inherently rotationally invariant, we find that by considerably increasing the amount of data augmentation and the size of the networks, it is possible for the standard CNNs to reach at least the same performance as the equivariant network. In contrast, for the inherently equivariant task of semantic segmentation, the non-equivariant networks are consistently outperformed by the equivariant networks with significantly fewer parameters. We also analyze and compare the inference latency and training times of the different networks, enabling detailed tradeoff considerations between equivariant architectures and data augmentation for practical problems. The equivariant spherical networks used in the experiments will be made available at https://github.com/JanEGerken/sem_seg_s2cnn .
    Novel Techniques to Assess Predictive Systems and Reduce Their Alarm Burden. (arXiv:2102.05691v2 [cs.LG] UPDATED)
    Machine prediction algorithms (e.g., binary classifiers) often are adopted on the basis of claimed performance using classic metrics such as sensitivity and predictive value. However, classifier performance depends heavily upon the context (workflow) in which the classifier operates. Classic metrics do not reflect the realized utility of a predictor unless certain implicit assumptions are met, and these assumptions cannot be met in many common clinical scenarios. This often results in suboptimal implementations and in disappointment when expected outcomes are not achieved. One common failure mode for classic metrics arises when multiple predictions can be made for the same event, particularly when redundant true positive predictions produce little additional value. This describes many clinical alerting systems. We explain why classic metrics cannot correctly represent predictor performance in such contexts, and introduce an improved performance assessment technique using utility functions to score predictions based on their utility in a specific workflow context. The resulting utility metrics (u-metrics) explicitly account for the effects of temporal relationships on prediction utility. Compared to traditional measures, u-metrics more accurately reflect the real world costs and benefits of a predictor operating in a live clinical context. The improvement can be significant. We also describe a formal approach to snoozing, a mitigation strategy in which some predictions are suppressed to improve predictor performance by reducing false positives while retaining event capture. Snoozing is especially useful for predictors that generate interruptive alarms. U-metrics correctly measure and predict the performance benefits of snoozing, whereas traditional metrics do not.
    Modeling Structure with Undirected Neural Networks. (arXiv:2202.03760v1 [cs.LG])
    Neural networks are powerful function estimators, leading to their status as a paradigm of choice for modeling structured data. However, unlike other structured representations that emphasize the modularity of the problem -- e.g., factor graphs -- neural networks are usually monolithic mappings from inputs to outputs, with a fixed computation order. This limitation prevents them from capturing different directions of computation and interaction between the modeled variables. In this paper, we combine the representational strengths of factor graphs and of neural networks, proposing undirected neural networks (UNNs): a flexible framework for specifying computations that can be performed in any order. For particular choices, our proposed models subsume and extend many existing architectures: feed-forward, recurrent, self-attention networks, auto-encoders, and networks with implicit layers. We demonstrate the effectiveness of undirected neural architectures, both unstructured and structured, on a range of tasks: tree-constrained dependency parsing, convolutional image classification, and sequence completion with attention. By varying the computation order, we show how a single UNN can be used both as a classifier and a prototype generator, and how it can fill in missing parts of an input sequence, making them a promising field for further research.
    Optimal Transport of Binary Classifiers to Fairness. (arXiv:2202.03814v1 [cs.LG])
    Much of the past work on fairness in machine learning has focused on forcing the predictions of classifiers to have similar statistical properties for individuals of different demographics. Yet, such methods often simply perform a rescaling of the classifier scores and ignore whether individuals of different groups have similar features. Our proposed method, Optimal Transport to Fairness (OTF), applies Optimal Transport (OT) to take this similarity into account by quantifying unfairness as the smallest cost of OT between a classifier and any score function that satisfies fairness constraints. For a flexible class of linear fairness constraints, we show a practical way to compute OTF as an unfairness cost term that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an effective trade-off between predictive power and fairness.
    Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification. (arXiv:2110.14764v2 [cs.CL] UPDATED)
    \emph{Funnelling} (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a metaclassifier that uses this vector as its input. The metaclassifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLTC systems in which these correlations cannot be brought to bear. In this paper we describe \emph{Generalized Funnelling} (gFun), a generalization of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary \emph{view-generating functions}, i.e., language-dependent functions that each produce a language-independent representation ("view") of the (monolingual) document. We describe an instance of gFun in which the metaclassifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated to other embedded representations that embody other types of correlations, such as word-class correlations (as encoded by \emph{Word-Class Embeddings}), word-word correlations (as encoded by \emph{Multilingual Unsupervised or Supervised Embeddings}), and word-context correlations (as encoded by \emph{multilingual BERT}). We show that this instance of \textsc{gFun} substantially improves over Fun and over state-of-the-art baselines, by reporting experimental results obtained on two large, standard datasets for multilingual multilabel text classification. Our code that implements gFun is publicly available.
    Advantage of Deep Neural Networks for Estimating Functions with Singularity on Hypersurfaces. (arXiv:2011.02256v2 [stat.ML] UPDATED)
    We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fill this gap by considering the estimation for a class of non-smooth functions that have singularities on hypersurfaces. Our findings are as follows: (i) We derive the generalization error of a DNN estimator and prove that its convergence rate is almost optimal. (ii) We elucidate a phase diagram of estimation problems, which describes the situations where the DNNs outperform a general class of estimators, including kernel methods, Gaussian process methods, and others. We additionally show that DNNs outperform harmonic analysis based estimators. This advantage of DNNs comes from the fact that a shape of singularity can be successfully handled by their multi-layered structure.
    Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path. (arXiv:2106.02073v3 [cs.LG] UPDATED)
    The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSE-NC and CE-NC: at https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.
    Towards automated Capability Assessment leveraging Deep Learning. (arXiv:2202.04051v1 [cs.GR])
    Aiming for a higher economic efficiency in manufacturing, an increased degree of automation is a key enabler. However, assessing the technical feasibility of an automated assembly solution for a dedicated process is difficult and often determined by the geometry of the given product parts. Among others, decisive criterions of the automation feasibility are the ability to separate and isolate single parts or the capability of component self-alignment in final position. To assess the feasibility, a questionnaire based evaluation scheme has been developed and applied by Fraunhofer researchers. However, the results strongly depend on the implicit knowledge and experience of the single engineer performing the assessment. This paper presents NeuroCAD, a software tool that automates the assessment using voxelization techniques. The approach enables the assessment of abstract and production relevant geometries features through deep-learning based on CAD files.
    X-Class: Text Classification with Extremely Weak Supervision. (arXiv:2010.12794v2 [cs.CL] UPDATED)
    In this paper, we explore text classification with extremely weak supervision, i.e., only relying on the surface text of class names. This is a more challenging setting than the seed-driven weak supervision, which allows a few seed words per class. We opt to attack this problem from a representation learning perspective -- ideal document representations should lead to nearly the same results between clustering and the desired classification. In particular, one can classify the same corpus differently (e.g., based on topics and locations), so document representations should be adaptive to the given class names. We propose a novel framework X-Class to realize the adaptive representations. Specifically, we first estimate class representations by incrementally adding the most similar word to each class until inconsistency arises. Following a tailored mixture of class attention mechanisms, we obtain the document representation via a weighted average of contextualized word representations. With the prior of each document assigned to its nearest class, we then cluster and align the documents to classes. Finally, we pick the most confident documents from each cluster to train a text classifier. Extensive experiments demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods on 7 benchmark datasets. Our dataset and code are released at https://github.com/ZihanWangKi/XClass/ .
    Batch size-invariance for policy optimization. (arXiv:2110.00641v2 [cs.LG] UPDATED)
    We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
    A Theory of the Inductive Bias and Generalization of Kernel Regression and Wide Neural Networks. (arXiv:2110.03922v3 [cs.LG] UPDATED)
    Kernel regression is an important nonparametric learning algorithm with an equivalence to neural networks in the infinite-width limit. Understanding its generalization behavior is thus an important task for machine learning theory. In this work, we provide a theory of the inductive bias and generalization of kernel regression using a new measure characterizing the "learnability" of a given target function. We prove that a kernel's inductive bias can be characterized as a fixed budget of learnability, allocated to its eigenmodes, that can only be increased with the addition of more training data. We then use this rule to derive expressions for the mean and covariance of the predicted function and gain insight into the overfitting and adversarial robustness of kernel regression and the hardness of the classic parity problem. We show agreement between our theoretical results and both kernel regression and wide finite networks on real and synthetic learning tasks.
    skrl: Modular and Flexible Library for Reinforcement Learning. (arXiv:2202.03825v1 [cs.LG])
    skrl is an open-source modular library for reinforcement learning written in Python and designed with a focus on readability, simplicity, and transparency of algorithm implementations. Apart from supporting environments that use the traditional OpenAI Gym interface, it allows loading, configuring, and operating NVIDIA Isaac Gym environments, enabling the parallel training of several agents with adjustable scopes, which may or may not share resources, in the same execution. The library's documentation can be found at https://skrl.readthedocs.io and its source code is available on GitHub at url{https://github.com/Toni-SM/skrl.
    Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters. (arXiv:2202.03813v1 [stat.ML])
    This paper introduces a novel and generic framework to solve the flagship task of supervised labeled graph prediction by leveraging Optimal Transport tools. We formulate the problem as regression with the Fused Gromov-Wasserstein (FGW) loss and propose a predictive model relying on a FGW barycenter whose weights depend on inputs. First we introduce a non-parametric estimator based on kernel ridge regression for which theoretical results such as consistency and excess risk bound are proved. Next we propose an interpretable parametric model where the barycenter weights are modeled with a neural network and the graphs on which the FGW barycenter is calculated are additionally learned. Numerical experiments show the strength of the method and its ability to interpolate in the labeled graph space on simulated data and on a difficult metabolic identification problem where it can reach very good performance with very little engineering.
    KENN: Enhancing Deep Neural Networks by Leveraging Knowledge for Time Series Forecasting. (arXiv:2202.03903v1 [cs.LG])
    End-to-end data-driven machine learning methods often have exuberant requirements in terms of quality and quantity of training data which are often impractical to fulfill in real-world applications. This is specifically true in time series domain where problems like disaster prediction, anomaly detection, and demand prediction often do not have a large amount of historical data. Moreover, relying purely on past examples for training can be sub-optimal since in doing so we ignore one very important domain i.e knowledge, which has its own distinct advantages. In this paper, we propose a novel knowledge fusion architecture, Knowledge Enhanced Neural Network (KENN), for time series forecasting that specifically aims towards combining strengths of both knowledge and data domains while mitigating their individual weaknesses. We show that KENN not only reduces data dependency of the overall framework but also improves performance by producing predictions that are better than the ones produced by purely knowledge and data driven domains. We also compare KENN with state-of-the-art forecasting methods and show that predictions produced by KENN are significantly better even when trained on only 50\% of the data.
    OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms. (arXiv:2006.05714v3 [cs.LG] UPDATED)
    Local Interpretable Model-Agnostic Explanations (LIME) is a popular method to perform interpretability of any kind of Machine Learning (ML) model. It explains one ML prediction at a time, by learning a simple linear model around the prediction. The model is trained on randomly generated data points, sampled from the training dataset distribution and weighted according to the distance from the reference point - the one being explained by LIME. Feature selection is applied to keep only the most important variables. LIME is widespread across different domains, although its instability - a single prediction may obtain different explanations - is one of the major shortcomings. This is due to the randomness in the sampling step, as well as to the flexibility in tuning the weights and determines a lack of reliability in the retrieved explanations, making LIME adoption problematic. In Medicine especially, clinical professionals trust is mandatory to determine the acceptance of an explainable algorithm, considering the importance of the decisions at stake and the related legal issues. In this paper, we highlight a trade-off between explanation's stability and adherence, namely how much it resembles the ML model. Exploiting our innovative discovery, we propose a framework to maximise stability, while retaining a predefined level of adherence. OptiLIME provides freedom to choose the best adherence-stability trade-off level and more importantly, it clearly highlights the mathematical properties of the retrieved explanation. As a result, the practitioner is provided with tools to decide whether the explanation is reliable, according to the problem at hand. We extensively test OptiLIME on a toy dataset - to present visually the geometrical findings - and a medical dataset. In the latter, we show how the method comes up with meaningful explanations both from a medical and mathematical standpoint.
    Verification-Aided Deep Ensemble Selection. (arXiv:2202.03898v1 [cs.LG])
    Deep neural networks (DNNs) have become the technology of choice for realizing a variety of complex tasks. However, as highlighted by many recent studies, even an imperceptible perturbation to a correctly classified input can lead to misclassification by a DNN. This renders DNNs vulnerable to strategic input manipulations by attackers, and also prone to oversensitivity to environmental noise. To mitigate this phenomenon, practitioners apply joint classification by an ensemble of DNNs. By aggregating the classification outputs of different individual DNNs for the same input, ensemble-based classification reduces the risk of misclassifications due to the specific realization of the stochastic training process of any single DNN. However, the effectiveness of a DNN ensemble is highly dependent on its members not simultaneously erring on many different inputs. In this case study, we harness recent advances in DNN verification to devise a methodology for identifying ensemble compositions that are less prone to simultaneous errors, even when the input is adversarially perturbed -- resulting in more robustly-accurate ensemble-based classification. Our proposed framework uses a DNN verifier as a backend, and includes heuristics that help reduce the high complexity of directly verifying ensembles. More broadly, our work puts forth a novel universal objective for formal verification that can potentially improve the robustness of real-world, deep-learning-based systems across a variety of application domains.
    Time Series Anomaly Detection by Cumulative Radon Features. (arXiv:2202.04067v1 [cs.LG])
    Detecting anomalous time series is key for scientific, medical and industrial tasks, but is challenging due to its inherent unsupervised nature. In recent years, progress has been made on this task by learning increasingly more complex features, often using deep neural networks. In this work, we argue that shallow features suffice when combined with distribution distance measures. Our approach models each time series as a high dimensional empirical distribution of features, where each time-point constitutes a single sample. Modeling the distance between a test time series and the normal training set therefore requires efficiently measuring the distance between multivariate probability distributions. We show that by parameterizing each time series using cumulative Radon features, we are able to efficiently and effectively model the distribution of normal time series. Our theoretically grounded but simple-to-implement approach is evaluated on multiple datasets and shown to achieve better results than established, classical methods as well as complex, state-of-the-art deep learning methods. Code is provided.
    Detecting Anomalies within Time Series using Local Neural Transformations. (arXiv:2202.03944v1 [cs.LG])
    We develop a new method to detect anomalies within time series, which is essential in many application domains, reaching from self-driving cars, finance, and marketing to medical diagnosis and epidemiology. The method is based on self-supervised deep learning that has played a key role in facilitating deep anomaly detection on images, where powerful image transformations are available. However, such transformations are widely unavailable for time series. Addressing this, we develop Local Neural Transformations(LNT), a method learning local transformations of time series from data. The method produces an anomaly score for each time step and thus can be used to detect anomalies within time series. We prove in a theoretical analysis that our novel training objective is more suitable for transformation learning than previous deep Anomaly detection(AD) methods. Our experiments demonstrate that LNT can find anomalies in speech segments from the LibriSpeech data set and better detect interruptions to cyber-physical systems than previous work. Visualization of the learned transformations gives insight into the type of transformations that LNT learns.
    Reinforcement Learning, Bit by Bit. (arXiv:2103.04047v6 [cs.LG] UPDATED)
    Reinforcement learning agents have demonstrated remarkable achievements in simulated environments. Data efficiency poses an impediment to carrying this success over to real environments. The design of data-efficient agents calls for a deeper understanding of information acquisition and representation. We develop concepts and establish a regret bound that together offer principled guidance. The bound sheds light on questions of what information to seek, how to seek that information, and it what information to retain. To illustrate concepts, we design simple agents that build on them and present computational results that demonstrate improvements in data efficiency.
    A Survey on Active Deep Learning: From Model-driven to Data-driven. (arXiv:2101.09933v3 [cs.LG] UPDATED)
    Which samples should be labelled in a large data set is one of the most important problems for trainingof deep learning. So far, a variety of active sample selection strategies related to deep learning havebeen proposed in many literatures. We defined them as Active Deep Learning (ADL) only if theirpredictor is deep model, where the basic learner is called as predictor and the labeling schemes iscalled selector. In this survey, three fundamental factors in selector designation were summarized. Wecategory ADL into model-driven ADL and data-driven ADL, by whether its selector is model-drivenor data-driven. The different characteristics of the two major type of ADL were addressed in indetail respectively. Furthermore, different sub-classes of data-driven and model-driven ADL are alsosummarized and discussed emphatically. The advantages and disadvantages between data-driven ADLand model-driven ADL are thoroughly analyzed. We pointed out that, with the development of deeplearning, the selector in ADL also is experiencing the stage from model-driven to data-driven. Finally,we make discussion on ADL about its uncertainty, explanatory, foundations of cognitive science etc.and survey on the trend of ADL from model-driven to data-driven.
    Sea Ice Forecasting using Attention-based Ensemble LSTM. (arXiv:2108.00853v2 [physics.ao-ph] UPDATED)
    Accurately forecasting Arctic sea ice from subseasonal to seasonal scales has been a major scientific effort with fundamental challenges at play. In addition to physics-based earth system models, researchers have been applying multiple statistical and machine learning models for sea ice forecasting. Looking at the potential of data-driven sea ice forecasting, we propose an attention-based Long Short Term Memory (LSTM) ensemble method to predict monthly sea ice extent up to 1 month ahead. Using daily and monthly satellite retrieved sea ice data from NSIDC and atmospheric and oceanic variables from ERA5 reanalysis product for 39 years, we show that our multi-temporal ensemble method outperforms several baseline and recently proposed deep learning models. This will substantially improve our ability in predicting future Arctic sea ice changes, which is fundamental for forecasting transporting routes, resource development, coastal erosion, threats to Arctic coastal communities and wildlife.
    Can We Generate Shellcodes via Natural Language? An Empirical Study. (arXiv:2202.03755v1 [cs.SE])
    Writing software exploits is an important practice for offensive security analysts to investigate and prevent attacks. In particular, shellcodes are especially time-consuming and a technical challenge, as they are written in assembly language. In this work, we address the task of automatically generating shellcodes, starting purely from descriptions in natural language, by proposing an approach based on Neural Machine Translation (NMT). We then present an empirical study using a novel dataset (Shellcode_IA32), which consists of 3,200 assembly code snippets of real Linux/x86 shellcodes from public databases, annotated using natural language. Moreover, we propose novel metrics to evaluate the accuracy of NMT at generating shellcodes. The empirical analysis shows that NMT can generate assembly code snippets from the natural language with high accuracy and that in many cases can generate entire shellcodes with no errors.
    Self-Conditioned Generative Adversarial Networks for Image Editing. (arXiv:2202.04040v1 [cs.CV])
    Generative Adversarial Networks (GANs) are susceptible to bias, learned from either the unbalanced data, or through mode collapse. The networks focus on the core of the data distribution, leaving the tails - or the edges of the distribution - behind. We argue that this bias is responsible not only for fairness concerns, but that it plays a key role in the collapse of latent-traversal editing methods when deviating away from the distribution's core. Building on this observation, we outline a method for mitigating generative bias through a self-conditioning process, where distances in the latent-space of a pre-trained generator are used to provide initial labels for the data. By fine-tuning the generator on a re-sampled distribution drawn from these self-labeled data, we force the generator to better contend with rare semantic attributes and enable more realistic generation of these properties. We compare our models to a wide range of latent editing methods, and show that by alleviating the bias they achieve finer semantic control and better identity preservation through a wider range of transformations. Our code and models will be available at https://github.com/yzliu567/sc-gan
    On Completeness-aware Concept-Based Explanations in Deep Neural Networks. (arXiv:1910.07969v6 [cs.LG] UPDATED)
    Human explanations of high-level decisions are often expressed in terms of key concepts the decisions are based on. In this paper, we study such concept-based explainability for Deep Neural Networks (DNNs). First, we define the notion of completeness, which quantifies how sufficient a particular set of concepts is in explaining a model's prediction behavior based on the assumption that complete concept scores are sufficient statistics of the model prediction. Next, we propose a concept discovery method that aims to infer a complete set of concepts that are additionally encouraged to be interpretable, which addresses the limitations of existing methods on concept explanations. To define an importance score for each discovered concept, we adapt game-theoretic notions to aggregate over sets and propose ConceptSHAP. Via proposed metrics and user studies, on a synthetic dataset with apriori-known concept explanations, as well as on real-world image and language datasets, we validate the effectiveness of our method in finding concepts that are both complete in explaining the decisions and interpretable. (The code is released at https://github.com/chihkuanyeh/concept_exp)
    "Average" Approximates "First Principal Component"? An Empirical Analysis on Representations from Neural Language Models. (arXiv:2104.08673v2 [cs.CL] UPDATED)
    Contextualized representations based on neural language models have furthered the state of the art in various NLP tasks. Despite its great success, the nature of such representations remains a mystery. In this paper, we present an empirical property of these representations -- "average" approximates "first principal component". Specifically, experiments show that the average of these representations shares almost the same direction as the first principal component of the matrix whose columns are these representations. We believe this explains why the average representation is always a simple yet strong baseline. Our further examinations show that this property also holds in more challenging scenarios, for example, when the representations are from a model right after its random initialization. Therefore, we conjecture that this property is intrinsic to the distribution of representations and not necessarily related to the input structure. We realize that these representations empirically follow a normal distribution for each dimension, and by assuming this is true, we demonstrate that the empirical property can be in fact derived mathematically.
    Deep learning fluid flow reconstruction around arbitrary two-dimensional objects from sparse sensors using conformal mappings. (arXiv:2202.03798v1 [physics.flu-dyn])
    The usage of deep neural networks (DNNs) for flow reconstruction (FR) tasks from a limited number of sensors is attracting strong research interest, owing to DNNs' ability to replicate very high dimensional relationships. Trained over a single flow case for a given Reynolds number or over a reduced range of Reynolds numbers, these models are unfortunately not able to handle fluid flows around different objects without re-training. In this work, we propose a new framework called Spatial Multi-Geometry FR (SMGFR) task, capable of reconstructing fluid flows around different two-dimensional objects without re-training, mapping the computational domain as an annulus. Different DNNs for different sensor setups (where information about the flow is collected) are trained with high-fidelity simulation data for a Reynolds number equal to approximately $300$ for 64 objects randomly generated using Bezier curves. The performance of the models and sensor setups are then assessed for the flow around 16 unseen objects. It is shown that our mapping approach improves percentage errors by up to 15\% in SMGFR when compared to a more conventional approach where the models are trained on a Cartesian grid. Finally, the SMGFR task is extended to predictions of fluid flow snapshots in the future, introducing the Spatio-temporal MGFR (STMGFR) task. For this spatio-temporal reconstruction task, a novel approach is developed involving splitting DNNs into a spatial and a temporal component. Our results demonstrate that this approach is able to reproduce, in time and in space, the main features of a fluid flow around arbitrary objects.
    Pruned Neural Networks are Surprisingly Modular. (arXiv:2003.04881v6 [cs.NE] UPDATED)
    The learned weights of a neural network are often considered devoid of scrutable internal structure. To discern structure in these weights, we introduce a measurable notion of modularity for multi-layer perceptrons (MLPs), and investigate the modular structure of MLPs trained on datasets of small images. Our notion of modularity comes from the graph clustering literature: a "module" is a set of neurons with strong internal connectivity but weak external connectivity. We find that training and weight pruning produces MLPs that are more modular than randomly initialized ones, and often significantly more modular than random MLPs with the same (sparse) distribution of weights. Interestingly, they are much more modular when trained with dropout. We also present exploratory analyses of the importance of different modules for performance and how modules depend on each other. Understanding the modular structure of neural networks, when such structure exists, will hopefully render their inner workings more interpretable to engineers. Note that this paper has been superceded by "Clusterability in Neural Networks", arxiv:2103.03386 and "Quantifying Local Specialization in Deep Neural Networks", arxiv:2110.08058!
    Provable Reinforcement Learning with a Short-Term Memory. (arXiv:2202.03983v1 [cs.LG])
    Real-world sequential decision making problems commonly involve partial observability, which requires the agent to maintain a memory of history in order to infer the latent states, plan and make good decisions. Coping with partial observability in general is extremely challenging, as a number of worst-case statistical and computational barriers are known in learning Partially Observable Markov Decision Processes (POMDPs). Motivated by the problem structure in several physical applications, as well as a commonly used technique known as "frame stacking", this paper proposes to study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length $m$. We establish a set of upper and lower bounds on the sample complexity for learning near-optimal policies for this class of problems in both tabular and rich-observation settings (where the number of observations is enormous). In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short length $m$ rather than the problem horizon, and is independent of the number of observations. Our results show that a short-term memory suffices for reinforcement learning in these environments.
    Accelerometer-based Bed Occupancy Detection for Automatic, Non-invasive Long-term Cough Monitoring. (arXiv:2202.03936v1 [cs.LG])
    We present a machine learning based long-term cough monitoring system by detecting patient's bed occupancy from a bed-attached smartphone-inbuilt accelerometer automatically. Previously this system was used to detect cough events successfully and long-term cough monitoring requires bed occupancy detection, as the initial experiments show that patients leave their bed very often for long period of time and using video-monitoring or pressure sensors are not patient-favourite alternatives. We have compiled a 249-hour dataset of manually-labelled acceleration signals gathered from seven adult male patients undergoing treatment for tuberculosis (TB). The bed occupancy detection process consists of three detectors, among which the first one classifies occupancy-change with high sensitivity, low specificity and the second one classifies occupancy-interval with high specificity, low sensitivity. The final state detector corrects the miss-classified sections. After using a leave-one-patient-out cross-validation scheme to train and evaluate four classifiers such as LR, MLP, CNN and LSTM; LSTM produces the highest area under the curve (AUC) of 0.94 while comparing the predicted bed occupancy as the output from the final state detector with the actual bed occupancy sample by sample. We have also calculated colony forming unit and time to positivity of the sputum samples of TB positive patients who were monitored for 14 days and the proposed system was used to predict daily cough rates. The results show that patients who improve under TB treatment have decreasing daily cough rates, indicating the proposed automatic, quick, non-invasive, non-intrusive, cost-effective long-term cough monitoring system can be extremely useful in monitoring patients' recovery rate.
    Finite-Sum Optimization: A New Perspective for Convergence to a Global Solution. (arXiv:2202.03524v1 [cs.LG])
    Deep neural networks (DNNs) have shown great success in many machine learning tasks. Their training is challenging since the loss surface of the network architecture is generally non-convex, or even non-smooth. How and under what assumptions is guaranteed convergence to a \textit{global} minimum possible? We propose a reformulation of the minimization problem allowing for a new recursive algorithmic framework. By using bounded style assumptions, we prove convergence to an $\varepsilon$-(global) minimum using $\mathcal{\tilde{O}}(1/\varepsilon^3)$ gradient computations. Our theoretical foundation motivates further study, implementation, and optimization of the new algorithmic framework and further investigation of its non-standard bounded style assumptions. This new direction broadens our understanding of why and under what circumstances training of a DNN converges to a global minimum.  ( 2 min )
    Backdoor Defense via Decoupling the Training Process. (arXiv:2202.03423v1 [cs.CR])
    Recent studies have revealed that deep neural networks (DNNs) are vulnerable to backdoor attacks, where attackers embed hidden backdoors in the DNN model by poisoning a few training samples. The attacked model behaves normally on benign samples, whereas its prediction will be maliciously changed when the backdoor is activated. We reveal that poisoned samples tend to cluster together in the feature space of the attacked DNN model, which is mostly due to the end-to-end supervised training paradigm. Inspired by this observation, we propose a novel backdoor defense via decoupling the original end-to-end training process into three stages. Specifically, we first learn the backbone of a DNN model via \emph{self-supervised learning} based on training samples without their labels. The learned backbone will map samples with the same ground-truth label to similar locations in the feature space. Then, we freeze the parameters of the learned backbone and train the remaining fully connected layers via standard training with all (labeled) training samples. Lastly, to further alleviate side-effects of poisoned samples in the second stage, we remove labels of some `low-credible' samples determined based on the learned model and conduct a \emph{semi-supervised fine-tuning} of the whole model. Extensive experiments on multiple benchmark datasets and DNN models verify that the proposed defense is effective in reducing backdoor threats while preserving high accuracy in predicting benign samples. Our code is available at \url{https://github.com/SCLBD/DBD}.  ( 2 min )
    On learning Whittle index policy for restless bandits with scalable regret. (arXiv:2202.03463v1 [cs.LG])
    Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies based on data when the system model is unknown. However, the cumulative regret of most RL algorithms scales as $\tilde O(\mathsf{S} \sqrt{\mathsf{A} T})$, where $\mathsf{S}$ is the size of the state space, $\mathsf{A}$ is the size of the action space, $T$ is the horizon, and the $\tilde{O}(\cdot)$ notation hides logarithmic terms. Due to the linear dependence on the size of the state space, these regret bounds are prohibitively large for resource allocation and scheduling problems. In this paper, we present a model-based RL algorithm for such problem which has scalable regret. In particular, we consider a restless bandit model, and propose a Thompson-sampling based learning algorithm which is tuned to the underlying structure of the model. We present two characterizations of the regret of the proposed algorithm with respect to the Whittle index policy. First, we show that for a restless bandit with $n$ arms and at most $m$ activations at each time, the regret scales either as $\tilde{O}(mn\sqrt{T})$ or $\tilde{O}(n^2 \sqrt{T})$ depending on the reward model. Second, under an additional technical assumption, we show that the regret scales as $\tilde{O}(n^{1.5} \sqrt{T})$. We present numerical examples to illustrate the salient features of the algorithm.  ( 2 min )
    Flexibly Regularized Mixture Models and Application to Image Segmentation. (arXiv:1905.10629v3 [cs.CV] UPDATED)
    Probabilistic finite mixture models are widely used for unsupervised clustering. These models can often be improved by adapting them to the topology of the data. For instance, in order to classify spatially adjacent data points similarly, it is common to introduce a Laplacian constraint on the posterior probability that each data point belongs to a class. Alternatively, the mixing probabilities can be treated as free parameters, while assuming Gauss-Markov or more complex priors to regularize those mixing probabilities. However, these approaches are constrained by the shape of the prior and often lead to complicated or intractable inference. Here, we propose a new parametrization of the Dirichlet distribution to flexibly regularize the mixing probabilities of over-parametrized mixture distributions. Using the Expectation-Maximization algorithm, we show that our approach allows us to define any linear update rule for the mixing probabilities, including spatial smoothing regularization as a special case. We then show that this flexible design can be extended to share class information between multiple mixture models. We apply our algorithm to artificial and natural image segmentation tasks, and we provide quantitative and qualitative comparison of the performance of Gaussian and Student-t mixtures on the Berkeley Segmentation Dataset. We also demonstrate how to propagate class information across the layers of deep convolutional neural networks in a probabilistically optimal way, suggesting a new interpretation for feedback signals in biological visual systems. Our flexible approach can be easily generalized to adapt probabilistic mixture models to arbitrary data topologies.  ( 2 min )
    Learnings from Federated Learning in the Real world. (arXiv:2202.03925v1 [cs.LG])
    Federated Learning (FL) applied to real world data may suffer from several idiosyncrasies. One such idiosyncrasy is the data distribution across devices. Data across devices could be distributed such that there are some "heavy devices" with large amounts of data while there are many "light users" with only a handful of data points. There also exists heterogeneity of data across devices. In this study, we evaluate the impact of such idiosyncrasies on Natural Language Understanding (NLU) models trained using FL. We conduct experiments on data obtained from a large scale NLU system serving thousands of devices and show that simple non-uniform device selection based on the number of interactions at each round of FL training boosts the performance of the model. This benefit is further amplified in continual FL on consecutive time periods, where non-uniform sampling manages to swiftly catch up with FL methods using all data at once.  ( 2 min )
    Coordinate Descent Methods for DC Minimization. (arXiv:2109.04228v2 [math.OC] UPDATED)
    Difference-of-Convex (DC) minimization, referring to the problem of minimizing the difference of two convex functions, has been found rich applications in statistical learning and studied extensively for decades. However, existing methods are primarily based on multi-stage convex relaxation, only leading to weak optimality of critical points. This paper proposes a coordinate descent method for minimizing DC functions based on sequential nonconvex approximation. Our approach iteratively solves a nonconvex one-dimensional subproblem globally, and it is guaranteed to converge to a coordinate-wise stationary point. We prove that this new optimality condition is always stronger than the critical point condition and the directional point condition when the objective function is weakly convex. For comparisons, we also include a naive variant of coordinate descent methods based on sequential convex approximation in our study. When the objective function satisfies an additional regularity condition called \textit{sharpness}, coordinate descent methods with an appropriate initialization converge \textit{linearly} to the optimal solution set. Also, for many applications of interest, we show that the nonconvex one-dimensional subproblem can be computed exactly and efficiently using a breakpoint searching method. Finally, we have conducted extensive experiments on several statistical learning tasks to show the superiority of our approach. Keywords: Coordinate Descent, DC Minimization, DC Programming, Difference-of-Convex Programs, Nonconvex Optimization, Sparse Optimization, Binary Optimization.  ( 2 min )
    Physics-informed neural networks for solving parametric magnetostatic problems. (arXiv:2202.04041v1 [cs.CE])
    The optimal design of magnetic devices becomes intractable using current computational methods when the number of design parameters is high. The emerging physics-informed deep learning framework has the potential to alleviate this curse of dimensionality. The objective of this paper is to investigate the ability of physics-informed neural networks to learn the magnetic field response as a function of design parameters in the context of a two-dimensional (2-D) magnetostatic problem. Our approach is as follows. We derive the variational principle for 2-D parametric magnetostatic problems, and prove the existence and uniqueness of the solution that satisfies the equations of the governing physics, i.e., Maxwell's equations. We use a deep neural network (DNN) to represent the magnetic field as a function of space and a total of ten parameters that describe geometric features and operating point conditions. We train the DNN by minimizing the physics-informed loss function using a variant of stochastic gradient descent. Subsequently, we conduct systematic numerical studies using a parametric EI-core electromagnet problem. In these studies, we vary the DNN architecture trying more than one hundred different possibilities. For each study, we evaluate the accuracy of the DNN by comparing its predictions to those of finite element analysis. In an exhaustive non-parametric study, we observe that sufficiently parameterized dense networks result in relative errors of less than 1%. Residual connections always improve relative errors for the same number of training iterations. Also, we observe that Fourier encoding features aligned with the device geometry do improve the rate of convergence, albeit higher-order harmonics are not necessary. Finally, we demonstrate our approach on a ten-dimensional problem with parameterized geometry.  ( 2 min )
    GMC -- Geometric Multimodal Contrastive Representation Learning. (arXiv:2202.03390v2 [cs.LG] UPDATED)
    Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem due to the inherent heterogeneity of data obtained from different channels. To address it, we present a novel Geometric Multimodal Contrastive (GMC) representation learning method comprised of two main components: i) a two-level architecture consisting of modality-specific base encoder, allowing to process an arbitrary number of modalities to an intermediate representation of fixed dimensionality, and a shared projection head, mapping the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations. We experimentally demonstrate that GMC representations are semantically rich and achieve state-of-the-art performance with missing modality information on three different learning problems including prediction and reinforcement learning tasks.  ( 2 min )
    Low-Rank Extragradient Method for Nonsmooth and Low-Rank Matrix Optimization Problems. (arXiv:2202.04026v1 [math.OC])
    Low-rank and nonsmooth matrix optimization problems capture many fundamental tasks in statistics and machine learning. While significant progress has been made in recent years in developing efficient methods for \textit{smooth} low-rank optimization problems that avoid maintaining high-rank matrices and computing expensive high-rank SVDs, advances for nonsmooth problems have been slow paced. In this paper we consider standard convex relaxations for such problems. Mainly, we prove that under a natural \textit{generalized strict complementarity} condition and under the relatively mild assumption that the nonsmooth objective can be written as a maximum of smooth functions, the \textit{extragradient method}, when initialized with a "warm-start" point, converges to an optimal solution with rate $O(1/t)$ while requiring only two \textit{low-rank} SVDs per iteration. We give a precise trade-off between the rank of the SVDs required and the radius of the ball in which we need to initialize the method. We support our theoretical results with empirical experiments on several nonsmooth low-rank matrix recovery tasks, demonstrating that using simple initializations, the extragradient method produces exactly the same iterates when full-rank SVDs are replaced with SVDs of rank that matches the rank of the (low-rank) ground-truth matrix to be recovered.  ( 2 min )
    Bingham Policy Parameterization for 3D Rotations in Reinforcement Learning. (arXiv:2202.03957v1 [cs.RO])
    We propose a new policy parameterization for representing 3D rotations during reinforcement learning. Today in the continuous control reinforcement learning literature, many stochastic policy parameterizations are Gaussian. We argue that universally applying a Gaussian policy parameterization is not always desirable for all environments. One such case in particular where this is true are tasks that involve predicting a 3D rotation output, either in isolation, or coupled with translation as part of a full 6D pose output. Our proposed Bingham Policy Parameterization (BPP) models the Bingham distribution and allows for better rotation (quaternion) prediction over a Gaussian policy parameterization in a range of reinforcement learning tasks. We evaluate BPP on the rotation Wahba problem task, as well as a set of vision-based next-best pose robot manipulation tasks from RLBench. We hope that this paper encourages more research into developing other policy parameterization that are more suited for particular environments, rather than always assuming Gaussian.  ( 2 min )
    A Ranking Game for Imitation Learning. (arXiv:2202.03481v1 [cs.LG])
    We propose a new framework for imitation learning - treating imitation as a two-player ranking-based Stackelberg game between a $\textit{policy}$ and a $\textit{reward}$ function. In this game, the reward agent learns to satisfy pairwise performance rankings within a set of policies, while the policy agent learns to maximize this reward. This game encompasses a large subset of both inverse reinforcement learning (IRL) methods and methods which learn from offline preferences. The Stackelberg game formulation allows us to use optimization methods that take the game structure into account, leading to more sample efficient and stable learning dynamics compared to existing IRL methods. We theoretically analyze the requirements of the loss function used for ranking policy performances to facilitate near-optimal imitation learning at equilibrium. We use insights from this analysis to further increase sample efficiency of the ranking game by using automatically generated rankings or with offline annotated rankings. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and is able to solve previously unsolvable tasks in the Learning from Observation (LfO) setting.
    Reward-Respecting Subtasks for Model-Based Reinforcement Learning. (arXiv:2202.03466v1 [cs.LG])
    To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress in state abstraction, but, although the theory of time abstraction has been extensively developed based on the options framework, in practice options have rarely been used in planning. One reason for this is that the space of possible options is immense and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks such as reaching a bottleneck state, or maximizing a sensory signal other than the reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. The subtasks proposed in most previous work ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option stops. We show that options and option models obtained from such reward-respecting subtasks are much more likely to be useful in planning and can be learned online and off-policy using existing learning algorithms. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how the algorithms for learning values, policies, options, and models can be unified using general value functions.
    Universal Spam Detection using Transfer Learning of BERT Model. (arXiv:2202.03480v1 [cs.CL])
    Deep learning transformer models become important by training on text data based on self-attention mechanisms. This manuscript demonstrated a novel universal spam detection model using pre-trained Google's Bidirectional Encoder Representations from Transformers (BERT) base uncased models with four datasets by efficiently classifying ham or spam emails in real-time scenarios. Different methods for Enron, Spamassain, Lingspam, and Spamtext message classification datasets, were used to train models individually in which a single model was obtained with acceptable performance on four datasets. The Universal Spam Detection Model (USDM) was trained with four datasets and leveraged hyperparameters from each model. The combined model was finetuned with the same hyperparameters from these four models separately. When each model using its corresponding dataset, an F1-score is at and above 0.9 in individual models. An overall accuracy reached 97%, with an F1 score of 0.96. Research results and implications were discussed.
    Invertible Tabular GANs: Killing Two Birds with OneStone for Tabular Data Synthesis. (arXiv:2202.03636v1 [cs.LG])
    Tabular data synthesis has received wide attention in the literature. This is because available data is often limited, incomplete, or cannot be obtained easily, and data privacy is becoming increasingly important. In this work, we present a generalized GAN framework for tabular synthesis, which combines the adversarial training of GANs and the negative log-density regularization of invertible neural networks. The proposed framework can be used for two distinctive objectives. First, we can further improve the synthesis quality, by decreasing the negative log-density of real records in the process of adversarial training. On the other hand, by increasing the negative log-density of real records, realistic fake records can be synthesized in a way that they are not too much close to real records and reduce the chance of potential information leakage. We conduct experiments with real-world datasets for classification, regression, and privacy attacks. In general, the proposed method demonstrates the best synthesis quality (in terms of task-oriented evaluation metrics, e.g., F1) when decreasing the negative log-density during the adversarial training. If increasing the negative log-density, our experimental results show that the distance between real and fake records increases, enhancing robustness against privacy attacks.
    On the Convergence of Gradient Extrapolation Methods for Unbalanced Optimal Transport. (arXiv:2202.03618v1 [math.OC])
    We study the Unbalanced Optimal Transport (UOT) between two measures of possibly different masses with at most $n$ components, where marginal constraints of the standard Optimal Transport (OT) are relaxed via Kullback-Leibler divergence with regularization factor $\tau$. We propose a novel algorithm based on Gradient Extrapolation Method (GEM-UOT) to find an $\varepsilon$-approximate solution to the UOT problem in $O\big( \kappa n^2 \log\big(\frac{\tau n}{\varepsilon}\big) \big)$, where $\kappa$ is the condition number depending on only the two input measures. Compared to the only known complexity ${O}\big(\tfrac{\tau n^2 \log(n)}{\varepsilon} \log\big(\tfrac{\log(n)}{{\varepsilon}}\big)\big)$ for solving the UOT problem via the Sinkhorn algorithm, ours is better in $\varepsilon$ and lifts Sinkhorn's linear dependence on $\tau$, which hindered its practicality to approximate the standard OT via UOT. Our proof technique is based on a novel dual formulation of the squared $\ell_2$-norm regularized UOT objective, which is of independent interest and also leads to a new characterization of approximation error between UOT and OT in terms of both the transportation plan and transport distance. To this end, we further present an algorithm, based on GEM-UOT with fine tuned $\tau$ and a post-process projection step, to find an $\varepsilon$-approximate solution to the standard OT problem in $O\big( \kappa n^2 \log\big(\frac{ n}{\varepsilon}\big) \big)$, which is a new complexity in the literature of OT. Extensive experiments on synthetic and real datasets validate our theories and demonstrate the favorable performance of our methods in practice.
    Structured Time Series Prediction without Structural Prior. (arXiv:2202.03539v1 [cs.LG])
    Time series prediction is a widespread and well studied problem with applications in many domains (medical, geoscience, network analysis, finance, econometry etc.). In the case of multivariate time series, the key to good performances is to properly capture the dependencies between the variates. Often, these variates are structured, i.e. they are localised in an abstract space, usually representing an aspect of the physical world, and prediction amounts to a form of diffusion of the information across that space over time. Several neural network models of diffusion have been proposed in the literature. However, most of the existing proposals rely on some a priori knowledge on the structure of the space, usually in the form of a graph weighing the pairwise diffusion capacity of its points. We argue that this piece of information can often be dispensed with, since data already contains the diffusion capacity information, and in a more reliable form than that obtained from the usually largely hand-crafted graphs. We propose instead a fully data-driven model which does not rely on such a graph, nor any other prior structural information. We conduct a first set of experiments to measure the impact on performance of a structural prior, as used in baseline models, and show that, except at very low data levels, it remains negligible, and beyond a threshold, it may even become detrimental. We then investigate, through a second set of experiments, the capacity of our model in two respects: treatment of missing data and domain adaptation.
    Nonmyopic Multiclass Active Search for Diverse Discovery. (arXiv:2202.03593v1 [cs.LG])
    Active search is a setting in adaptive experimental design where we aim to uncover members of rare, valuable class(es) subject to a budget constraint. An important consideration in this problem is diversity among the discovered targets -- in many applications, diverse discoveries offer more insight and may be preferable in downstream tasks. However, most existing active search policies either assume that all targets belong to a common positive class or encourage diversity via simple heuristics. We present a novel formulation of active search with multiple target classes, characterized by a utility function that naturally induces a preference for label diversity among discoveries via a diminishing returns mechanism. We then study this problem under the Bayesian lens and prove a hardness result for approximating the optimal policy. Finally, we propose an efficient, nonmyopic approximation to the optimal policy and demonstrate its superior empirical performance across a wide variety of experimental settings, including drug discovery.
    Nesterov Accelerated Shuffling Gradient Method for Convex Optimization. (arXiv:2202.03525v1 [math.OC])
    In this paper, we propose Nesterov Accelerated Shuffling Gradient (NASG), a new algorithm for the convex finite-sum minimization problems. Our method integrates the traditional Nesterov's acceleration momentum with different shuffling sampling schemes. We show that our algorithm has an improved rate of $\mathcal{O}(1/T)$ using unified shuffling schemes, where $T$ is the number of epochs. This rate is better than that of any other shuffling gradient methods in convex regime. Our convergence analysis does not require an assumption on bounded domain or a bounded gradient condition. For randomized shuffling schemes, we improve the convergence bound further. When employing some initial condition, we show that our method converges faster near the small neighborhood of the solution. Numerical simulations demonstrate the efficiency of our algorithm.
    Maximizing Audio Event Detection Model Performance on Small Datasets Through Knowledge Transfer, Data Augmentation, And Pretraining: An Ablation Study. (arXiv:2202.03514v1 [cs.SD])
    An Xception model reaches state-of-the-art (SOTA) accuracy on the ESC-50 dataset for audio event detection through knowledge transfer from ImageNet weights, pretraining on AudioSet, and an on-the-fly data augmentation pipeline. This paper presents an ablation study that analyzes which components contribute to the boost in performance and training time. A smaller Xception model is also presented which nears SOTA performance with almost a third of the parameters.
    Targeted-BEHRT: Deep learning for observational causal inference on longitudinal electronic health records. (arXiv:2202.03487v1 [cs.LG])
    Observational causal inference is useful for decision making in medicine when randomized clinical trials (RCT) are infeasible or non generalizable. However, traditional approaches fail to deliver unconfounded causal conclusions in practice. The rise of "doubly robust" non-parametric tools coupled with the growth of deep learning for capturing rich representations of multimodal data, offers a unique opportunity to develop and test such models for causal inference on comprehensive electronic health records (EHR). In this paper, we investigate causal modelling of an RCT-established null causal association: the effect of antihypertensive use on incident cancer risk. We develop a dataset for our observational study and a Transformer-based model, Targeted BEHRT coupled with doubly robust estimation, we estimate average risk ratio (RR). We compare our model to benchmark statistical and deep learning models for causal inference in multiple experiments on semi-synthetic derivations of our dataset with various types and intensities of confounding. In order to further test the reliability of our approach, we test our model on situations of limited data. We find that our model provides more accurate estimates of RR (least sum absolute error from ground truth) compared to benchmarks for risk ratio estimation on high-dimensional EHR across experiments. Finally, we apply our model to investigate the original case study: antihypertensives' effect on cancer and demonstrate that our model generally captures the validated null association.
    MAML and ANIL Provably Learn Representations. (arXiv:2202.03483v1 [cs.LG])
    Recent empirical evidence has driven conventional wisdom to believe that gradient-based meta-learning (GBML) methods perform well at few-shot learning because they learn an expressive data representation that is shared across tasks. However, the mechanics of GBML have remained largely mysterious from a theoretical perspective. In this paper, we prove that two well-known GBML methods, MAML and ANIL, as well as their first-order approximations, are capable of learning common representation among a set of given tasks. Specifically, in the well-known multi-task linear representation learning setting, they are able to recover the ground-truth representation at an exponentially fast rate. Moreover, our analysis illuminates that the driving force causing MAML and ANIL to recover the underlying representation is that they adapt the final layer of their model, which harnesses the underlying task diversity to improve the representation in all directions of interest. To the best of our knowledge, these are the first results to show that MAML and/or ANIL learn expressive representations and to rigorously explain why they do so.  ( 2 min )
    Approximating Gradients for Differentiable Quality Diversity in Reinforcement Learning. (arXiv:2202.03666v1 [cs.LG])
    Consider a walking agent that must adapt to damage. To approach this task, we can train a collection of policies and have the agent select a suitable policy when damaged. Training this collection may be viewed as a quality diversity (QD) optimization problem, where we search for solutions (policies) which maximize an objective (walking forward) while spanning a set of measures (measurable characteristics). Recent work shows that differentiable quality diversity (DQD) algorithms greatly accelerate QD optimization when exact gradients are available for the objective and measures. However, such gradients are typically unavailable in RL settings due to non-differentiable environments. To apply DQD in RL settings, we propose to approximate objective and measure gradients with evolution strategies and actor-critic methods. We develop two variants of the DQD algorithm CMA-MEGA, each with different gradient approximations, and evaluate them on four simulated walking tasks. One variant achieves comparable performance (QD score) with the state-of-the-art PGA-MAP-Elites in two tasks. The other variant performs comparably in all tasks but is less efficient than PGA-MAP-Elites in two tasks. These results provide insight into the limitations of CMA-MEGA in domains that require rigorous optimization of the objective and where exact gradients are unavailable.  ( 2 min )
    How to Understand Masked Autoencoders. (arXiv:2202.03670v1 [cs.CV])
    "Masked Autoencoders (MAE) Are Scalable Vision Learners" revolutionizes the self-supervised learning that not only achieves the state-of-the-art for image pretraining, but also is a milestone that bridged the gap between the visual and linguistic masked autoencoding (BERT-style) pretrainings. However, to our knowledge, to date there are no theoretical perspectives to explain the powerful expressivity of MAE. In this paper, we, for the first time, propose a unified theoretical framework that provides a mathematical understanding for MAE. Particularly, we explain the patch-based attention approaches of MAE using an integral kernel under a non-overlapping domain decomposition setting. To help the researchers to further grasp the main reasons of the great success of MAE, based on our framework, we contribute five questions and answer them by insights from operator theory with mathematical rigor.  ( 2 min )
    Trained Model in Supervised Deep Learning is a Conditional Risk Minimizer. (arXiv:2202.03674v1 [cs.LG])
    We proved that a trained model in supervised deep learning minimizes the conditional risk for each input (Theorem 2.1). This property provided insights into the behavior of trained models and established a connection between supervised and unsupervised learning in some cases. In addition, when the labels are intractable but can be written as a conditional risk minimizer, we proved an equivalent form of the original supervised learning problem with accessible labels (Theorem 2.2). We demonstrated that many existing works, such as Noise2Score, Noise2Noise and score function estimation can be explained by our theorem. Moreover, we derived a property of classification problem with noisy labels using Theorem 2.1 and validated it using MNIST dataset. Furthermore, We proposed a method to estimate uncertainty in image super-resolution based on Theorem 2.2 and validated it using ImageNet dataset. Our code is available on github.  ( 2 min )
    Contrastive predictive coding for Anomaly Detection in Multi-variate Time Series Data. (arXiv:2202.03639v1 [cs.LG])
    Anomaly detection in multi-variate time series (MVTS) data is a huge challenge as it requires simultaneous representation of long term temporal dependencies and correlations across multiple variables. More often, this is solved by breaking the complexity through modeling one dependency at a time. In this paper, we propose a Time-series Representational Learning through Contrastive Predictive Coding (TRL-CPC) towards anomaly detection in MVTS data. First, we jointly optimize an encoder, an auto-regressor and a non-linear transformation function to effectively learn the representations of the MVTS data sets, for predicting future trends. It must be noted that the context vectors are representative of the observation window in the MTVS. Next, the latent representations for the succeeding instants obtained through non-linear transformations of these context vectors, are contrasted with the latent representations of the encoder for the multi-variables such that the density for the positive pair is maximized. Thus, the TRL-CPC helps to model the temporal dependencies and the correlations of the parameters for a healthy signal pattern. Finally, fitting the latent representations are fit into a Gaussian scoring function to detect anomalies. Evaluation of the proposed TRL-CPC on three MVTS data sets against SOTA anomaly detection methods shows the superiority of TRL-CPC.  ( 2 min )
    Random Gegenbauer Features for Scalable Kernel Methods. (arXiv:2202.03474v1 [cs.LG])
    We propose efficient random features for approximating a new and rich class of kernel functions that we refer to as Generalized Zonal Kernels (GZK). Our proposed GZK family, generalizes the zonal kernels (i.e., dot-product kernels on the unit sphere) by introducing radial factors in their Gegenbauer series expansion, and includes a wide range of ubiquitous kernel functions such as the entirety of dot-product kernels as well as the Gaussian and the recently introduced Neural Tangent kernels. Interestingly, by exploiting the reproducing property of the Gegenbauer polynomials, we can construct efficient random features for the GZK family based on randomly oriented Gegenbauer kernels. We prove subspace embedding guarantees for our Gegenbauer features which ensures that our features can be used for approximately solving learning problems such as kernel k-means clustering, kernel ridge regression, etc. Empirical results show that our proposed features outperform recent kernel approximation methods.  ( 2 min )
    Cascaded Debiasing : Studying the Cumulative Effect of Multiple Fairness-Enhancing Interventions. (arXiv:2202.03734v1 [cs.LG])
    Understanding the cumulative effect of multiple fairness enhancing interventions at different stages of the machine learning (ML) pipeline is a critical and underexplored facet of the fairness literature. Such knowledge can be valuable to data scientists/ML practitioners in designing fair ML pipelines. This paper takes the first step in exploring this area by undertaking an extensive empirical study comprising 60 combinations of interventions, 9 fairness metrics, 2 utility metrics (Accuracy and F1 Score) across 4 benchmark datasets. We quantitatively analyze the experimental data to measure the impact of multiple interventions on fairness, utility and population groups. We found that applying multiple interventions results in better fairness and lower utility than individual interventions on aggregate. However, adding more interventions do no always result in better fairness or worse utility. The likelihood of achieving high performance (F1 Score) along with high fairness increases with larger number of interventions. On the downside, we found that fairness-enhancing interventions can negatively impact different population groups, especially the privileged group. This study highlights the need for new fairness metrics that account for the impact on different population groups apart from just the disparity between groups. Lastly, we offer a list of combinations of interventions that perform best for different fairness and utility metrics to aid the design of fair ML pipelines.  ( 2 min )
    data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. (arXiv:2202.03555v1 [cs.LG])
    While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.  ( 2 min )
    Noise Regularizes Over-parameterized Rank One Matrix Recovery, Provably. (arXiv:2202.03535v1 [cs.LG])
    We investigate the role of noise in optimization algorithms for learning over-parameterized models. Specifically, we consider the recovery of a rank one matrix $Y^*\in R^{d\times d}$ from a noisy observation $Y$ using an over-parameterization model. We parameterize the rank one matrix $Y^*$ by $XX^\top$, where $X\in R^{d\times d}$. We then show that under mild conditions, the estimator, obtained by the randomly perturbed gradient descent algorithm using the square loss function, attains a mean square error of $O(\sigma^2/d)$, where $\sigma^2$ is the variance of the observational noise. In contrast, the estimator obtained by gradient descent without random perturbation only attains a mean square error of $O(\sigma^2)$. Our result partially justifies the implicit regularization effect of noise when learning over-parameterized models, and provides new understanding of training over-parameterized neural networks.  ( 2 min )
    Domain Adversarial Spatial-Temporal Network: A Transferable Framework for Short-term Traffic Forecasting across Cities. (arXiv:2202.03630v1 [cs.LG])
    Accurate real-time traffic forecast is critical for intelligent transportation systems (ITS) and it serves as the cornerstone of various smart mobility applications. Though this research area is dominated by deep learning, recent studies indicate that the accuracy improvement by developing new model structures is becoming marginal. Instead, we envision that the improvement can be achieved by transferring the "forecasting-related knowledge" across cities with different data distributions and network topologies. To this end, this paper aims to propose a novel transferable traffic forecasting framework: Domain Adversarial Spatial-Temporal Network (DASTNet). DASTNet is pre-trained on multiple source networks and fine-tuned with the target network's traffic data. Specifically, we leverage the graph representation learning and adversarial domain adaptation techniques to learn the domain-invariant node embeddings, which are further incorporated to model the temporal traffic data. To the best of our knowledge, we are the first to employ adversarial multi-domain adaptation for network-wide traffic forecasting problems. DASTNet consistently outperforms all state-of-the-art baseline methods on three benchmark datasets. The trained DASTNet is applied to Hong Kong's new traffic detectors, and accurate traffic predictions can be delivered immediately (within one day) when the detector is available. Overall, this study suggests an alternative to enhance the traffic forecasting methods and provides practical implications for cities lacking historical traffic data.  ( 2 min )
    Bandit Sampling for Multiplex Networks. (arXiv:2202.03621v1 [cs.LG])
    Graph neural networks have gained prominence due to their excellent performance in many classification and prediction tasks. In particular, they are used for node classification and link prediction which have a wide range of applications in social networks, biomedical data sets, and financial transaction graphs. Most of the existing work focuses primarily on the monoplex setting where we have access to a network with only a single type of connection between entities. However, in the multiplex setting, where there are multiple types of connections, or \emph{layers}, between entities, performance on tasks such as link prediction has been shown to be stronger when information from other connection types is taken into account. We propose an algorithm for scalable learning on multiplex networks with a large number of layers. The efficiency of our method is enabled by an online learning algorithm that learns how to sample relevant neighboring layers so that only the layers with relevant information are aggregated during training. This sampling differs from prior work, such as MNE, which aggregates information across \emph{all} layers and consequently leads to computational intractability on large networks. Our approach also improves on the recent layer sampling method of \textsc{DeePlex} in that the unsampled layers do not need to be trained, enabling further increases in efficiency.We present experimental results on both synthetic and real-world scenarios that demonstrate the practical effectiveness of our proposed approach.  ( 2 min )
    ECRECer: Enzyme Commission Number Recommendation and Benchmarking based on Multiagent Dual-core Learning. (arXiv:2202.03632v1 [cs.LG])
    Enzyme Commission (EC) numbers, which associate a protein sequence with the biochemical reactions it catalyzes, are essential for the accurate understanding of enzyme functions and cellular metabolism. Many ab-initio computational approaches were proposed to predict EC numbers for given input sequences directly. However, the prediction performance (accuracy, recall, precision), usability, and efficiency of existing methods still have much room to be improved. Here, we report ECRECer, a cloud platform for accurately predicting EC numbers based on novel deep learning techniques. To build ECRECer, we evaluate different protein representation methods and adopt a protein language model for protein sequence embedding. After embedding, we propose a multi-agent hierarchy deep learning-based framework to learn the proposed tasks in a multi-task manner. Specifically, we used an extreme multi-label classifier to perform the EC prediction and employed a greedy strategy to integrate and fine-tune the final model. Comparative analyses against four representative methods demonstrate that ECRECer delivers the highest performance, which improves accuracy and F1 score by 70% and 20% over the state-of-the-the-art, respectively. With ECRECer, we can annotate numerous enzymes in the Swiss-Prot database with incomplete EC numbers to their full fourth level. Take UniPort protein "A0A0U5GJ41" as an example (1.14.-.-), ECRECer annotated it with "1.14.11.38", which supported by further protein structure analysis based on AlphaFold2. Finally, we established a webserver (https://ecrecer.biodesign.ac.cn) and provided an offline bundle to improve usability.  ( 2 min )
    Fourier Representations for Black-Box Optimization over Categorical Variables. (arXiv:2202.03712v1 [cs.LG])
    Optimization of real-world black-box functions defined over purely categorical variables is an active area of research. In particular, optimization and design of biological sequences with specific functional or structural properties have a profound impact in medicine, materials science, and biotechnology. Standalone search algorithms, such as simulated annealing (SA) and Monte Carlo tree search (MCTS), are typically used for such optimization problems. In order to improve the performance and sample efficiency of such algorithms, we propose to use existing methods in conjunction with a surrogate model for the black-box evaluations over purely categorical variables. To this end, we present two different representations, a group-theoretic Fourier expansion and an abridged one-hot encoded Boolean Fourier expansion. To learn such representations, we consider two different settings to update our surrogate model. First, we utilize an adversarial online regression setting where Fourier characters of each representation are considered as experts and their respective coefficients are updated via an exponential weight update rule each time the black box is evaluated. Second, we consider a Bayesian setting where queries are selected via Thompson sampling and the posterior is updated via a sparse Bayesian regression model (over our proposed representation) with a regularized horseshoe prior. Numerical experiments over synthetic benchmarks as well as real-world RNA sequence optimization and design problems demonstrate the representational power of the proposed methods, which achieve competitive or superior performance compared to state-of-the-art counterparts, while improving the computation cost and/or sample efficiency, substantially.  ( 2 min )
    Fair SA: Sensitivity Analysis for Fairness in Face Recognition. (arXiv:2202.03586v1 [cs.CV])
    As the use of deep learning in high impact domains becomes ubiquitous, it is increasingly important to assess the resilience of models. One such high impact domain is that of face recognition, with real world applications involving images affected by various degradations, such as motion blur or high exposure. Moreover, images captured across different attributes, such as gender and race, can also challenge the robustness of a face recognition algorithm. While traditional summary statistics suggest that the aggregate performance of face recognition models has continued to improve, these metrics do not directly measure the robustness or fairness of the models. Visual Psychophysics Sensitivity Analysis (VPSA) [1] provides a way to pinpoint the individual causes of failure by way of introducing incremental perturbations in the data. However, perturbations may affect subgroups differently. In this paper, we propose a new fairness evaluation based on robustness in the form of a generic framework that extends VPSA. With this framework, we can analyze the ability of a model to perform fairly for different subgroups of a population affected by perturbations, and pinpoint the exact failure modes for a subgroup by measuring targeted robustness. With the increasing focus on the fairness of models, we use face recognition as an example application of our framework and propose to compactly visualize the fairness analysis of a model via AUC matrices. We analyze the performance of common face recognition models and empirically show that certain subgroups are at a disadvantage when images are perturbed, thereby uncovering trends that were not visible using the model's performance on subgroups without perturbations.  ( 2 min )
    Accelerating Part-Scale Simulation in Liquid Metal Jet Additive Manufacturing via Operator Learning. (arXiv:2202.03665v1 [physics.flu-dyn])
    Predicting part quality for additive manufacturing (AM) processes requires high-fidelity numerical simulation of partial differential equations (PDEs) governing process multiphysics on a scale of minimum manufacturable features. This makes part-scale predictions computationally demanding, especially when they require many small-scale simulations. We consider drop-on-demand liquid metal jetting (LMJ) as an illustrative example of such computational complexity. A model describing droplet coalescence for LMJ may include coupled incompressible fluid flow, heat transfer, and phase change equations. Numerically solving these equations becomes prohibitively expensive when simulating the build process for a full part consisting of thousands to millions of droplets. Reduced-order models (ROMs) based on neural networks (NN) or k-nearest neighbor (kNN) algorithms have been built to replace the original physics-based solver and are computationally tractable for part-level simulations. However, their quick inference capabilities often come at the expense of accuracy, robustness, and generalizability. We apply an operator learning (OL) approach to learn a mapping between initial and final states of the droplet coalescence process for enabling rapid and accurate part-scale build simulation. Preliminary results suggest that OL requires order-of-magnitude fewer data points than a kNN approach and is generalizable beyond the training set while achieving similar prediction error.  ( 2 min )
    DeepStability: A Study of Unstable Numerical Methods and Their Solutions in Deep Learning. (arXiv:2202.03493v1 [cs.LG])
    Deep learning (DL) has become an integral part of solutions to various important problems, which is why ensuring the quality of DL systems is essential. One of the challenges of achieving reliability and robustness of DL software is to ensure that algorithm implementations are numerically stable. DL algorithms require a large amount and a wide variety of numerical computations. A naive implementation of numerical computation can lead to errors that may result in incorrect or inaccurate learning and results. A numerical algorithm or a mathematical formula can have several implementations that are mathematically equivalent, but have different numerical stability properties. Designing numerically stable algorithm implementations is challenging, because it requires an interdisciplinary knowledge of software engineering, DL, and numerical analysis. In this paper, we study two mature DL libraries PyTorch and Tensorflow with the goal of identifying unstable numerical methods and their solutions. Specifically, we investigate which DL algorithms are numerically unstable and conduct an in-depth analysis of the root cause, manifestation, and patches to numerical instabilities. Based on these findings, we launch, the first database of numerical stability issues and solutions in DL. Our findings and provide future references to developers and tool builders to prevent, detect, localize and fix numerically unstable algorithm implementations. To demonstrate that, using {\it DeepStability} we have located numerical stability issues in Tensorflow, and submitted a fix which has been accepted and merged in.  ( 2 min )
    Integration of a machine learning model into a decision support tool to predict absenteeism at work of prospective employees. (arXiv:2202.03577v1 [cs.LG])
    Purpose - Inefficient hiring may result in lower productivity and higher training costs. Productivity losses caused by absenteeism at work cost U.S. employers billions of dollars each year. Also, employers typically spend a considerable amount of time managing employees who perform poorly. The purpose of this study is to develop a decision support tool to predict absenteeism among potential employees. Design/methodology/approach - We utilized a popular open-access dataset. In order to categorize absenteeism classes, the data have been preprocessed, and four methods of machine learning classification have been applied: Multinomial Logistic Regression (MLR), Support Vector Machines (SVM), Artificial Neural Networks (ANN), and Random Forests (RF). We selected the best model, based on several validation scores, and compared its performance against the existing model; we then integrated the best model into our proposed web-based for hiring managers. Findings - A web-based decision tool allows hiring managers to make more informed decisions before hiring a potential employee, thus reducing time, financial loss and reducing the probability of economic insolvency. Originality/value - In this paper, we propose a model that is trained based on attributes that can be collected during the hiring process. Furthermore, hiring managers may lack experience in machine learning or do not have the time to spend developing machine learning algorithms. Thus, we propose a web-based interactive tool that can be used without prior knowledge of machine learning algorithms.  ( 2 min )
    Impact of Parameter Sparsity on Stochastic Gradient MCMC Methods for Bayesian Deep Learning. (arXiv:2202.03770v1 [cs.LG])
    Bayesian methods hold significant promise for improving the uncertainty quantification ability and robustness of deep neural network models. Recent research has seen the investigation of a number of approximate Bayesian inference methods for deep neural networks, building on both the variational Bayesian and Markov chain Monte Carlo (MCMC) frameworks. A fundamental issue with MCMC methods is that the improvements they enable are obtained at the expense of increased computation time and model storage costs. In this paper, we investigate the potential of sparse network structures to flexibly trade-off model storage costs and inference run time against predictive performance and uncertainty quantification ability. We use stochastic gradient MCMC methods as the core Bayesian inference method and consider a variety of approaches for selecting sparse network structures. Surprisingly, our results show that certain classes of randomly selected substructures can perform as well as substructures derived from state-of-the-art iterative pruning methods while drastically reducing model training times.  ( 2 min )
    Structured Prediction Problem Archive. (arXiv:2202.03574v1 [cs.LG])
    Structured prediction problems are one of the fundamental tools in machine learning. In order to facilitate algorithm development for their numerical solution, we collect in one place a large number of datasets in easy to read formats for a diverse set of problem classes. We provide archival links to datasets, description of the considered problems and problem formats, and a short summary of problem characteristics including size, number of instances etc. For reference we also give a non-exhaustive selection of algorithms proposed in the literature for their solution. We hope that this central repository will make benchmarking and comparison to established works easier. We welcome submission of interesting new datasets and algorithms for inclusion in our archive.  ( 2 min )
    Local Explanations for Reinforcement Learning. (arXiv:2202.03597v1 [cs.LG])
    Many works in explainable AI have focused on explaining black-box classification models. Explaining deep reinforcement learning (RL) policies in a manner that could be understood by domain users has received much less attention. In this paper, we propose a novel perspective to understanding RL policies based on identifying important states from automatically learned meta-states. The key conceptual difference between our approach and many previous ones is that we form meta-states based on locality governed by the expert policy dynamics rather than based on similarity of actions, and that we do not assume any particular knowledge of the underlying topology of the state space. Theoretically, we show that our algorithm to find meta-states converges and the objective that selects important states from each meta-state is submodular leading to efficient high quality greedy selection. Experiments on four domains (four rooms, door-key, minipacman, and pong) and a carefully conducted user study illustrate that our perspective leads to better understanding of the policy. We conjecture that this is a result of our meta-states being more intuitive in that the corresponding important states are strong indicators of tractable intermediate goals that are easier for humans to interpret and follow.  ( 2 min )
    Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling. (arXiv:2202.03543v1 [eess.AS])
    In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, which is a Transformer-based model that learns to associate raw speech waveforms with semantically related images, all without the use of any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaST-VGS+, which is learned in a multi-task fashion with a masked language modeling objective in addition to the visual grounding objective. On ZeroSpeech 2021, we show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task. On the SUPERB benchmark, we show that our models also achieve strong performance, in some cases even outperforming the popular wav2vec2.0 model.  ( 2 min )
    NxtPost: User to Post Recommendations in Facebook Groups. (arXiv:2202.03645v1 [cs.LG])
    In this paper, we present NxtPost, a deployed user-to-post content-based sequential recommender system for Facebook Groups. Inspired by recent advances in NLP, we have adapted a Transformer-based model to the domain of sequential recommendation. We explore causal masked multi-head attention that optimizes both short and long-term user interests. From a user's past activities validated by defined safety process, NxtPost seeks to learn a representation for the user's dynamic content preference and to predict the next post user may be interested in. In contrast to previous Transformer-based methods, we do not assume that the recommendable posts have a fixed corpus. Accordingly, we use an external item/token embedding to extend a sequence-based approach to a large vocabulary. We achieve 49% abs. improvement in offline evaluation. As a result of NxtPost deployment, 0.6% more users are meeting new people, engaging with the community, sharing knowledge and getting support. The paper shares our experience in developing a personalized sequential recommender system, lessons deploying the model for cold start users, how to deal with freshness, and tuning strategies to reach higher efficiency in online A/B experiments.  ( 2 min )
    Width is Less Important than Depth in ReLU Neural Networks. (arXiv:2202.03841v1 [cs.LG])
    We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.  ( 2 min )
    Convolutional Neural Networks on Graphs with Chebyshev Approximation, Revisited. (arXiv:2202.03580v1 [cs.LG])
    Designing spectral convolutional networks is a challenging problem in graph learning. ChebNet, one of the early attempts, approximates the spectral convolution using Chebyshev polynomials. GCN simplifies ChebNet by utilizing only the first two Chebyshev polynomials while still outperforming it on real-world datasets. GPR-GNN and BernNet demonstrate that the Monomial and Bernstein bases also outperform the Chebyshev basis in terms of learning the spectral convolution. Such conclusions are counter-intuitive in the field of approximation theory, where it is established that the Chebyshev polynomial achieves the optimum convergent rate for approximating a function. In this paper, we revisit the problem of approximating the spectral convolution with Chebyshev polynomials. We show that ChebNet's inferior performance is primarily due to illegal coefficients learnt by ChebNet approximating analytic filter functions, which leads to over-fitting. We then propose ChebNetII, a new GNN model based on Chebyshev interpolation, which enhances the original Chebyshev polynomial approximation while reducing the Runge phenomena. We conducted an extensive experimental study to demonstrate that ChebNetII can learn arbitrary graph spectrum filters and achieve superior performance in both full- and semi-supervised node classification tasks.  ( 2 min )
    Stop Oversampling for Class Imbalance Learning: A Critical Review. (arXiv:2202.03579v1 [cs.LG])
    For the last two decades, oversampling has been employed to overcome the challenge of learning from imbalanced datasets. Many approaches to solving this challenge have been offered in the literature. Oversampling, on the other hand, is a concern. That is, models trained on fictitious data may fail spectacularly when put to real-world problems. The fundamental difficulty with oversampling approaches is that, given a real-life population, the synthesized samples may not truly belong to the minority class. As a result, training a classifier on these samples while pretending they represent minority may result in incorrect predictions when the model is used in the real world. We analyzed a large number of oversampling methods in this paper and devised a new oversampling evaluation system based on hiding a number of majority examples and comparing them to those generated by the oversampling process. Based on our evaluation system, we ranked all these methods based on their incorrectly generated examples for comparison. Our experiments using more than 70 oversampling methods and three imbalanced real-world datasets reveal that all oversampling methods studied generate minority samples that are most likely to be majority. Given data and methods in hand, we argue that oversampling in its current forms and methodologies is unreliable for learning from class imbalanced data and should be avoided in real-world applications.  ( 2 min )
    Deep Reinforcement Learning Assisted Federated Learning Algorithm for Data Management of IIoT. (arXiv:2202.03575v1 [cs.LG])
    The continuous expanded scale of the industrial Internet of Things (IIoT) leads to IIoT equipments generating massive amounts of user data every moment. According to the different requirement of end users, these data usually have high heterogeneity and privacy, while most of users are reluctant to expose them to the public view. How to manage these time series data in an efficient and safe way in the field of IIoT is still an open issue, such that it has attracted extensive attention from academia and industry. As a new machine learning (ML) paradigm, federated learning (FL) has great advantages in training heterogeneous and private data. This paper studies the FL technology applications to manage IIoT equipment data in wireless network environments. In order to increase the model aggregation rate and reduce communication costs, we apply deep reinforcement learning (DRL) to IIoT equipment selection process, specifically to select those IIoT equipment nodes with accurate models. Therefore, we propose a FL algorithm assisted by DRL, which can take into account the privacy and efficiency of data training of IIoT equipment. By analyzing the data characteristics of IIoT equipments, we use MNIST, fashion MNIST and CIFAR-10 data sets to represent the data generated by IIoT. During the experiment, we employ the deep neural network (DNN) model to train the data, and experimental results show that the accuracy can reach more than 97\%, which corroborates the effectiveness of the proposed algorithm.
  • Open

    Locally Sparse Neural Networks for Tabular Biomedical Data. (arXiv:2106.06468v2 [cs.LG] UPDATED)
    Tabular datasets with low-sample-size or many variables are prevalent in biomedicine. Practitioners in this domain prefer linear or tree-based models over neural networks since the latter are harder to interpret and tend to overfit when applied to tabular datasets. To address these neural networks' shortcomings, we propose an intrinsically interpretable network for heterogeneous biomedical data. We design a locally sparse neural network where the local sparsity is learned to identify the subset of most relevant features for each sample. This sample-specific sparsity is predicted via a \textit{gating} network, which is trained in tandem with the \textit{prediction} network. By forcing the model to select a subset of the most informative features for each sample, we reduce model overfitting in low-sample-size data and obtain an interpretable model. We demonstrate that our method outperforms state-of-the-art models when applied to synthetic or real-world biomedical datasets using extensive experiments. Furthermore, the proposed framework dramatically outperforms existing schemes when evaluating its interpretability capabilities. Finally, we demonstrate the applicability of our model to two important biomedical tasks: survival analysis and marker gene identification.  ( 2 min )
    Theoretical characterization of uncertainty in high-dimensional linear classification. (arXiv:2202.03295v1 [cs.LG] CROSS LISTED)
    Being able to reliably assess not only the accuracy but also the uncertainty of models' predictions is an important endeavour in modern machine learning. Even if the model generating the data and labels is known, computing the intrinsic uncertainty after learning the model from a limited number of samples amounts to sampling the corresponding posterior probability measure. Such sampling is computationally challenging in high-dimensional problems and theoretical results on heuristic uncertainty estimators in high-dimensions are thus scarce. In this manuscript, we characterise uncertainty for learning from limited number of samples of high-dimensional Gaussian input data and labels generated by the probit model. We prove that the Bayesian uncertainty (i.e. the posterior marginals) can be asymptotically obtained by the approximate message passing algorithm, bypassing the canonical but costly Monte Carlo sampling of the posterior. We then provide a closed-form formula for the joint statistics between the logistic classifier, the uncertainty of the statistically optimal Bayesian classifier and the ground-truth probit uncertainty. The formula allows us to investigate calibration of the logistic classifier learning from limited amount of samples. We discuss how over-confidence can be mitigated by appropriately regularising, and show that cross-validating with respect to the loss leads to better calibration than with the 0/1 error.  ( 2 min )
    Deep Probability Estimation. (arXiv:2111.10734v2 [cs.LG] UPDATED)
    Reliable probability estimation is of crucial importance in many real-world applications where there is inherent uncertainty, such as weather forecasting, medical prognosis, or collision avoidance in autonomous vehicles. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous to binary classification, with the important difference that the objective is to estimate probabilities rather than predicting the specific outcome. The goal of this work is to investigate probability estimation from high-dimensional data using deep neural networks. There exist several methods to improve the probabilities generated by these models but they mostly focus on classification problems where the probabilities are related to model uncertainty. In the case of problems with inherent uncertainty, it is challenging to evaluate performance without access to ground-truth probabilities. To address this, we build a synthetic dataset to study and compare different computable metrics. We evaluate existing methods on the synthetic data as well as on three real-world probability estimation tasks, all of which involve inherent uncertainty. We also give a theoretical analysis of a model for high-dimensional probability estimation which reproduces several of the phenomena evinced in our experiments. Finally, we propose a new method for probability estimation using neural networks, which modifies the training process to promote output probabilities that are consistent with empirical probabilities computed from the data. The method outperforms existing approaches on most metrics on the simulated as well as real-world data.  ( 2 min )
    Diversified Sampling for Batched Bayesian Optimization with Determinantal Point Processes. (arXiv:2110.11665v2 [cs.LG] UPDATED)
    In Bayesian Optimization (BO) we study black-box function optimization with noisy point evaluations and Bayesian priors. Convergence of BO can be greatly sped up by batching, where multiple evaluations of the black-box function are performed in a single round. The main difficulty in this setting is to propose at the same time diverse and informative batches of evaluation points. In this work, we introduce DPP-Batch Bayesian Optimization (DPP-BBO), a universal framework for inducing batch diversity in sampling based BO by leveraging the repulsive properties of Determinantal Point Processes (DPP) to naturally diversify the batch sampling procedure. We illustrate this framework by formulating DPP-Thompson Sampling (DPP-TS) as a variant of the popular Thompson Sampling (TS) algorithm and introducing a Markov Chain Monte Carlo procedure to sample from it. We then prove novel Bayesian simple regret bounds for both classical batched TS as well as our counterpart DPP-TS, with the latter bound being tighter. Our real-world, as well as synthetic, experiments demonstrate improved performance of DPP-BBO over classical batching methods with Gaussian process and Cox process models.  ( 2 min )
    Robust Neural Network Classification via Double Regularization. (arXiv:2112.08102v2 [stat.ML] UPDATED)
    The presence of mislabeled observations in data is a notoriously challenging problem in statistics and machine learning, associated with poor generalization properties for both traditional classifiers and, perhaps even more so, flexible classifiers like neural networks. Here we propose a novel double regularization of the neural network training loss that combines a penalty on the complexity of the classification model and an optimal reweighting of training observations. The combined penalties result in improved generalization properties and strong robustness against overfitting in different settings of mislabeled training data and also against variation in initial parameter values when training. We provide a theoretical justification for our proposed method derived for a simple case of logistic regression. We demonstrate the double regularization model, here denoted by DRFit, for neural net classification of (i) MNIST and (ii) CIFAR-10, in both cases with simulated mislabeling. We also illustrate that DRFit identifies mislabeled data points with very good precision. This provides strong support for DRFit as a practical of-the-shelf classifier, since, without any sacrifice in performance, we get a classifier that simultaneously reduces overfitting against mislabeling and gives an accurate measure of the trustworthiness of the labels.  ( 2 min )
    Asymptotics of representation learning in finite Bayesian neural networks. (arXiv:2106.00651v5 [cs.LG] UPDATED)
    Recent works have suggested that finite Bayesian neural networks may sometimes outperform their infinite cousins because finite networks can flexibly adapt their internal representations. However, our theoretical understanding of how the learned hidden layer representations of finite networks differ from the fixed representations of infinite networks remains incomplete. Perturbative finite-width corrections to the network prior and posterior have been studied, but the asymptotics of learned features have not been fully characterized. Here, we argue that the leading finite-width corrections to the average feature kernels for any Bayesian network with linear readout and Gaussian likelihood have a largely universal form. We illustrate this explicitly for three tractable network architectures: deep linear fully-connected and convolutional networks, and networks with a single nonlinear hidden layer. Our results begin to elucidate how task-relevant learning signals shape the hidden layer representations of wide Bayesian neural networks.
    Forecasting: theory and practice. (arXiv:2012.03854v4 [stat.AP] CROSS LISTED)
    Forecasting has always been at the forefront of decision making and planning. The uncertainty that surrounds the future is both exciting and challenging, with individuals and organisations seeking to minimise risks and maximise utilities. The large number of forecasting applications calls for a diverse set of forecasting methods to tackle real-life challenges. This article provides a non-systematic review of the theory and the practice of forecasting. We provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organise, and evaluate forecasts. We then demonstrate how such theoretical concepts are applied in a variety of real-life contexts. We do not claim that this review is an exhaustive list of methods and applications. However, we wish that our encyclopedic presentation will offer a point of reference for the rich work that has been undertaken over the last decades, with some key insights for the future of forecasting theory and practice. Given its encyclopedic nature, the intended mode of reading is non-linear. We offer cross-references to allow the readers to navigate through the various topics. We complement the theoretical concepts and applications covered by large lists of free or open-source software implementations and publicly-available databases.
    Hair Color Digitization through Imaging and Deep Inverse Graphics. (arXiv:2202.03723v1 [cs.GR])
    Hair appearance is a complex phenomenon due to hair geometry and how the light bounces on different hair fibers. For this reason, reproducing a specific hair color in a rendering environment is a challenging task that requires manual work and expert knowledge in computer graphics to tune the result visually. While current hair capture methods focus on hair shape estimation many applications could benefit from an automated method for capturing the appearance of a physical hair sample, from augmented/virtual reality to hair dying development. Building on recent advances in inverse graphics and material capture using deep neural networks, we introduce a novel method for hair color digitization. Our proposed pipeline allows capturing the color appearance of a physical hair sample and renders synthetic images of hair with a similar appearance, simulating different hair styles and/or lighting environments. Since rendering realistic hair images requires path-tracing rendering, the conventional inverse graphics approach based on differentiable rendering is untractable. Our method is based on the combination of a controlled imaging device, a path-tracing renderer, and an inverse graphics model based on self-supervised machine learning, which does not require to use differentiable rendering to be trained. We illustrate the performance of our hair digitization method on both real and synthetic images and show that our approach can accurately capture and render hair color.
    Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent. (arXiv:2112.07110v7 [stat.ML] UPDATED)
    The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties (e.g. escaping low potential points and regularization). Past research has indicated that the covariance of the SGD error done via minibatching plays a critical role in determining its regularization and escape from low potential points. It is however not much explored how much the distribution of the error influences the behavior of the algorithm. Motivated by some new research in this area, we prove universality results by showing that noise classes that have the same mean and covariance structure of SGD via minibatching have similar properties. We mainly consider the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm as introduced by Wu et al., which has a much more general noise class than the SGD algorithm done via minibatching. We establish nonasymptotic bounds for the M-SGD algorithm mainly with respect to the Stochastic Differential Equation corresponding to SGD via minibatching. We also show that the M-SGD error is approximately a scaled Gaussian distribution with mean $0$ at any fixed point of the M-SGD algorithm. We also establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
    A Theory of the Inductive Bias and Generalization of Kernel Regression and Wide Neural Networks. (arXiv:2110.03922v3 [cs.LG] UPDATED)
    Kernel regression is an important nonparametric learning algorithm with an equivalence to neural networks in the infinite-width limit. Understanding its generalization behavior is thus an important task for machine learning theory. In this work, we provide a theory of the inductive bias and generalization of kernel regression using a new measure characterizing the "learnability" of a given target function. We prove that a kernel's inductive bias can be characterized as a fixed budget of learnability, allocated to its eigenmodes, that can only be increased with the addition of more training data. We then use this rule to derive expressions for the mean and covariance of the predicted function and gain insight into the overfitting and adversarial robustness of kernel regression and the hardness of the classic parity problem. We show agreement between our theoretical results and both kernel regression and wide finite networks on real and synthetic learning tasks.  ( 2 min )
    A Kernel Test for Causal Association via Noise Contrastive Backdoor Adjustment. (arXiv:2111.13226v3 [stat.ME] UPDATED)
    Causal inference grows increasingly complex as the number of confounders increases. Given treatments $X$, confounders $Z$ and outcomes $Y$, we develop a non-parametric method to test the \textit{do-null} hypothesis $H_0:\; p(y|\text{\it do}(X=x))=p(y)$ against the general alternative. Building on the Hilbert Schmidt Independence Criterion (HSIC) for marginal independence testing, we propose backdoor-HSIC (bd-HSIC) and demonstrate that it is calibrated and has power for both binary and continuous treatments under a large number of confounders. Additionally, we establish convergence properties of the estimators of covariance operators used in bd-HSIC. We investigate the advantages and disadvantages of bd-HSIC against parametric tests as well as the importance of using the do-null testing in contrast to marginal independence testing or conditional independence testing. A complete implementation can be found at \hyperlink{https://github.com/MrHuff/kgformula}{\texttt{https://github.com/MrHuff/kgformula}}.  ( 2 min )
    Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path. (arXiv:2106.02073v3 [cs.LG] UPDATED)
    The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSE-NC and CE-NC: at https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.  ( 2 min )
    Advantage of Deep Neural Networks for Estimating Functions with Singularity on Hypersurfaces. (arXiv:2011.02256v2 [stat.ML] UPDATED)
    We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fill this gap by considering the estimation for a class of non-smooth functions that have singularities on hypersurfaces. Our findings are as follows: (i) We derive the generalization error of a DNN estimator and prove that its convergence rate is almost optimal. (ii) We elucidate a phase diagram of estimation problems, which describes the situations where the DNNs outperform a general class of estimators, including kernel methods, Gaussian process methods, and others. We additionally show that DNNs outperform harmonic analysis based estimators. This advantage of DNNs comes from the fact that a shape of singularity can be successfully handled by their multi-layered structure.  ( 2 min )
    TACTiS: Transformer-Attentional Copulas for Time Series. (arXiv:2202.03528v1 [cs.LG])
    The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance. However, the practical utility of such estimates is limited by how accurately they quantify predictive uncertainty. In this work, we address the problem of estimating the joint predictive distribution of high-dimensional multivariate time series. We propose a versatile method, based on the transformer architecture, that estimates joint distributions using an attention-based decoder that provably learns to mimic the properties of non-parametric copulas. The resulting model has several desirable properties: it can scale to hundreds of time series, supports both forecasting and interpolation, can handle unaligned and non-uniformly sampled data, and can seamlessly adapt to missing data during training. We demonstrate these properties empirically and show that our model produces state-of-the-art predictions on several real-world datasets.  ( 2 min )
    Nesterov Accelerated Shuffling Gradient Method for Convex Optimization. (arXiv:2202.03525v1 [math.OC])
    In this paper, we propose Nesterov Accelerated Shuffling Gradient (NASG), a new algorithm for the convex finite-sum minimization problems. Our method integrates the traditional Nesterov's acceleration momentum with different shuffling sampling schemes. We show that our algorithm has an improved rate of $\mathcal{O}(1/T)$ using unified shuffling schemes, where $T$ is the number of epochs. This rate is better than that of any other shuffling gradient methods in convex regime. Our convergence analysis does not require an assumption on bounded domain or a bounded gradient condition. For randomized shuffling schemes, we improve the convergence bound further. When employing some initial condition, we show that our method converges faster near the small neighborhood of the solution. Numerical simulations demonstrate the efficiency of our algorithm.  ( 2 min )
    Finite-Sum Optimization: A New Perspective for Convergence to a Global Solution. (arXiv:2202.03524v1 [cs.LG])
    Deep neural networks (DNNs) have shown great success in many machine learning tasks. Their training is challenging since the loss surface of the network architecture is generally non-convex, or even non-smooth. How and under what assumptions is guaranteed convergence to a \textit{global} minimum possible? We propose a reformulation of the minimization problem allowing for a new recursive algorithmic framework. By using bounded style assumptions, we prove convergence to an $\varepsilon$-(global) minimum using $\mathcal{\tilde{O}}(1/\varepsilon^3)$ gradient computations. Our theoretical foundation motivates further study, implementation, and optimization of the new algorithmic framework and further investigation of its non-standard bounded style assumptions. This new direction broadens our understanding of why and under what circumstances training of a DNN converges to a global minimum.  ( 2 min )
    Batch size-invariance for policy optimization. (arXiv:2110.00641v2 [cs.LG] UPDATED)
    We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.  ( 2 min )
    SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization. (arXiv:2109.09831v2 [cs.LG] UPDATED)
    Algorithm parameters, in particular hyperparameters of machine learning algorithms, can substantially impact their performance. To support users in determining well-performing hyperparameter configurations for their algorithms, datasets and applications at hand, SMAC3 offers a robust and flexible framework for Bayesian Optimization, which can improve performance within a few evaluations. It offers several facades and pre-sets for typical use cases, such as optimizing hyperparameters, solving low dimensional continuous (artificial) global optimization problems and configuring algorithms to perform well across multiple problem instances. The SMAC3 package is available under a permissive BSD-license at https://github.com/automl/SMAC3.  ( 2 min )
    Transfer Learning under High-dimensional Generalized Linear Models. (arXiv:2105.14328v3 [stat.ML] UPDATED)
    In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aim to improve the fit on target data by borrowing information from useful source data. Given which sources to transfer, we propose a transfer learning algorithm on GLM, and derive its $\ell_1/\ell_2$-estimation error bounds as well as a bound for a prediction error measure. The theoretical analysis shows that when the target and source are sufficiently close to each other, these bounds could be improved over those of the classical penalized estimator using only target data under mild conditions. When we don't know which sources to transfer, an algorithm-free transferable source detection approach is introduced to detect informative sources. The detection consistency is proved under the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals of each coefficient component, and the corresponding theories are provided. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package glmtrans, which is available on CRAN.  ( 2 min )
    Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics. (arXiv:2106.00734v2 [cs.LG] UPDATED)
    To understand better good generalization performance in state-of-the-art neural network (NN) models, and in particular the success of the ALPHAHAT metric based on Heavy-Tailed Self-Regularization (HT-SR) theory, we analyze of a corpus of models that was made publicly-available for a contest to predict the generalization accuracy of NNs. These models include a wide range of qualities and were trained with a range of architectures and regularization hyperparameters. We break ALPHAHAT into its two subcomponent metrics: a scale-based metric; and a shape-based metric. We identify what amounts to a Simpson's paradox: where "scale" metrics (from traditional statistical learning theory) perform well in aggregate, but can perform poorly on subpartitions of the data of a given depth, when regularization hyperparameters are varied; and where "shape" metrics (from HT-SR theory) perform well on each subpartition of the data, when hyperparameters are varied for models of a given depth, but can perform poorly overall when models with varying depths are aggregated. Our results highlight the subtlety of comparing models when both architectures and hyperparameters are varied; the complementary role of implicit scale versus implicit shape parameters in understanding NN model quality; and the need to go beyond one-size-fits-all metrics based on upper bounds from generalization theory to describe the performance of NN models. Our results also clarify further why the ALPHAHAT metric from HT-SR theory works so well at predicting generalization across a broad range of CV and NLP models.  ( 2 min )
    OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms. (arXiv:2006.05714v3 [cs.LG] UPDATED)
    Local Interpretable Model-Agnostic Explanations (LIME) is a popular method to perform interpretability of any kind of Machine Learning (ML) model. It explains one ML prediction at a time, by learning a simple linear model around the prediction. The model is trained on randomly generated data points, sampled from the training dataset distribution and weighted according to the distance from the reference point - the one being explained by LIME. Feature selection is applied to keep only the most important variables. LIME is widespread across different domains, although its instability - a single prediction may obtain different explanations - is one of the major shortcomings. This is due to the randomness in the sampling step, as well as to the flexibility in tuning the weights and determines a lack of reliability in the retrieved explanations, making LIME adoption problematic. In Medicine especially, clinical professionals trust is mandatory to determine the acceptance of an explainable algorithm, considering the importance of the decisions at stake and the related legal issues. In this paper, we highlight a trade-off between explanation's stability and adherence, namely how much it resembles the ML model. Exploiting our innovative discovery, we propose a framework to maximise stability, while retaining a predefined level of adherence. OptiLIME provides freedom to choose the best adherence-stability trade-off level and more importantly, it clearly highlights the mathematical properties of the retrieved explanation. As a result, the practitioner is provided with tools to decide whether the explanation is reliable, according to the problem at hand. We extensively test OptiLIME on a toy dataset - to present visually the geometrical findings - and a medical dataset. In the latter, we show how the method comes up with meaningful explanations both from a medical and mathematical standpoint.  ( 3 min )
    Entrywise Recovery Guarantees for Sparse PCA via Sparsistent Algorithms. (arXiv:2202.04061v1 [math.ST])
    Sparse Principal Component Analysis (PCA) is a prevalent tool across a plethora of subfields of applied statistics. While several results have characterized the recovery error of the principal eigenvectors, these are typically in spectral or Frobenius norms. In this paper, we provide entrywise $\ell_{2,\infty}$ bounds for Sparse PCA under a general high-dimensional subgaussian design. In particular, our results hold for any algorithm that selects the correct support with high probability, those that are sparsistent. Our bound improves upon known results by providing a finer characterization of the estimation error, and our proof uses techniques recently developed for entrywise subspace perturbation theory.  ( 2 min )
    Sparse-RS: a versatile framework for query-efficient sparse black-box adversarial attacks. (arXiv:2006.12834v3 [cs.LG] UPDATED)
    We propose a versatile framework based on random search, Sparse-RS, for score-based sparse targeted and untargeted attacks in the black-box setting. Sparse-RS does not rely on substitute models and achieves state-of-the-art success rate and query efficiency for multiple sparse attack models: $l_0$-bounded perturbations, adversarial patches, and adversarial frames. The $l_0$-version of untargeted Sparse-RS outperforms all black-box and even all white-box attacks for different models on MNIST, CIFAR-10, and ImageNet. Moreover, our untargeted Sparse-RS achieves very high success rates even for the challenging settings of $20\times20$ adversarial patches and $2$-pixel wide adversarial frames for $224\times224$ images. Finally, we show that Sparse-RS can be applied to generate targeted universal adversarial patches where it significantly outperforms the existing approaches. The code of our framework is available at https://github.com/fra31/sparse-rs.  ( 2 min )
    On Completeness-aware Concept-Based Explanations in Deep Neural Networks. (arXiv:1910.07969v6 [cs.LG] UPDATED)
    Human explanations of high-level decisions are often expressed in terms of key concepts the decisions are based on. In this paper, we study such concept-based explainability for Deep Neural Networks (DNNs). First, we define the notion of completeness, which quantifies how sufficient a particular set of concepts is in explaining a model's prediction behavior based on the assumption that complete concept scores are sufficient statistics of the model prediction. Next, we propose a concept discovery method that aims to infer a complete set of concepts that are additionally encouraged to be interpretable, which addresses the limitations of existing methods on concept explanations. To define an importance score for each discovered concept, we adapt game-theoretic notions to aggregate over sets and propose ConceptSHAP. Via proposed metrics and user studies, on a synthetic dataset with apriori-known concept explanations, as well as on real-world image and language datasets, we validate the effectiveness of our method in finding concepts that are both complete in explaining the decisions and interpretable. (The code is released at https://github.com/chihkuanyeh/concept_exp)  ( 2 min )
    Low-Rank Extragradient Method for Nonsmooth and Low-Rank Matrix Optimization Problems. (arXiv:2202.04026v1 [math.OC])
    Low-rank and nonsmooth matrix optimization problems capture many fundamental tasks in statistics and machine learning. While significant progress has been made in recent years in developing efficient methods for \textit{smooth} low-rank optimization problems that avoid maintaining high-rank matrices and computing expensive high-rank SVDs, advances for nonsmooth problems have been slow paced. In this paper we consider standard convex relaxations for such problems. Mainly, we prove that under a natural \textit{generalized strict complementarity} condition and under the relatively mild assumption that the nonsmooth objective can be written as a maximum of smooth functions, the \textit{extragradient method}, when initialized with a "warm-start" point, converges to an optimal solution with rate $O(1/t)$ while requiring only two \textit{low-rank} SVDs per iteration. We give a precise trade-off between the rank of the SVDs required and the radius of the ball in which we need to initialize the method. We support our theoretical results with empirical experiments on several nonsmooth low-rank matrix recovery tasks, demonstrating that using simple initializations, the extragradient method produces exactly the same iterates when full-rank SVDs are replaced with SVDs of rank that matches the rank of the (low-rank) ground-truth matrix to be recovered.  ( 2 min )
    Systematically improving existing k-means initialization algorithms at nearly no cost, by pairwise-nearest-neighbor smoothing. (arXiv:2202.03949v1 [cs.LG])
    We present a meta-method for initializing (seeding) the $k$-means clustering algorithm called PNN-smoothing. It consists in splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that when clustering the individual subsets any seeding algorithm can be used. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, PNN-smoothing is also almost linear with an appropriate choice of $J$, and in fact only at most a few percent slower in most cases in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically better costs. It can even be applied recursively, and easily parallelized. Our implementation is publicly available at https://github.com/carlobaldassi/KMeansPNNSmoothing.jl  ( 2 min )
    Efficient Algorithms for High-Dimensional Convex Subspace Optimization via Strict Complementarity. (arXiv:2202.04020v1 [math.OC])
    We consider optimization problems in which the goal is find a $k$-dimensional subspace of $\reals^n$, $k<<n$, which minimizes a convex and smooth loss. Such problemsgeneralize the fundamental task of principal component analysis (PCA) to include robust and sparse counterparts, and logistic PCA for binary data, among others. While this problem is not convex it admits natural algorithms with very efficient iterations and memory requirements, which is highly desired in high-dimensional regimes however, arguing about their fast convergence to a global optimal solution is difficult. On the other hand, there exists a simple convex relaxation for which convergence to the global optimum is straightforward, however corresponding algorithms are not efficient when the dimension is very large. In this work we present a natural deterministic sufficient condition so that the optimal solution to the convex relaxation is unique and is also the optimal solution to the original nonconvex problem. Mainly, we prove that under this condition, a natural highly-efficient nonconvex gradient method, which we refer to as \textit{gradient orthogonal iteration}, when initialized with a "warm-start", converges linearly for the nonconvex problem. We also establish similar results for the nonconvex projected gradient method, and the Frank-Wolfe method when applied to the convex relaxation. We conclude with empirical evidence on synthetic data which demonstrate the appeal of our approach.  ( 2 min )
    Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning. (arXiv:2202.04005v1 [cs.LG])
    Kernel-based models such as kernel ridge regression and Gaussian processes are ubiquitous in machine learning applications for regression and optimization. It is well known that a serious downside for kernel-based models is the high computational cost; given a dataset of $n$ samples, the cost grows as $\mathcal{O}(n^3)$. Existing sparse approximation methods can yield a significant reduction in the computational cost, effectively reducing the real world cost down to as low as $\mathcal{O}(n)$ in certain cases. Despite this remarkable empirical success, significant gaps remain in the existing results for the analytical confidence bounds on the error due to approximation. In this work, we provide novel confidence intervals for the Nystr\"om method and the sparse variational Gaussian processes approximation method. Our confidence intervals lead to improved error bounds in both regression and optimization. We establish these confidence intervals using novel interpretations of the approximate (surrogate) posterior variance of the models.  ( 2 min )
    Spectral embedding and the latent geometry of multipartite networks. (arXiv:2202.03945v1 [stat.ME])
    Spectral embedding finds vector representations of the nodes of a network, based on the eigenvectors of its adjacency or Laplacian matrix, and has found applications throughout the sciences. Many such networks are multipartite, meaning their nodes can be divided into partitions and nodes of the same partition are never connected. When the network is multipartite, this paper demonstrates that the node representations obtained via spectral embedding live near partition-specific low-dimensional subspaces of a higher-dimensional ambient space. For this reason we propose a follow-on step after spectral embedding, to recover node representations in their intrinsic rather than ambient dimension, proving uniform consistency under a low-rank, inhomogeneous random graph model. Our method naturally generalizes bipartite spectral embedding, in which node representations are obtained by singular value decomposition of the biadjacency or bi-Laplacian matrix.  ( 2 min )
    Data Consistency for Weakly Supervised Learning. (arXiv:2202.03987v1 [cs.LG])
    In many applications, training machine learning models involves using large amounts of human-annotated data. Obtaining precise labels for the data is expensive. Instead, training with weak supervision provides a low-cost alternative. We propose a novel weak supervision algorithm that processes noisy labels, i.e., weak signals, while also considering features of the training data to produce accurate labels for training. Our method searches over classifiers of the data representation to find plausible labelings. We call this paradigm data consistent weak supervision. A key facet of our framework is that we are able to estimate labels for data examples low or no coverage from the weak supervision. In addition, we make no assumptions about the joint distribution of the weak signals and true labels of the data. Instead, we use weak signals and the data features to solve a constrained optimization that enforces data consistency among the labels we generate. Empirical evaluation of our method on different datasets shows that it significantly outperforms state-of-the-art weak supervision methods on both text and image classification tasks.  ( 2 min )
    Width is Less Important than Depth in ReLU Neural Networks. (arXiv:2202.03841v1 [cs.LG])
    We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.  ( 2 min )
    Robust Hybrid Learning With Expert Augmentation. (arXiv:2202.03881v1 [cs.LG])
    Hybrid modelling reduces the misspecification of expert models by combining them with machine learning (ML) components learned from data. Like for many ML algorithms, hybrid model performance guarantees are limited to the training distribution. Leveraging the insight that the expert model is usually valid even outside the training domain, we overcome this limitation by introducing a hybrid data augmentation strategy termed \textit{expert augmentation}. Based on a probabilistic formalization of hybrid modelling, we show why expert augmentation improves generalization. Finally, we validate the practical benefits of augmented hybrid models on a set of controlled experiments, modelling dynamical systems described by ordinary and partial differential equations.  ( 2 min )
    Distribution Regression with Sliced Wasserstein Kernels. (arXiv:2202.03926v1 [stat.ML])
    The problem of learning functions over spaces of probabilities - or distribution regression - is gaining significant interest in the machine learning community. A key challenge behind this problem is to identify a suitable representation capturing all relevant properties of the underlying functional mapping. A principled approach to distribution regression is provided by kernel mean embeddings, which lifts kernel-induced similarity on the input domain at the probability level. This strategy effectively tackles the two-stage sampling nature of the problem, enabling one to derive estimators with strong statistical guarantees, such as universal consistency and excess risk bounds. However, kernel mean embeddings implicitly hinge on the maximum mean discrepancy (MMD), a metric on probabilities, which may fail to capture key geometrical relations between distributions. In contrast, optimal transport (OT) metrics, are potentially more appealing, as documented by the recent literature on the topic. In this work, we propose the first OT-based estimator for distribution regression. We build on the Sliced Wasserstein distance to obtain an OT-based representation. We study the theoretical properties of a kernel ridge regression estimator based on such representation, for which we prove universal consistency and excess risk bounds. Preliminary experiments complement our theoretical findings by showing the effectiveness of the proposed approach and compare it with MMD-based estimators.  ( 2 min )
    Optimal Transport of Binary Classifiers to Fairness. (arXiv:2202.03814v1 [cs.LG])
    Much of the past work on fairness in machine learning has focused on forcing the predictions of classifiers to have similar statistical properties for individuals of different demographics. Yet, such methods often simply perform a rescaling of the classifier scores and ignore whether individuals of different groups have similar features. Our proposed method, Optimal Transport to Fairness (OTF), applies Optimal Transport (OT) to take this similarity into account by quantifying unfairness as the smallest cost of OT between a classifier and any score function that satisfies fairness constraints. For a flexible class of linear fairness constraints, we show a practical way to compute OTF as an unfairness cost term that can be added to any standard classification setting. Experiments show that OTF can be used to achieve an effective trade-off between predictive power and fairness.  ( 2 min )
    Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters. (arXiv:2202.03813v1 [stat.ML])
    This paper introduces a novel and generic framework to solve the flagship task of supervised labeled graph prediction by leveraging Optimal Transport tools. We formulate the problem as regression with the Fused Gromov-Wasserstein (FGW) loss and propose a predictive model relying on a FGW barycenter whose weights depend on inputs. First we introduce a non-parametric estimator based on kernel ridge regression for which theoretical results such as consistency and excess risk bound are proved. Next we propose an interpretable parametric model where the barycenter weights are modeled with a neural network and the graphs on which the FGW barycenter is calculated are additionally learned. Numerical experiments show the strength of the method and its ability to interpolate in the labeled graph space on simulated data and on a difficult metabolic identification problem where it can reach very good performance with very little engineering.  ( 2 min )
    Calibrated Learning to Defer with One-vs-All Classifiers. (arXiv:2202.03673v1 [cs.LG])
    The learning to defer (L2D) framework has the potential to make AI systems safer. For a given input, the system can defer the decision to a human if the human is more likely than the model to take the correct action. We study the calibration of L2D systems, investigating if the probabilities they output are sound. We find that Mozannar & Sontag's (2020) multiclass framework is not calibrated with respect to expert correctness. Moreover, it is not even guaranteed to produce valid probabilities due to its parameterization being degenerate for this purpose. We propose an L2D system based on one-vs-all classifiers that is able to produce calibrated probabilities of expert correctness. Furthermore, our loss function is also a consistent surrogate for multiclass L2D, like Mozannar & Sontag's (2020). Our experiments verify that not only is our system calibrated, but this benefit comes at no cost to accuracy. Our model's accuracy is always comparable (and often superior) to Mozannar & Sontag's (2020) model's in tasks ranging from hate speech detection to galaxy classification to diagnosis of skin lesions.  ( 2 min )

  • Open

    Are there models like "punctuation_en_bert" from Nvidia for other languages, in particular German? [D]
    Hi guys, I want to perform some analysis on transcripts that miss punctuation. The lack of it distorts the results, so I found that there is "punctuation_en_bert" from Nvidia that inserts punctuation back. It does a great great job for English. I need something like this for German as well. I wonder if that exists though. Can you point me in the right direction ? submitted by /u/vitotittotitto [link] [comments]  ( 1 min )
    [P] Clash Royale Build A Bot
    ​ An example of the state generated when the bot is playing Hello! I've been messing around with YOLOv5 recently, and put together a repo that can be used to build bots for the mobile game Clash Royale :) https://github.com/Pbatch/ClashRoyaleBuildABot The state generator is in a good place, but I'm struggling to build a bot that can climb out of the lower ranks. I don't think reinforcement learning will work, as you can't simulate the game locally (no self-play). Even with self-play, I'm not sure you conduct enough episodes on my compute budget! There is a paper (https://www.ijcai.org/proceedings/2019/0631.pdf) that tries to do it, but it is very limited (Fixed decks and battlefield). They also use a "simulation environment" (???), which I assume does not map perfectly to the real game. A rule-based algorithm could work, but I'm not a good enough player to know what these rules should be. Does anyone have had ideas/links to literature on solving these sorts of problem? Is the field advanced enough to tackle these sorts of games? Let me know your thoughts below, especially if you're also a Clash Royale player! Good luck if you decide to try and make a bot :) (Apologies to Mac and Linux users, the code only supports Windows atm) submitted by /u/Pbatch23888 [link] [comments]  ( 1 min )
    [D] Is neural network architecture just "alchemy"?
    I am trying to understand the "why" of neural network architecture. I've been reading papers and looking at winning solutions of big Kaggle competitions that use DNNs. When I look at the architectures they use, it feels completely arbitrary! This block over here, that block over there! Those two layers have a skip connection! Why?! (Below is my opinion of what ML looks like to an outsider who doesn't have inside information - it is most likely incorrect.) On Kaggle, you can get multiple DNN solutions using completely different architectures with varying degrees of complexity - all achieving identical performance. You'll have a "Ensemble of 27 63-layer DNN with skip connections and other arbitrary complexity" that slightly underperforms to "Simple Autoencoder attached to a simple MLP".…  ( 7 min )
    [D] Distributed Deep learning with low bandwidth
    If you had one 1 GPU in the UK, 1 in Germany and 1 in France all 3080s I'm assuming it's not worth using horovod to try to train a neural network. but if you had to what ways would I be able to optimise it. I know you can use gradient compression to reduce bandwidth but are there any other methods? submitted by /u/Mayfieldmobster [link] [comments]  ( 1 min )
    [R]Two Sparsities Are Better Than One: Unlocking the Performance Benefits of Sparse-Sparse Networks - Numenta Dec 2021
    Paper: https://arxiv.org/pdf/2112.13896.pdf Abstract: "In principle, sparse neural networks should be significantly more effi- cient than traditional dense networks. Neurons in the brain exhibit two types of sparsity; they are sparsely interconnected and sparsely active. These two types of sparsity, called weight sparsity and activation sparsity, when combined, offer the potential to reduce the computational cost of neural networks by two orders of magnitude. Despite this potential, today’s neural networks deliver only mod- est performance benefits using just weight sparsity, because traditional computing hardware cannot efficiently process sparse networks. In this article we introduce Complementary Sparsity, a novel technique that significantly improves the per- formance of dual sparse networks on existing hardware. We demonstrate that we can achieve high performance running weight-sparse networks, and we can mul- tiply those speedups by incorporating activation sparsity. Using Complementary Sparsity, we show up to 100X improvement in throughput and energy efficiency performing inference on FPGAs. We analyze scalability and resource tradeoffs for a variety of kernels typical of commercial convolutional networks such as ResNet-50 and MobileNetV2. Our results with Complementary Sparsity suggest that weight plus activation sparsity can be a potent combination for efficiently scaling future AI models. " https://preview.redd.it/9q9f6n3t1vg81.jpg?width=806&format=pjpg&auto=webp&s=f49358ef3f1da1b66e9cf8c11804c636a427c5d1 https://preview.redd.it/7po1zbet1vg81.jpg?width=796&format=pjpg&auto=webp&s=d308140484d5a9ba0a5edb1c9ae2f28974417485 https://preview.redd.it/cz67emut1vg81.jpg?width=790&format=pjpg&auto=webp&s=ac945dd7e7ca65f972cec57b5db55126661fcc54 submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    [D] How do I use SSL base models for supervised training?
    So I’m trying to use SimCLR with ResNet50 backbone but I don’t know how to use the ResNet50 backbone for supervised training after SSL training. I’m using the VISSL library by Facebook so it’s fairly black-boxed for me. Any tips? submitted by /u/fireless-phoenix [link] [comments]  ( 1 min )
    [R]Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity - Rice University and ThirdAI Corp.
    Paper: https://arxiv.org/pdf/2201.12667.pdf Abstract: " More than 70% of cloud computing is paid for but sits idle. A large fraction of these idle compute are cheap CPUs with few cores that are not utilized during the less busy hours. This paper aims to enable those CPU cycles to train heavyweight AI models. Our goal is against mainstream frameworks, which focus on leveraging expensive specialized ultra-high bandwidth interconnect to address the commu- nication bottleneck in distributed neural network training. This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth. We build upon the adaptive sparse training framework introduced by the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we demonstrate several orders of magnitude faster model parallel training than Horovod, the main engine behind most commercial software. We show that with reduced communication, due to sparsity, we can train close to a billion parameter model on simple 4-16 core CPU nodes connected by basic low bandwidth interconnect. Moreover, the training time is at par with some of the best hardware accelerators. " Article: https://www.nextplatform.com/2022/02/07/distributed-ai-training-seti-style-on-idle-cloud/ https://preview.redd.it/pute0fg3xug81.jpg?width=1156&format=pjpg&auto=webp&s=ec089b61d4865dc657b061ca6c4fa7c6ce0d3da9 https://preview.redd.it/rff69fg3xug81.jpg?width=1139&format=pjpg&auto=webp&s=a6c4d1843b7487a2dffe1791813d6b102940f87b submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    [N] Gran Turismo Sophy - AI agent created using Reinforcement Learning for racing games
    Here is very interesting application of modern AI techniques (they mention Reinforcement Learning) in video game industry. Described in short video: Sony AI: The Making of Gran Turismo Sophy. I remember similar project done for MotoGP 19, which in the end was deployed in the game and received positive reviews on Steam. Racing games seem to be perfect low-hanging fruit for applied RL research. I hope we're going to see more learning-based agents in video games this decade. The holy grail in this line of work would be embedding learning-based AI agents in open-world games. Is anyone researching this? How is it going? submitted by /u/arnowaczynski [link] [comments]  ( 1 min )
    [N] **CFP:** Special Session on Low-Resource ASR Development @ INTERSPEECH 2022
    Call for Papers We invite submission of original results or studies on automatic speech recognition (ASR) technologies for low-resource languages to the preliminarily accepted Low-Resource ASR Special Session at INTERSPEECH 2022. The special session aims to bring together researchers from all sectors working on ASR (Automatic Speech Recognition) for low-resource languages and dialects to discuss the state of the art and future directions. It will allow for fruitful exchanges between participants in low-resource ASR challenges and evaluations and other researchers working on low-resource ASR development. One such challenge is the OpenASR Challenge series conducted by NIST (National Institute of Standards and Technology) in coordination with IARPA’s (Intelligence Advanced Research Project…  ( 2 min )
    [D] request | Network architecture visualization tool
    Hi there, Recently I saw a neural network architecture visualization tool which was really cool somewhere in the last two weeks. I saw it as a demo video/gif either on Linked In (yes, eww) or here on reddit. It was a very modern/sleek looking representation of the network. But I can't find it for the life of me, and now I found a place where I want to apply it. You could basically build a representation of the architecture of your network in 3D, and you could view it from different angles etc and 'screenshot' svgs from it for your papers. The tool basically allowed you to build your neural network layer by layer, I'm not sure whether you would build it through code (LaTeX?) or through a UI though. One feature I am sure of though, is that it was a true 3D figure that the user could move around to show the best angle, not some faked '3D representation'. I'm sure at least some of you must have seen this too. ​ Please help me find this tool. -ES submitted by /u/Efficient-Shame [link] [comments]  ( 1 min )
    [R] Free book on Causal ML (w/code)!
    The author posted this: https://drive.switch.ch/index.php/s/tNhKQmkGB48bjfz. It's a book discussing causal inference, w/deep dives into ML tools for causal inference, as well as other topics (both more and less common). submitted by /u/bikeskata [link] [comments]
    [N] New Subreddit for Machine Learning Trick Derivations
    [N] We created a new subreddit where we concisely derive popular machine learning tricks. Please join us (https://www.reddit.com/r/MachineLearningDervs/). Also please let us know what tricks you want to have derived next! https://preview.redd.it/3trtnb8r2tg81.png?width=1445&format=png&auto=webp&s=9326ddac21fef29a604d6f6f5860003046843050 ​ Thanks a lot! View Poll submitted by /u/Ok_Can2425 [link] [comments]  ( 1 min )
    [R] Towards understanding deep learning with the natural clustering prior (PhD thesis)
    Hello fellow redditors. I'm happy to share my PhD thesis, after 6 years of work. It is relatively short (80 pages), and should be useful to anyone interested in thought-provoking experiments and intuitions around the training process and inner workings of deep neural networks. I'm open to any questions and feedback. Link: https://dial.uclouvain.be/pr/boreal/object/boreal:258239 ​ Long abstract: The prior knowledge (a.k.a. priors) integrated into the design of a machine learning system strongly influences its generalization abilities. In the specific context of deep learning, some of these priors are poorly understood as they implicitly emerge from the successful heuristics and tentative approximations of biological brains involved in deep learning design. Through the lens of supervise…  ( 3 min )
    [P] What we learned by accelerating by 5X Hugging Face generative language models
    2 trends ongoing in the NLP ecosystem: bigger language model and better text generation. Both are NLP game changers (zero shot, etc.) but they bring their own challenges: how to perform inference with them? At what cost? GPU or CPU ? etc. That’s what we worked on recently, and below you will find the main lessons learned : memory IO is by far the main perf bottleneck Standard API of ONNX Runtime should not be used but there is an undocumented way of using another ONNX Runtime API which works well Nvidia TensorRT is always the fastest option on GPU, by a large margin Caching K/V token representation do not bring any inference optimization (quite unexpected) Project: https://github.com/ELS-RD/transformer-deploy/ Notebook (reproduce measures): https://github.com/ELS-RD/transformer-…  ( 3 min )
    [D][R] How familiar are Machine Learning researchers with STOC/FOCS?
    Basically the title. How familiar are Machine Learning researchers with theory conferences? Would an average senior ML researcher realize how prestigious it is to have a (ML Theory) STOC/FOCS paper? submitted by /u/Existing-Map9316 [link] [comments]  ( 1 min )
    [D] Using Standard Scaler on Sparse PCA (SPCA) - what do you think ?
    Hi, I am new to SPCA and had a debate with a friend on whether we should use standard scaler (0 mean and 1 variance) on a dataset (e.g., the wine dataset) then we fit the model. My thinking (based on PCA -which I am more familiar with- not SPCA) was that variance captures/ indicates how much information we have, and if we scale the dataset we will lose that advantage. Is there any merit to use a Standard Scaler then fit the data? is this something related to SPCA that I do not know? [D] submitted by /u/elmsha [link] [comments]  ( 1 min )
    ANN backprop is linear, but biological neurons are not? [D]
    Hey all! My question is motivated by Towards Biologically Plausible Deep Learning on the first page they mention the backpropagation computation (coming down from the output layer to lower hidden layers) is purely linear, whereas biological neurons interleave linear and non-linear operations, but this is not expanded upon. Does anyone know: 1) in what sense are ANNs linear? Is it that the layers are always well defined and we move from layer L to layer L-1? 2) in what sense are biological neurons non-linear? Is it the case that for a given pathway certain neurons may not have spiked? I'm assuming there are more cases Unfortunately, the paper does not list sources for that statement so I'm not sure where to look submitted by /u/iamquah [link] [comments]  ( 4 min )
    [P] Training Vision Transformers with Only 2040 Images: Implementation
    Hello Everyone, This is an implementation of the recent paper "Training Vision Transformers with Only 2040 Images" in PyTorch. Thought it could be of interest to someone. Feel free to point out any mistakes. Post: https://niranjankrishna.in/2022/02/09/training-vision-transformers-with-only-2040-images-implementation/ submitted by /u/nirufeynman [link] [comments]  ( 1 min )
    [R] Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length
    submitted by /u/beluis3d [link] [comments]
  • Open

    Real World Example of Machine Learning on Rails
    submitted by /u/Kagermanov [link] [comments]
    Does the GPT iterations include
    nuances? Like a hunch, or instinct as it goes learning? Im sorry. I really dont know how it works. Like if it is watching a show, I mean when you watch a show, you get a hunch this certain character is up to no good. Does it have that angle or thing to learn like that? submitted by /u/RXPT [link] [comments]
    Awesome AI R&D content (with code!) on Computer Vision News of February 2022
    Many great articles about AI, Deep Learning, Computer Vision and more... HTML5 version (recommended) PDF version Dilbert on page 2. Free subscription on page 56. Enjoy! https://preview.redd.it/q2o15j16fug81.jpg?width=400&format=pjpg&auto=webp&s=5756ce26fd1c708a124535e60486e1d26caf9fd9 submitted by /u/Gletta [link] [comments]
    One-year AI Masters in Europe
    Hi all, I'm currently pursuing a MSc in Psychology and became highly interested in AI during this time. I'm currently finishing my thesis (AI application in Neuroscience ) and plan to pursue a second master's in AI, but do not necessarily want to spend two more years. Does anyone know of any good one-year programs, preferably in Europe? So far, the programs offered at Stockholm University and the University of Edinburgh seem quite good, but I would appreciate any further hint. Thanks a lot! submitted by /u/Jollifresh [link] [comments]  ( 1 min )
    Microsoft AI Research Introduces A New Reinforcement Learning Based Method, Called ‘Dead-end Discovery’ (DeD), To Identify the High-Risk States And Treatments In Healthcare Using Machine Learning
    A policy is a roadmap for the relationships between perception and action in a given context. It defines an agent’s behavior at any given point in time. Comparing reinforcement learning models for hyperparameter optimization is expensive and often impossible. As a result, on-policy interactions with the target environment are used to access the performance of these algorithms, which help in gaining insights into the type of policy that the agent is enforcing. However, it’s known as an off-policy when the performance is unaffected by the agent’s actions. Off-policy Reinforcement Learning (RL) separates behavioral policies that generate experience from the target policy that seeks optimality. It also allows for learning several target policies with distinct aims using the same data stream or prior experience. Continue Reading Paper: https://proceedings.neurips.cc/paper/2021/file/26405399c51ad7b13b504e74eb7c696c-Paper.pdf Github: https://github.com/microsoft/med-deadend ​ https://preview.redd.it/678zwzg98ug81.png?width=2048&format=png&auto=webp&s=b8bc2e86f2ae4a6ee98070feef7364124ef455d4 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    All thought no real background in Machine Learning but id like to one day. This is just a broader aspect of looking at what consciousness is. Thank you for reading if you did. Not to be biased this is a opinion alone. Open to open minds!
    Multidimensional consciousness and our chakras “Dimensions” are a means of organizing different planes of existence according to their vibratory rate. Each dimension has certain sets of laws and principles that are specific to the frequency of that dimension. “Consciousness” represents awareness. The inhabitants of each dimension function clearly, easily, and with a minimum of resistance withinthat plane because their consciousness vibrates in resonance with the frequency of that dimension. “Multidimensional Consciousness” is the ability to be “conscious” of more than one dimension. To be multidimensional in our consciousness we must remember that we have within us the potential to expand our perceptual awareness to the dimensions above and below our physical plane. “Unconscious” means unaware of and unable to attend to internal and/or external stimuli within the inhabitants’ own dimension or within another dimension. Third dimensional humans are largely unaware of their first dimensional, second dimensional, and fourth dimensional selves. The human unconscious is best accessed through physical body messages, introspection, dreams, and meditation. “Conscious” means aware of and able to attend to stimuli within the inhabitants’ own dimension. The third dimensional self is conscious of what can be perceived by the five physical senses of sight, hearing, touch, taste, and smell. “Superconscious” is a higher order of consciousness of the fifth dimension and above in which the inhabitants are able to be aware of and attend stimuli of their own dimension as well as all the lower dimensions. The superconscious is innately multidimensional. The third dimensional self can become “conscious” of the superconscious through meditation, prayer, and by surrendering to the enfoldment of the higher order consciousness. submitted by /u/Slight_Lunch4786 [link] [comments]  ( 2 min )
    Everyone has heard of AI making leaps and bound upgrades in multiple industries, but what is this cognitive growth in human evolution doing in 'Cognitive Science' space itself? Let's hear it from Jyot Rekhi, Founder of WAI, who has used AI for the Cognitive Science space for the first time ever.
    submitted by /u/jr_planet_earth [link] [comments]  ( 1 min )
    8 tools anyone in AI should know!
    submitted by /u/OnlyProggingForFun [link] [comments]
    Tiny ML for Big Hearts on an 8-bit Microcontroller Predict the possibility of arrhythmias on an 8- bit Microcontroller, without sending the corresponding sensor data to the cloud.
    Things used in this project Hardware components: Arduino Mega 2560 Software apps and online services: Neuton Tiny Machine Learning Story In the course of the pandemic, the interest in creating more innovative medical devices has run high, as recent years showed how unpredictable the situation in healthcare can be. Never before have we faced such an acute need for masks, ventilators, oxygen cylinders, and other must-have devices to conquer the pandemic. All this has become a trigger to develop devices that can work autonomously for a long time, without access to the internet or cloud, just on batteries with ultra-low power consumption. And most importantly, it's vital that such devices can be made by a wider range of people, even without in-depth technical skills. Perhaps, you’ve he…  ( 5 min )
    Consciousness as a process in the brain. An implementable definition of consciousness
    submitted by /u/agoramancy [link] [comments]  ( 2 min )
    AI Momentum in Global Retail
    Global retail has been a rapidly growing force since the turn of the 21st century. In Retail Innovation, we have gone from televised catalogs to Amazon’s adaptive homepage on click of our fingers. Competition is on an all-time high spurred by technology like AI in the retail market. That is why the global Retail Business, is investing in developing artificial intelligence in the retail industry. The benefits of AI in retail include market research, product development and review, custom advertising and suggestion for buyers, and strategic business planning, among others. The Retail Technology focuses on predicting purchase patterns ahead of main events and holidays such as Christmas Festival or Black Friday. By analyzing the metrics, AI potentially transforms high Retail Optimization for the future in retail with better results for clients. Retail Industry and AI-driven big data analysis will also boost operations in core retail software for example CRM etc. Retailer Shop can amplify click-through rates online, as well as the lead conversion from cold calling by using targeted data sets. Among other examples of AI in retail, Retail Stores can ease the purchase process by helping buyers finalize their payments faster. Dedicated merchandise Retail Shop or Retail Store linked with live streaming channels can automatically advertise product suggestions across platforms like Youtube, Vimeo, etc. The future of AI and MLOps in retail is significant for e-commerce businesses to survive is actively cultivating AI practices like predictive analysis in retail for product demands, pricing patterns, potential product brand/line collaborations, and market acquisitions. The retail giants use IRL sensors and cameras to understand consumer behavior paired with real-time analytics identifying out-of-stock products, an example of AI making a massive impact on the Retail Industry. submitted by /u/futureanalytica [link] [comments]  ( 1 min )
    Any good resources to learn Markov Decision Process?
    I am learning MDP, and I'm struggling to understand how all those different terms and concepts come together. Like V(s), G(t) etc ​ I've watched a bunch of youtube tutorials but I still can't seem to understand them, especially for the formulas and equations. Tried referring to the AI book but it doesn't help as well. submitted by /u/cocag13996 [link] [comments]  ( 1 min )
    My Ode to Coleridge - Kubla Khan, generated using CLIP guided diffusion
    submitted by /u/twm7 [link] [comments]
    Artificial Nightmares : James Webb Telescope Images 02 || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    For those who get overwhelmed by the maths for machine learning. 👇
    submitted by /u/mr-minion [link] [comments]  ( 1 min )
    Overview of AI Recommendation systems
    submitted by /u/beluis3d [link] [comments]
    Deeplearning & cancer
    I just finished my Course in deeplearning and I want to use it in medical path so I searched and found that i can use it in cancer treatment so I bought a book about cancer biology do I move in the right direction ..is that a good starting point or that it is just enthusiasm ...do I put pressure on myself or How should I go in that path? submitted by /u/engkhaledeisa [link] [comments]  ( 1 min )
  • Open

    Anybody using Robomimic?
    I'm looking into Robomimic (https://arise-initiative.github.io/robomimic-web/docs/introduction/overview.html), since I need to perform some imitation learning and offline reinforcement learning on manipulators. The framework looks good, even though still unpolished. Any feedback on it? What you don't like? Any better alternative? submitted by /u/lorepieri [link] [comments]
    What benefits RL can bring over DL for stock trading?
    I red some research papers on RL being used for stock trading but I can't imagine what value it can bring other that making live decisions. Do you have any idea? Can it really solve the environment? submitted by /u/UniqueAway [link] [comments]  ( 2 min )
    What are your strategies in training RL for a new task (custom env)?
    When I try to train RL with entirely new task, this is what I do: - Start training - After 500k steps (or any reasonable number to the task complexity), if the agent seems to learn nothing, go back to make some changes in reward function or look for any potential bugs. - If there is a sign of conversion, keep the same set of hyper parameters, and train a few more times - If the agent indeed learns something, start tuning parameters. Repeat same above steps with other SOTA algorithms, compare performances, and select the best. Do you guys think this is good strategies? I hope someone could give advice to save more time spent on training RL. Thanks in advance! submitted by /u/Rich_Beautiful [link] [comments]  ( 1 min )
    Microsoft AI Research Introduces A New Reinforcement Learning Based Method, Called ‘Dead-end Discovery’ (DeD), To Identify the High-Risk States And Treatments In Healthcare Using Machine Learning
    A policy is a roadmap for the relationships between perception and action in a given context. It defines an agent’s behavior at any given point in time. Comparing reinforcement learning models for hyperparameter optimization is expensive and often impossible. As a result, on-policy interactions with the target environment are used to access the performance of these algorithms, which help in gaining insights into the type of policy that the agent is enforcing. However, it’s known as an off-policy when the performance is unaffected by the agent’s actions. Off-policy Reinforcement Learning (RL) separates behavioral policies that generate experience from the target policy that seeks optimality. It also allows for learning several target policies with distinct aims using the same data stream or prior experience. Continue Reading Paper: https://proceedings.neurips.cc/paper/2021/file/26405399c51ad7b13b504e74eb7c696c-Paper.pdf Github: https://github.com/microsoft/med-deadend ​ https://preview.redd.it/epoqyiu68ug81.png?width=2048&format=png&auto=webp&s=e0d527f005bc1776c121498b43d457806e128961 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    _Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution_, Rechenberg 1973
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Entropy Regularization in Policy Gradient (A2C,PPO,SAC). Simple but vague question
    Dear engineers, scientists and interested people, Could somebody explain about dimension of this term: Q - log(pi(a|s) Let say in mini-batch we have 100 samples, Q is 1 value, then we have Q's shape (100,1) but log(pi(a|s) or log_prob for continuous action space produce (mu-actions)^2/(2*var) all parts depends on action space and it will produce (100,n), where n is action dimension. So, Q - log(pi(a|s) will produce (100,n) shaped value. Is it normal? In calculating batch loss how it will affect weights of neural network? Should not we reduce log(pi(a|s) to mean over actions or sum over actions (sum seems reasonable if probabilities sums to 1)? submitted by /u/Timur_1988 [link] [comments]  ( 1 min )
    Mujoco vs Pybullet for closed loop chain environment
    I am trying to modify the existing InvertedDoublePendulum environment such that it becomes a closed loop chain, by essentially adding another cart with poles joined to the 1st carts. However there seems to be some issue with the constraint parameter in fixing both poles together as it still separates during test-time (somehow). I was wondering if Pybullet might be better for modelling such environments? Does anyone have any experience in such matters? submitted by /u/Lifeisunfair210 [link] [comments]  ( 1 min )
    How does an AI Recommendation engine work?
    submitted by /u/beluis3d [link] [comments]
    Understanding PPO update
    Hello, I have a question regarding the advantage calculation in PPO. Let us assume that the agent takes 1000 steps in the environment before each PPO update, and that an episode has a maximum length of 100. If the agent never finishes an episode (so basically the agent goes throw 10 episodes before PPO update), do we still reset the discounted reward every 100 steps when computing the vector of discounted rewards (of length 1000)? As per the code below, currently I reset the discounted reward when is_terminal is met, which is True only when the agent successfully finishes the episode, not when the episode limit is met. Is this wrong? https://preview.redd.it/k6b6d7nbhpg81.png?width=796&format=png&auto=webp&s=9a461c021124f94af064f25932454741e9c52404 submitted by /u/AhmedNizam_ [link] [comments]  ( 1 min )
  • Open

    Nested Hierarchical Transformer: Towards Accurate, Data-Efficient, and Interpretable Visual Understanding
    Posted by Zizhao Zhang, Software Engineer, Google Research, Cloud AI Team In visual understanding, the Visual Transformer (ViT) and its variants have received significant attention recently due to their superior performance on many core visual applications, such as image classification, object detection, and video understanding. The core idea of ViT is to utilize the power of self-attention layers to learn global relationships between small patches of images. However, the number of connections between patches increases quadratically with image size. Such a design has been observed to be data inefficient — although the original ViT can perform better than convolutional networks with hundreds of millions of images for pre-training, such a data requirement is not always practical, and it sti…  ( 8 min )
  • Open

    Extract entities from insurance documents using Amazon Comprehend named entity recognition
    Intelligent document processing (IDP) is a common use case for customers on AWS. You can utilize Amazon Comprehend and Amazon Textract for a variety of use cases ranging from document extraction, data classification, and entity extraction. One specific industry that uses IDP is insurance. They use IDP to automate data extraction for common use cases such as claims intake, […]  ( 8 min )
  • Open

    Chaos in the frequency domain
    Solutions to the non-linear differential equation x ″ + 0.25x ′ + x(x² – 1) = 0.3 cos t are chaotic. It’s more common to see plots of chaotic systems in the time domain, so I wanted to write a post looking at the power spectrum in the frequency domain. The following plot was created by solving […] Chaos in the frequency domain first appeared on John D. Cook.  ( 1 min )
    The tent map
    Yesterday I said that Lyapunov exponents can’t be calculated exactly except in the case of toy problems. One such toy model is the tent map. The graph of function on the right hand side looks a tent. It’s zero at x = 0 and at x = 1 and rises to a height of r […] The tent map first appeared on John D. Cook.  ( 2 min )
    Aliasing in a nutshell
    Suppose you have a sine wave with frequency f0 Hz. We’re going to discretize this signal by sampling it fs times per second. That is, we’re going to evaluate S at integer multiples of The result is the sequence where n runs through the integers. Next, let k be an integer and consider the sine […] Aliasing in a nutshell first appeared on John D. Cook.  ( 1 min )
  • Open

    Startup Taps Finance Micromodels for Data Annotation Automation
    After meeting at an entrepreneur matchmaking event, Ulrik Hansen and Eric Landau teamed up to parlay their experience in financial trading systems into a platform for faster data labeling. In 2020, the pair of finance industry veterans founded Encord to adapt micromodels typical in finance to automated data annotation. Micromodels are neural networks that require Read article > The post Startup Taps Finance Micromodels for Data Annotation Automation appeared first on The Official NVIDIA Blog.  ( 3 min )
  • Open

    Comments, docstrings, and type hints in Python code
    The source code of a program should be readable to human. Making it run correctly is only half of its […] The post Comments, docstrings, and type hints in Python code appeared first on Machine Learning Mastery.  ( 13 min )
  • Open

    Bridging Data Center AI Systems with Edge Computing for Actionable Information Retrieval. (arXiv:2105.13967v3 [cs.LG] UPDATED)
    Extremely high data rates at modern synchrotron and X-ray free-electron laser light source beamlines motivate the use of machine learning methods for data reduction, feature detection, and other purposes. Regardless of the application, the basic concept is the same: data collected in early stages of an experiment, data from past similar experiments, and/or data simulated for the upcoming experiment are used to train machine learning models that, in effect, learn specific characteristics of those data; these models are then used to process subsequent data more efficiently than would general-purpose models that lack knowledge of the specific dataset or data class. Thus, a key challenge is to be able to train models with sufficient rapidity that they can be deployed and used within useful timescales. We describe here how specialized data center AI (DCAI) systems can be used for this purpose through a geographically distributed workflow. Experiments show that although there are data movement cost and service overhead to use remote DCAI systems for DNN training, the turnaround time is still less than 1/30 of using a locally deploy-able GPU.  ( 2 min )
    An end-to-end deep learning approach for extracting stochastic dynamical systems with $\alpha$-stable L\'evy noise. (arXiv:2201.13114v2 [stat.ML] UPDATED)
    Recently, extracting data-driven governing laws of dynamical systems through deep learning frameworks has gained a lot of attention in various fields. Moreover, a growing amount of research work tends to transfer deterministic dynamical systems to stochastic dynamical systems, especially those driven by non-Gaussian multiplicative noise. However, lots of log-likelihood based algorithms that work well for Gaussian cases cannot be directly extended to non-Gaussian scenarios which could have high error and low convergence issues. In this work, we overcome some of these challenges and identify stochastic dynamical systems driven by $\alpha$-stable L\'evy noise from only random pairwise data. Our innovations include: (1) designing a deep learning approach to learn both drift and diffusion terms for L\'evy induced noise with $\alpha$ across all values, (2) learning complex multiplicative noise without restrictions on small noise intensity, (3) proposing an end-to-end complete framework for stochastic systems identification under a general input data assumption, that is, $\alpha$-stable random variable. Finally, numerical experiments and comparisons with the non-local Kramers-Moyal formulas with moment generating function confirm the effectiveness of our method.  ( 2 min )
    Memorization and Generalization in Neural Code Intelligence Models. (arXiv:2106.08704v2 [cs.LG] UPDATED)
    Deep Neural Networks are capable of learning highly generalizable patterns from large datasets of source code through millions of parameters. This large capacity also renders them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. Unfortunately, most code intelligence tasks rely on rather noise-prone and repetitive data sources, such as code from GitHub. Given the sheer size of such corpora, determining the role and extent of noise in these is beyond manual inspection. In this paper, we propose an alternative analysis: we evaluate the impact of the noise on training neural models of source code by introducing targeted noise to the dataset of several state-of-the-art neural intelligence models and benchmarks based on Java and Python codebases. By studying the resulting behavioral changes at various rates of noise, and across a wide range of metrics, we can characterize both typical generalizing and problematic memorization-like learning of models of source code. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. At the same time, the metrics used to analyze this phenomenon proved surprisingly useful for detecting and quantifying such effects, offering a powerful toolset for creating reliable models of code.  ( 2 min )
    Multidimensional Cybersecurity Framework for Strategic Foresight. (arXiv:2202.02537v1 [cs.CR])
    Cybersecurity is now at the forefront of most organisational digital transformative agendas and National economic, social and political programmes. Hence its impact to society can no longer be seen to be one dimensional. The rise in National cybersecurity laws and regulations is a good indicator of its perceived importance to nations. And the recent awakening for social and ethical transparency in society and coupled with sustainability issues demonstrate the need for a paradigm shift in how cybersecurity discourses can now happen. In response to this shift, a multidimensional cybersecurity framework for strategic foresight underpinned on situational awareness is proposed. The conceptual cybersecurity framework comprising six domains such as Physical, Cultural, Economic, Social, Political and Cyber, is discussed. The guiding principles underpinning the framework are outlined, followed by in-depth reflection on the Business, Operational, Technological and Human (BOTH) factors and their implications for strategic foresight for cybersecurity.  ( 2 min )
    Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity. (arXiv:2106.14568v4 [cs.LG] UPDATED)
    The success of deep ensembles on improving predictive performance, uncertainty estimation, and out-of-distribution robustness has been extensively studied in the machine learning literature. Albeit the promising results, naively training multiple deep neural networks and combining their predictions at inference leads to prohibitive computational costs and memory requirements. Recently proposed efficient ensemble approaches reach the performance of the traditional deep ensembles with significantly lower costs. However, the training resources required by these approaches are still at least the same as training a single dense model. In this work, we draw a unique connection between sparse neural network training and deep ensembles, yielding a novel efficient ensemble learning framework called FreeTickets. Instead of training multiple dense networks and averaging them, we directly train sparse subnetworks from scratch and extract diverse yet accurate subnetworks during this efficient, sparse-to-sparse training. Our framework, FreeTickets, is defined as the ensemble of these relatively cheap sparse subnetworks. Despite being an ensemble method, FreeTickets has even fewer parameters and training FLOPs than a single dense model. This seemingly counter-intuitive outcome is due to the ultra training/inference efficiency of dynamic sparse training. FreeTickets surpasses the dense baseline in all the following criteria: prediction accuracy, uncertainty estimation, out-of-distribution (OoD) robustness, as well as efficiency for both training and inference. Impressively, FreeTickets outperforms the naive deep ensemble with ResNet50 on ImageNet using around only 1/5 of the training FLOPs required by the latter. We have released our source code at https://github.com/VITA-Group/FreeTickets.  ( 3 min )
    Bregman Plug-and-Play Priors. (arXiv:2202.02388v1 [eess.IV])
    The past few years have seen a surge of activity around integration of deep learning networks and optimization algorithms for solving inverse problems. Recent work on plug-and-play priors (PnP), regularization by denoising (RED), and deep unfolding has shown the state-of-the-art performance of such integration in a variety of applications. However, the current paradigm for designing such algorithms is inherently Euclidean, due to the usage of the quadratic norm within the projection and proximal operators. We propose to broaden this perspective by considering a non-Euclidean setting based on the more general Bregman distance. Our new Bregman Proximal Gradient Method variant of PnP (PnP-BPGM) and Bregman Steepest Descent variant of RED (RED-BSD) replace the traditional updates in PnP and RED from the quadratic norms to more general Bregman distance. We present a theoretical convergence result for PnP-BPGM and demonstrate the effectiveness of our algorithms on Poisson linear inverse problems.  ( 2 min )
    Learning Synthetic Environments and Reward Networks for Reinforcement Learning. (arXiv:2202.02790v1 [cs.LG])
    We introduce Synthetic Environments (SEs) and Reward Networks (RNs), represented by neural networks, as proxy environment models for training Reinforcement Learning (RL) agents. We show that an agent, after being trained exclusively on the SE, is able to solve the corresponding real environment. While an SE acts as a full proxy to a real environment by learning about its state dynamics and rewards, an RN is a partial proxy that learns to augment or replace rewards. We use bi-level optimization to evolve SEs and RNs: the inner loop trains the RL agent, and the outer loop trains the parameters of the SE / RN via an evolution strategy. We evaluate our proposed new concept on a broad range of RL algorithms and classic control environments. In a one-to-one comparison, learning an SE proxy requires more interactions with the real environment than training agents only on the real environment. However, once such an SE has been learned, we do not need any interactions with the real environment to train new agents. Moreover, the learned SE proxies allow us to train agents with fewer interactions while maintaining the original task performance. Our empirical results suggest that SEs achieve this result by learning informed representations that bias the agents towards relevant states. Moreover, we find that these proxies are robust against hyperparameter variation and can also transfer to unseen agents.  ( 2 min )
    A new similarity measure for covariate shift with applications to nonparametric regression. (arXiv:2202.02837v1 [math.ST])
    We study covariate shift in the context of nonparametric regression. We introduce a new measure of distribution mismatch between the source and target distributions that is based on the integrated ratio of probabilities of balls at a given radius. We use the scaling of this measure with respect to the radius to characterize the minimax rate of estimation over a family of H\"older continuous functions under covariate shift. In comparison to the recently proposed notion of transfer exponent, this measure leads to a sharper rate of convergence and is more fine-grained. We accompany our theory with concrete instances of covariate shift that illustrate this sharp difference.  ( 2 min )
    Efficient Local Planning with Linear Function Approximation. (arXiv:2108.05533v3 [cs.LG] UPDATED)
    We study query and computationally efficient planning algorithms with linear function approximation and a simulator. We assume that the agent only has local access to the simulator, meaning that the agent can only query the simulator at states that have been visited before. This setting is more practical than many prior works on reinforcement learning with a generative model. We propose two algorithms, named confident Monte Carlo least square policy iteration (Confident MC-LSPI) and confident Monte Carlo Politex (Confident MC-Politex) for this setting. Under the assumption that the Q-functions of all policies are linear in known features of the state-action pairs, we show that our algorithms have polynomial query and computational costs in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while these costs are independent of the size of the state space. One technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on $\ell_\infty$-bounded approximate policy iteration to show that our algorithm can learn the optimal policy for the given initial state even only with local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.  ( 2 min )
    Human rights, democracy, and the rule of law assurance framework for AI systems: A proposal. (arXiv:2202.02776v1 [cs.AI])
    Following on from the publication of its Feasibility Study in December 2020, the Council of Europe's Ad Hoc Committee on Artificial Intelligence (CAHAI) and its subgroups initiated efforts to formulate and draft its Possible Elements of a Legal Framework on Artificial Intelligence, based on the Council of Europe's standards on human rights, democracy, and the rule of law. This document was ultimately adopted by the CAHAI plenary in December 2021. To support this effort, The Alan Turing Institute undertook a programme of research that explored the governance processes and practical tools needed to operationalise the integration of human right due diligence with the assurance of trustworthy AI innovation practices. The resulting framework was completed and submitted to the Council of Europe in September 2021. It presents an end-to-end approach to the assurance of AI project lifecycles that integrates context-based risk analysis and appropriate stakeholder engagement with comprehensive impact assessment, and transparent risk management, impact mitigation, and innovation assurance practices. Taken together, these interlocking processes constitute a Human Rights, Democracy and the Rule of Law Assurance Framework (HUDERAF). The HUDERAF combines the procedural requirements for principles-based human rights due diligence with the governance mechanisms needed to set up technical and socio-technical guardrails for responsible and trustworthy AI innovation practices. Its purpose is to provide an accessible and user-friendly set of mechanisms for facilitating compliance with a binding legal framework on artificial intelligence, based on the Council of Europe's standards on human rights, democracy, and the rule of law, and to ensure that AI innovation projects are carried out with appropriate levels of public accountability, transparency, and democratic governance.  ( 2 min )
    CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. (arXiv:2107.08760v1 [cs.SE] CROSS LISTED)
    Data-driven research on the automated discovery and repair of security vulnerabilities in source code requires comprehensive datasets of real-life vulnerable code and their fixes. To assist in such research, we propose a method to automatically collect and curate a comprehensive vulnerability dataset from Common Vulnerabilities and Exposures (CVE) records in the public National Vulnerability Database (NVD). We implement our approach in a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes. The CVEfixes collection tool automatically fetches all available CVE records from the NVD, gathers the vulnerable code and corresponding fixes from associated open-source repositories, and organizes the collected information in a relational database. Moreover, the dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction. The collection can easily be repeated to keep up-to-date with newly discovered or patched vulnerabilities. The initial release of CVEfixes spans all published CVEs up to 9 June 2021, covering 5365 CVE records for 1754 open-source projects that were addressed in a total of 5495 vulnerability fixing commits. CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.  ( 2 min )
    AutoBERT-Zero: Evolving BERT Backbone from Scratch. (arXiv:2107.07445v2 [cs.CL] UPDATED)
    Transformer-based pre-trained language models like BERT and its variants have recently achieved promising performance in various natural language processing (NLP) tasks. However, the conventional paradigm constructs the backbone by purely stacking the manually designed global self-attention layers, introducing inductive bias and thus leads to sub-optimal. In this work, we make the first attempt to automatically discover novel pre-trained language model (PLM) backbone on a flexible search space containing the most fundamental operations from scratch. Specifically, we propose a well-designed search space which (i) contains primitive math operations in the intra-layer level to explore novel attention structures, and (ii) leverages convolution blocks to be the supplementary for attentions in the inter-layer level to better learn local dependency. To enhance the efficiency for finding promising architectures, we propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm, which optimizes both the search algorithm and evaluation of candidate models. Specifically, we propose Operation-Priority (OP) evolution strategy to facilitate model search via balancing exploration and exploitation. Furthermore, we design a Bi-branch Weight-Sharing (BIWS) training strategy for fast model evaluation. Extensive experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks, proving the architecture's transfer and scaling abilities. Remarkably, AutoBERT-Zero-base outperforms RoBERTa-base (using much more data) and BERT-large (with much larger model size) by 2.4 and 1.4 higher score on GLUE test set.  ( 2 min )
    Learning to be a Statistician: Learned Estimator for Number of Distinct Values. (arXiv:2202.02800v1 [cs.LG])
    Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload agnostic, in the sense that the model/estimator can be trained with synthetically generated training data, and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds on CPU) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our code for training data generation, model training, and the learned estimator online for reproducibility.  ( 2 min )
    On the Global Convergence of Momentum-based Policy Gradient. (arXiv:2110.10116v2 [cs.LG] UPDATED)
    Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning due to their relative stability and incremental nature. In recent years, the empirical success of PG methods has led to the development of a theoretical foundation for these methods. In this work, we generalize this line of research by studying the global convergence of stochastic PG methods with momentum terms, which have been demonstrated to be efficient recipes for improving PG methods. We study both the soft-max and the Fisher-non-degenerate policy parametrizations, and show that adding a momentum improves the global optimality sample complexity of vanilla PG methods by $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ and $\tilde{\mathcal{O}}(\epsilon^{-1})$, respectively, where $\epsilon>0$ is the target tolerance. Our work is the first one that obtains global convergence results for the momentum-based PG methods. For the generic Fisher-non-degenerate policy parametrizations, our result is the first single-loop and finite-batch PG algorithm achieving $\tilde{O}(\epsilon^{-3})$ global optimality sample complexity. Finally, as a by-product, our methods also provide general framework for analyzing the global convergence rates of stochastic PG methods, which can be easily applied and extended to different PG estimators.  ( 2 min )
    Hedging using reinforcement learning: Contextual $k$-Armed Bandit versus $Q$-learning. (arXiv:2007.01623v2 [cs.LG] UPDATED)
    The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. In real markets, continuous replication, such as in the model of Black, Scholes and Merton (BSM), is not only unrealistic but it is also undesirable due to high transaction costs. A variety of methods have been proposed to balance between effective replication and losses in the incomplete market setting. With the rise of Artificial Intelligence (AI), AI-based hedgers have attracted considerable interest, where particular attention was given to Recurrent Neural Network systems and variations of the $Q$-learning algorithm. From a practical point of view, sufficient samples for training such an AI can only be obtained from a simulator of the market environment. Yet if an agent was trained solely on simulated data, the run-time performance will primarily reflect the accuracy of the simulation, which leads to the classical problem of model choice and calibration. In this article, the hedging problem is viewed as an instance of a risk-averse contextual $k$-armed bandit problem, which is motivated by the simplicity and sample-efficiency of the architecture. This allows for realistic online model updates from real-world data. We find that the $k$-armed bandit model naturally fits to the Profit and Loss formulation of hedging, providing for a more accurate and sample efficient approach than $Q$-learning and reducing to the Black-Scholes model in the absence of transaction costs and risks.  ( 2 min )
    Logarithmic Unbiased Quantization: Simple 4-bit Training in Deep Learning. (arXiv:2112.10769v2 [cs.LG] UPDATED)
    Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers. In this work, we examine the importance of having unbiased quantization in quantized neural network training, where to maintain it, and how. Based on this, we suggest a $\textit{logarithmic unbiased quantization}$ (LUQ) method to quantize both the forward and backward phase to 4-bit, achieving state-of-the-art results in 4-bit training without overhead. For example, in ResNet50 on ImageNet, we achieved a degradation of 1.18%. We further improve this to degradation of only 0.64% after a single epoch of high precision fine-tuning combined with a variance reduction method -- both add overhead comparable to previously suggested methods. Finally, we suggest a method that uses the low precision format to avoid multiplications during two-thirds of the training process, thus reducing by 5x the area used by the multiplier.  ( 2 min )
    Learning Curves for SGD on Structured Features. (arXiv:2106.02713v4 [stat.ML] UPDATED)
    The generalization performance of a machine learning algorithm such as a neural network depends in a non-trivial way on the structure of the data distribution. To analyze the influence of data structure on test loss dynamics, we study an exactly solveable model of stochastic gradient descent (SGD) on mean square loss which predicts test loss when training on features with arbitrary covariance structure. We solve the theory exactly for both Gaussian features and arbitrary features and we show that the simpler Gaussian model accurately predicts test loss of nonlinear random-feature models and deep neural networks trained with SGD on real datasets such as MNIST and CIFAR-10. We show that the optimal batch size at a fixed compute budget is typically small and depends on the feature correlation structure, demonstrating the computational benefits of SGD with small batch sizes. Lastly, we extend our theory to the more usual setting of stochastic gradient descent on a fixed subsampled training set, showing that both training and test error can be accurately predicted in our framework on real data.  ( 2 min )
    Confidence-Aware Subject-to-Subject Transfer Learning for Brain-Computer Interface. (arXiv:2112.09243v2 [cs.HC] UPDATED)
    The inter/intra-subject variability of electroencephalography (EEG) makes the practical use of the brain-computer interface (BCI) difficult. In general, the BCI system requires a calibration procedure to tune the model every time the system is used. This problem is recognized as a major obstacle to BCI, and to overcome it, approaches based on transfer learning (TL) have recently emerged. However, many BCI paradigms are limited in that they consist of a structure that shows labels first and then measures "imagery", the negative effects of source subjects containing data that do not contain control signals have been ignored in many cases of the subject-to-subject TL process. The main purpose of this paper is to propose a method of excluding subjects that are expected to have a negative impact on subject-to-subject TL training, which generally uses data from as many subjects as possible. In this paper, we proposed a BCI framework using only high-confidence subjects for TL training. In our framework, a deep neural network selects useful subjects for the TL process and excludes noisy subjects, using a co-teaching algorithm based on the small-loss trick. We experimented with leave-one-subject-out validation on two public datasets (2020 international BCI competition track 4 and OpenBMI dataset). Our experimental results showed that confidence-aware TL, which selects subjects with small loss instances, improves the generalization performance of BCI.  ( 2 min )
    Principal Components Bias in Over-parameterized Linear Models, and its Manifestation in Deep Neural Networks. (arXiv:2105.05553v6 [cs.LG] UPDATED)
    Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our asymptotic analysis, assuming that the hidden layers are wide enough, reveals that the convergence rate of this model's parameters is exponentially faster along the directions of the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). Empirically, we show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently at earlier stages of learning. We then compare our results to the simplicity bias, showing that both biases can be seen independently, and affect the order of learning in different ways. Finally, we discuss how the PC-bias may explain some benefits of early stopping and its connection to PCA, and why deep networks converge more slowly with random labels.  ( 2 min )
    Causal Inference Using Tractable Circuits. (arXiv:2202.02891v1 [cs.AI])
    The aim of this paper is to discuss a recent result which shows that probabilistic inference in the presence of (unknown) causal mechanisms can be tractable for models that have traditionally been viewed as intractable. This result was reported recently to facilitate model-based supervised learning but it can be interpreted in a causality context as follows. One can compile a non-parametric causal graph into an arithmetic circuit that supports inference in time linear in the circuit size. The circuit is also non-parametric so it can be used to estimate parameters from data and to further reason (in linear time) about the causal graph parametrized by these estimates. Moreover, the circuit size can sometimes be bounded even when the treewidth of the causal graph is not, leading to tractable inference on models that have been deemed intractable previously. This has been enabled by a new technique that can exploit causal mechanisms computationally but without needing to know their identities (the classical setup in causal inference). Our goal is to provide a causality-oriented exposure to these new results and to speculate on how they may potentially contribute to more scalable and versatile causal inference.  ( 2 min )
    A Discourse on MetODS: Meta-Optimized Dynamical Synapses for Meta-Reinforcement Learning. (arXiv:2202.02363v1 [cs.LG])
    Recent meta-reinforcement learning work has emphasized the importance of mnemonic control for agents to quickly assimilate relevant experience in new contexts and suitably adapt their policy. However, what computational mechanisms support flexible behavioral adaptation from past experience remains an open question. Inspired by neuroscience, we propose MetODS (for Meta-Optimized Dynamical Synapses), a broadly applicable model of meta-reinforcement learning which leverages fast synaptic dynamics influenced by action-reward feedback. We develop a theoretical interpretation of MetODS as a model learning powerful control rules in the policy space and demonstrate empirically that robust reinforcement learning programs emerge spontaneously from them. We further propose a formalism which efficiently optimizes the meta-parameters governing MetODS synaptic processes. In multiple experiments and domains, MetODS outperforms or compares favorably with previous meta-reinforcement learning approaches. Our agents can perform one-shot learning, approaches optimal exploration/exploitation strategies, generalize navigation principles to unseen environments and demonstrate a strong ability to learn adaptive motor policies.  ( 2 min )
    Tensor-CSPNet: A Novel Geometric Deep Learning Framework for Motor Imagery Classification. (arXiv:2202.02472v1 [eess.SP])
    Deep learning (DL) has been widely investigated in a vast majority of applications in electroencephalography (EEG)-based brain-computer interfaces (BCIs), especially for motor imagery (MI) classification in the past five years. The mainstream DL methodology for the MI-EEG classification exploits the temporospatial patterns of EEG signals using convolutional neural networks (CNNs), which have been particularly successful in visual images. However, since the statistical characteristics of visual images may not benefit EEG signals, a natural question that arises is whether there exists an alternative network architecture despite CNNs to extract features for the MI-EEG classification. To address this question, we propose a novel geometric deep learning (GDL) framework called Tensor-CSPNet to characterize EEG signals on symmetric positive definite (SPD) manifolds and exploit the temporo-spatio-frequential patterns using deep neural networks on SPD manifolds. Meanwhile, many experiences of successful MI-EEG classifiers have been integrated into the Tensor-CSPNet framework to make it more efficient. In the experiments, Tensor-CSPNet attains or slightly outperforms the current state-of-the-art performance on the cross-validation and holdout scenarios of two MI-EEG datasets. The visualization and interpretability analyses also exhibit its validity for the MI-EEG classification. To conclude, we provide a feasible answer to the question by generalizing the previous DL methodologies on SPD manifolds, which indicates the start of a specific class from the GDL methodology for the MI-EEG classification.  ( 2 min )
    Deep Layer-wise Networks Have Closed-Form Weights. (arXiv:2006.08539v5 [stat.ML] UPDATED)
    There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network $\textit{one layer at a time}$ with only a "single forward pass" has been proposed as an alternative to bypass BP; we refer to these networks as "layer-wise" networks. We continue the work on layer-wise networks by answering two outstanding questions. First, $\textit{do they have a closed-form solution?}$ Second, $\textit{how do we know when to stop adding more layers?}$ This work proves that the kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\textit{Neural Indicator Kernel}$.  ( 2 min )
    Supervised Learning based QoE Prediction of Video Streaming in Future Networks: A Tutorial with Comparative Study. (arXiv:2202.02454v1 [cs.NI])
    The Quality of Experience (QoE) based service management remains key for successful provisioning of multimedia services in next-generation networks such as 5G/6G, which requires proper tools for quality monitoring, prediction and resource management where machine learning (ML) can play a crucial role. In this paper, we provide a tutorial on the development and deployment of the QoE measurement and prediction solutions for video streaming services based on supervised learning ML models. Firstly, we provide a detailed pipeline for developing and deploying supervised learning-based video streaming QoE prediction models which covers several stages including data collection, feature engineering, model optimization and training, testing and prediction and evaluation. Secondly, we discuss the deployment of the ML model for the QoE prediction/measurement in the next generation networks (5G/6G) using network enabling technologies such as Software-Defined Networking (SDN), Network Function Virtualization (NFV) and Mobile Edge Computing (MEC) by proposing reference architecture. Thirdly, we present a comparative study of the state-of-the-art supervised learning ML models for QoE prediction of video streaming applications based on multiple performance metrics.  ( 2 min )
    Application of Machine Learning-Based Pattern Recognition in IoT Devices: Review. (arXiv:2202.02456v1 [cs.NI])
    The Internet of things (IoT) is a rapidly advancing area of technology that has quickly become more widespread in recent years. With greater numbers of everyday objects being connected to the Internet, many different innovations have been presented to make our everyday lives more straightforward. Pattern recognition is extremely prevalent in IoT devices because of the many applications and benefits that can come from it. A multitude of studies has been conducted with the intention of improving speed and accuracy, decreasing complexity, and reducing the overall required processing power of pattern recognition algorithms in IoT devices. After reviewing the applications of different machine learning algorithms, results vary from case to case, but a general conclusion can be drawn that the optimal machine learning-based pattern recognition algorithms to be used with IoT devices are support vector machine, k-nearest neighbor, and random forest.  ( 2 min )
    Differentially Private Graph Classification with GNNs. (arXiv:2202.02575v1 [cs.LG])
    Graph Neural Networks (GNNs) have established themselves as the state-of-the-art models for many machine learning applications such as the analysis of social networks, protein interactions and molecules. Several among these datasets contain privacy-sensitive data. Machine learning with differential privacy is a promising technique to allow deriving insight from sensitive data while offering formal guarantees of privacy protection. However, the differentially private training of GNNs has so far remained under-explored due to the challenges presented by the intrinsic structural connectivity of graphs. In this work, we introduce differential privacy for graph-level classification, one of the key applications of machine learning on graphs. Our method is applicable to deep learning on multi-graph datasets and relies on differentially private stochastic gradient descent (DP-SGD). We show results on a variety of synthetic and public datasets and evaluate the impact of different GNN architectures and training hyperparameters on model performance for differentially private graph classification. Finally, we apply explainability techniques to assess whether similar representations are learned in the private and non-private settings and establish robust baselines for future work in this area.  ( 2 min )
    A Comparison of Neural Network Architectures for Data-Driven Reduced-Order Modeling. (arXiv:2110.03442v2 [cs.LG] UPDATED)
    The popularity of deep convolutional autoencoders (CAEs) has engendered new and effective reduced-order models (ROMs) for the simulation of large-scale dynamical systems. Despite this, it is still unknown whether deep CAEs provide superior performance over established linear techniques or other network-based methods in all modeling scenarios. To elucidate this, the effect of autoencoder architecture on its associated ROM is studied through the comparison of deep CAEs against two alternatives: a simple fully connected autoencoder, and a novel graph convolutional autoencoder. Through benchmark experiments, it is shown that the superior autoencoder architecture for a given ROM application is highly dependent on the size of the latent space and the structure of the snapshot data, with the proposed architecture demonstrating benefits on data with irregular connectivity when the latent space is sufficiently large.  ( 2 min )
    Medical Image Segmentation using 3D Convolutional Neural Networks: A Review. (arXiv:2108.08467v2 [eess.IV] UPDATED)
    Computer-aided medical image analysis plays a significant role in assisting medical practitioners for expert clinical diagnosis and deciding the optimal treatment plan. At present, convolutional neural networks (CNN) are the preferred choice for medical image analysis. In addition, with the rapid advancements in three-dimensional (3D) imaging systems and the availability of excellent hardware and software support to process large volumes of data, 3D deep learning methods are gaining popularity in medical image analysis. Here, we present an extensive review of the recently evolved 3D deep learning methods in medical image segmentation. Furthermore, the research gaps and future directions in 3D medical image segmentation are discussed.  ( 2 min )
    A Hybrid Architecture for Federated and Centralized Learning. (arXiv:2105.03288v2 [cs.LG] UPDATED)
    Many of the machine learning tasks rely on centralized learning (CL), which requires the transmission of local datasets from the clients to a parameter server (PS) entailing huge communication overhead. To overcome this, federated learning (FL) has been suggested as a promising tool, wherein the clients send only the model updates to the PS instead of the whole dataset. However, FL demands powerful computational resources from the clients. In practice, not all the clients have sufficient computational resources to participate in training. To address this common scenario, we propose a more efficient approach called hybrid federated and centralized learning (HFCL), wherein only the clients with sufficient resources employ FL, while the remaining ones send their datasets to the PS, which computes the model on behalf of them. Then, the model parameters are aggregated at the PS. To improve the efficiency of dataset transmission, we propose two different techniques: i) increased computation-per-client and ii) sequential data transmission. Notably, the HFCL frameworks outperform FL with up to 20\% improvement in the learning accuracy when only half of the clients perform FL while having 50\% less communication overhead than CL since all the clients collaborate on the learning process with their datasets.  ( 2 min )
    TTS-GAN: A Transformer-based Time-Series Generative Adversarial Network. (arXiv:2202.02691v1 [cs.LG])
    Signal measurements appearing in the form of time series are one of the most common types of data used in medical machine learning applications. However, such datasets are often small, making the training of deep neural network architectures ineffective. For time-series, the suite of data augmentation tricks we can use to expand the size of the dataset is limited by the need to maintain the basic properties of the signal. Data generated by a Generative Adversarial Network (GAN) can be utilized as another data augmentation tool. RNN-based GANs suffer from the fact that they cannot effectively model long sequences of data points with irregular temporal relations. To tackle these problems, we introduce TTS-GAN, a transformer-based GAN which can successfully generate realistic synthetic time-series data sequences of arbitrary length, similar to the real ones. Both the generator and discriminator networks of the GAN model are built using a pure transformer encoder architecture. We use visualizations and dimensionality reduction techniques to demonstrate the similarity of real and generated time-series data. We also compare the quality of our generated data with the best existing alternative, which is an RNN-based time-series GAN.  ( 2 min )
    REMAX: Relational Representation for Multi-Agent Exploration. (arXiv:2008.05214v2 [cs.LG] UPDATED)
    Training a multi-agent reinforcement learning (MARL) model with a sparse reward is generally difficult because numerous combinations of interactions among agents induce a certain outcome (i.e., success or failure). Earlier studies have tried to resolve this issue by employing an intrinsic reward to induce interactions that are helpful for learning an effective policy. However, this approach requires extensive prior knowledge for designing an intrinsic reward. To train the MARL model effectively without designing the intrinsic reward, we propose a learning-based exploration strategy to generate the initial states of a game. The proposed method adopts a variational graph autoencoder to represent a game state such that (1) the state can be compactly encoded to a latent representation by considering relationships among agents, and (2) the latent representation can be used as an effective input for a coupled surrogate model to predict an exploration score. The proposed method then finds new latent representations that maximize the exploration scores and decodes these representations to generate initial states from which the MARL model starts training in the game and thus experiences novel and rewardable states. We demonstrate that our method improves the training and performance of the MARL model more than the existing exploration methods.
    Handling Distribution Shifts on Graphs: An Invariance Perspective. (arXiv:2202.02466v1 [cs.LG])
    There is increasing evidence suggesting neural networks' sensitivity to distribution shifts, so that research on out-of-distribution (OOD) generalization comes into the spotlight. Nonetheless, current endeavors mostly focus on Euclidean data, and its formulation for graph-structured data is not clear and remains under-explored, given the two-fold fundamental challenges: 1) the inter-connection among nodes in one graph, which induces non-IID generation of data points even under the same environment, and 2) the structural information in the input graph, which is also informative for prediction. In this paper, we formulate the OOD problem for node-level prediction on graphs and develop a new domain-invariant learning approach, named Explore-to-Extrapolate Risk Minimization, that facilitates GNNs to leverage invariant graph features for prediction. The key difference to existing invariant models is that we design multiple context explorers (specified as graph editers in our case) that are adversarially trained to maximize the variance of risks from multiple virtual environments. Such a design enables the model to extrapolate from a single observed environment which is the common case for node-level prediction. We prove the validity of our method by theoretically showing its guarantee of a valid OOD solution and further demonstrate its power on various real-world datasets for handling distribution shifts from artificial spurious features, cross-domain transfers and dynamic graph evolution.
    Automatic Sleep Staging of EEG Signals: Recent Development, Challenges, and Future Directions. (arXiv:2111.08446v2 [eess.SP] UPDATED)
    Modern deep learning holds a great potential to transform clinical practice on human sleep. Teaching a machine to carry out routine tasks would be a tremendous reduction in workload for clinicians. Sleep staging, a fundamental step in sleep practice, is a suitable task for this and will be the focus in this article. Recently, automatic sleep staging systems have been trained to mimic manual scoring, leading to similar performance to human sleep experts, at least on scoring of healthy subjects. Despite tremendous progress, we have not seen automatic sleep scoring adopted widely in clinical environments. This review aims to give a shared view of the authors on the most recent state-of-the-art development in automatic sleep staging, the challenges that still need to be addressed, and the future directions for automatic sleep scoring to achieve clinical value.
    Adaptive Explainable Continual Learning Framework for Regression Problems with Focus on Power Forecasts. (arXiv:2108.10781v2 [cs.LG] UPDATED)
    Compared with traditional deep learning techniques, continual learning enables deep neural networks to learn continually and adaptively. Deep neural networks have to learn new tasks and overcome forgetting the knowledge obtained from the old tasks as the amount of data keeps increasing in applications. In this article, two continual learning scenarios will be proposed to describe the potential challenges in this context. Besides, based on our previous work regarding the CLeaR framework, which is short for continual learning for regression tasks, the work will be further developed to enable models to extend themselves and learn data successively. Research topics are related but not limited to developing continual deep learning algorithms, strategies for non-stationarity detection in data streams, explainable and visualizable artificial intelligence, etc. Moreover, the framework- and algorithm-related hyperparameters should be dynamically updated in applications. Forecasting experiments will be conducted based on power generation and consumption data collected from real-world applications. A series of comprehensive evaluation metrics and visualization tools can help analyze the experimental results. The proposed framework is expected to be generally applied to other constantly changing scenarios.
    DNS: Determinantal Point Process Based Neural Network Sampler for Ensemble Reinforcement Learning. (arXiv:2201.13357v2 [cs.LG] UPDATED)
    Application of ensemble of neural networks is becoming an imminent tool for advancing the state-of-the-art in deep reinforcement learning algorithms. However, training these large numbers of neural networks in the ensemble has an exceedingly high computation cost which may become a hindrance in training large-scale systems. In this paper, we propose DNS: a Determinantal Point Process based Neural Network Sampler that specifically uses k-dpp to sample a subset of neural networks for backpropagation at every training step thus significantly reducing the training time and computation cost. We integrated DNS in REDQ for continuous control tasks and evaluated on MuJoCo environments. Our experiments show that DNS augmented REDQ outperforms baseline REDQ in terms of average cumulative reward and achieves this using less than 50% computation when measured in FLOPS.
    Topological obstructions in neural networks learning. (arXiv:2012.15834v2 [cs.LG] UPDATED)
    We apply topological data analysis methods to loss functions to gain insights into learning of deep neural networks and deep neural networks generalization properties. We use the Morse complex of the loss function to relate the local behavior of gradient descent trajectories with global properties of the loss surface. We define the neural network Topological Obstructions score, "TO-score", with the help of robust topological invariants, barcodes of the loss function, that quantify the "badness" of local minima for gradient-based optimization. We have made experiments for computing these invariants for fully-connected, convolutional and ResNet-like neural networks on different datasets: MNIST, Fashion MNIST, CIFAR10, CIFAR100 and SVHN. Our two principal observations are as follows. Firstly, the neural network barcode and TO score decrease with the increase of the neural network depth and width, thus the topological obstructions to learning diminish. Secondly, in certain situations there is an intriguing connection between the lengths of minima segments in the barcode and the minima generalization errors.
    Semi-Supervised Clustering via Information-Theoretic Markov Chain Aggregation. (arXiv:2112.09397v2 [cs.LG] UPDATED)
    We connect the problem of semi-supervised clustering to constrained Markov aggregation, i.e., the task of partitioning the state space of a Markov chain. We achieve this connection by considering every data point in the dataset as an element of the Markov chain's state space, by defining the transition probabilities between states via similarities between corresponding data points, and by incorporating semi-supervision information as hard constraints in a Hartigan-style algorithm. The introduced Constrained Markov Clustering (CoMaC) is an extension of a recent information-theoretic framework for (unsupervised) Markov aggregation to the semi-supervised case. Instantiating CoMaC for certain parameter settings further generalizes two previous information-theoretic objectives for unsupervised clustering. Our results indicate that CoMaC is competitive with the state-of-the-art.
    A Novel Micro-service Based Platform for Composition, Deployment and Execution of BDA Applications. (arXiv:2202.02845v1 [cs.DC])
    Big Data are growing at an exponential rate and it becomes necessary the use of tools and technologies to manage, process and visualize them in order to extract value. In this paper a micro-service based platform is presented for the composition, deployment and execution of Big Data Analytics (BDA) application workflows in several domains and scenarios is presented. ALIDA is a result coming from previous research activities by ENGINEERING. It aims to achieve a unified platform that allows both BDA application developers and data analysts to interact with it. Developers will be able to register new BDA applications through the exposed API and/or through the web user interface. Data analysts will be able to use the BDA applications provided to create batch/stream workflows through a dashboard user interface to manipulate and subsequently visualize results from one or more sources. The platform also supports the auto-tuning of Big Data frameworks deployment properties to improve metrics for analytics application. ALIDA has been properly extended and integrated into a software solution for the analysis of large amounts of data from the avionic industries. A use case within this context is then presented.
    Deep Learning-Aided Spatial Multiplexing with Index Modulation. (arXiv:2202.02856v1 [eess.SP])
    In this paper, deep learning (DL)-aided data detection of spatial multiplexing (SMX) multiple-input multiple-output (MIMO) transmission with index modulation (IM) (Deep-SMX-IM) has been proposed. Deep-SMX-IM has been constructed by combining a zero-forcing (ZF) detector and DL technique. The proposed method uses the significant advantages of DL techniques to learn transmission characteristics of the frequency and spatial domains. Furthermore, thanks to using subblockbased detection provided by IM, Deep-SMX-IM is a straightforward method, which eventually reveals reduced complexity. It has been shown that Deep-SMX-IM has significant error performance gains compared to ZF detector without increasing computational complexity for different system configurations.
    Machine Learning with Knowledge Constraints for Process Optimization of Open-Air Perovskite Solar Cell Manufacturing. (arXiv:2110.01387v4 [cs.LG] UPDATED)
    Perovskite photovoltaics (PV) have achieved rapid development in the past decade in terms of power conversion efficiency of small-area lab-scale devices; however, successful commercialization still requires further development of low-cost, scalable, and high-throughput manufacturing techniques. One of the critical challenges of developing a new fabrication technique is the high-dimensional parameter space for optimization, but machine learning (ML) can readily be used to accelerate perovskite PV scaling. Herein, we present an ML-guided framework of sequential learning for manufacturing process optimization. We apply our methodology to the Rapid Spray Plasma Processing (RSPP) technique for perovskite thin films in ambient conditions. With a limited experimental budget of screening 100 process conditions, we demonstrated an efficiency improvement to 18.5% as the best-in-our-lab device fabricated by RSPP, and we also experimentally found 10 unique process conditions to produce the top-performing devices of more than 17% efficiency, which is 5 times higher rate of success than the control experiments with pseudo-random Latin hypercube sampling. Our model is enabled by three innovations: (a) flexible knowledge transfer between experimental processes by incorporating data from prior experimental data as a probabilistic constraint; (b) incorporation of both subjective human observations and ML insights when selecting next experiments; (c) adaptive strategy of locating the region of interest using Bayesian optimization first, and then conducting local exploration for high-efficiency devices. Furthermore, in virtual benchmarking, our framework achieves faster improvements with limited experimental budgets than traditional design-of-experiments methods (e.g., one-variable-at-a-time sampling).
    A Game-theoretic Understanding of Repeated Explanations in ML Models. (arXiv:2202.02659v1 [cs.GT])
    This paper formally models the strategic repeated interactions between a system, comprising of a machine learning (ML) model and associated explanation method, and an end-user who is seeking a prediction/label and its explanation for a query/input, by means of game theory. In this game, a malicious end-user must strategically decide when to stop querying and attempt to compromise the system, while the system must strategically decide how much information (in the form of noisy explanations) it should share with the end-user and when to stop sharing, all without knowing the type (honest/malicious) of the end-user. This paper formally models this trade-off using a continuous-time stochastic Signaling game framework and characterizes the Markov perfect equilibrium state within such a framework.
    Differentiable Economics for Randomized Affine Maximizer Auctions. (arXiv:2202.02872v1 [cs.GT])
    A recent approach to automated mechanism design, differentiable economics, represents auctions by rich function approximators and optimizes their performance by gradient descent. The ideal auction architecture for differentiable economics would be perfectly strategyproof, support multiple bidders and items, and be rich enough to represent the optimal (i.e. revenue-maximizing) mechanism. So far, such an architecture does not exist. There are single-bidder approaches (MenuNet, RochetNet) which are always strategyproof and can represent optimal mechanisms. RegretNet is multi-bidder and can approximate any mechanism, but is only approximately strategyproof. We present an architecture that supports multiple bidders and is perfectly strategyproof, but cannot necessarily represent the optimal mechanism. This architecture is the classic affine maximizer auction (AMA), modified to offer lotteries. By using the gradient-based optimization tools of differentiable economics, we can now train lottery AMAs, competing with or outperforming prior approaches in revenue.
    A Brief Review of Machine Learning Techniques for Protein Phosphorylation Sites Prediction. (arXiv:2108.04951v2 [q-bio.QM] UPDATED)
    Post-translational modifications (PTMs) have vital roles in extending the functional diversity of proteins and as a result, regulating diverse cellular processes in prokaryotic and eukaryotic organisms. Phosphorylation modification is a vital PTM that occurs in most proteins and plays significant roles in many biological processes. Disorders in the phosphorylation process lead to multiple diseases including neurological disorders and cancers. At first, this study comprehensively reviewed all databases related to phosphorylation sites (p-sites). Secondly, we introduced all steps regarding dataset creation, data preprocessing and method evaluation in p-sites prediction. Next, we investigated p-sites prediction methods which fall into two computational and Machine Learning (ML) groups. Additionally, it was shown that there are basically two main approaches for p-sites prediction by ML: conventional and End-to-End learning, which were given an overview for both of them. Moreover, this study introduced the most important feature extraction techniques which have mostly been used in ML approaches. Finally, we created three test sets from new proteins related to the 2022th released version of the dbPTM database based on general and human species. After evaluating available online tools on the test sets, results showed that the performance of online tools for p-sites prediction are quite weak on new reported phospho-proteins.
    AbstractDifferentiation.jl: Backend-Agnostic Differentiable Programming in Julia. (arXiv:2109.12449v2 [cs.MS] UPDATED)
    No single Automatic Differentiation (AD) system is the optimal choice for all problems. This means informed selection of an AD system and combinations can be a problem-specific variable that can greatly impact performance. In the Julia programming language, the major AD systems target the same input and thus in theory can compose. Hitherto, switching between AD packages in the Julia Language required end-users to familiarize themselves with the user-facing API of the respective packages. Furthermore, implementing a new, usable AD package required AD package developers to write boilerplate code to define convenience API functions for end-users. As a response to these issues, we present AbstractDifferentiation.jl for the automatized generation of an extensive, unified, user-facing API for any AD package. By splitting the complexity between AD users and AD developers, AD package developers only need to implement one or two primitive definitions to support various utilities for AD users like Jacobians, Hessians and lazy product operators from native primitives such as pullbacks or pushforwards, thus removing tedious -- but so far inevitable -- boilerplate code, and enabling the easy switching and composing between AD implementations for end-users.
    Making Invisible Visible: Data-Driven Seismic Inversion with Spatio-temporally Constrained Data Augmentation. (arXiv:2106.11892v3 [cs.LG] UPDATED)
    Deep learning and data-driven approaches have shown great potential in scientific domains. The promise of data-driven techniques relies on the availability of a large volume of high-quality training datasets. Due to the high cost of obtaining data through expensive physical experiments, instruments, and simulations, data augmentation techniques for scientific applications have emerged as a new direction for obtaining scientific data recently. However, existing data augmentation techniques originating from computer vision, yield physically unacceptable data samples that are not helpful for the domain problems that we are interested in. In this paper, we develop new data augmentation techniques based on convolutional neural networks. Specifically, our generative models leverage different physics knowledge (such as governing equations, observable perception, and physics phenomena) to improve the quality of the synthetic data. To validate the effectiveness of our data augmentation techniques, we apply them to solve a subsurface seismic full-waveform inversion using simulated CO$_2$ leakage data. Our interest is to invert for subsurface velocity models associated with very small CO$_2$ leakage. We validate the performance of our methods using comprehensive numerical tests. Via comparison and analysis, we show that data-driven seismic imaging can be significantly enhanced by using our data augmentation techniques. Particularly, the imaging quality has been improved by 15% in test scenarios of general-sized leakage and 17% in small-sized leakage when using an augmented training set obtained with our techniques.
    Free Component Analysis: Theory, Algorithms & Applications. (arXiv:1905.01713v3 [cs.LG] UPDATED)
    We describe a method for unmixing mixtures of freely independent random variables in a manner analogous to the independent component analysis (ICA) based method for unmixing independent random variables from their additive mixtures. Random matrices play the role of free random variables in this context so the method we develop, which we call Free component analysis (FCA), unmixes matrices from additive mixtures of matrices. Thus, while the mixing model is standard, the novelty and difference in unmixing performance comes from the introduction of a new statistical criteria, derived from free probability theory, that quantify freeness analogous to how kurtosis and entropy quantify independence. We describe the theory, the various algorithms, and compare FCA to vanilla ICA which does not account for spatial or temporal structure. We highlight why the statistical criteria make FCA also vanilla despite its matricial underpinnings and show that FCA performs comparably to, and sometimes better than, (vanilla) ICA in every application, such as image and speech unmixing, where ICA has been known to succeed. Our computational experiments suggest that not-so-random matrices, such as images and short time fourier transform matrix of waveforms are (closer to being) freer "in the wild" than we might have theoretically expected.
    Learning Interpretable, High-Performing Policies for Continuous Control Problems. (arXiv:2202.02352v1 [cs.LG])
    Gradient-based approaches in reinforcement learning (RL) have achieved tremendous success in learning policies for continuous control problems. While the performance of these approaches warrants real-world adoption in domains, such as in autonomous driving and robotics, these policies lack interpretability, limiting deployability in safety-critical and legally-regulated domains. Such domains require interpretable and verifiable control policies that maintain high performance. We propose Interpretable Continuous Control Trees (ICCTs), a tree-based model that can be optimized via modern, gradient-based, RL approaches to produce high-performing, interpretable policies. The key to our approach is a procedure for allowing direct optimization in a sparse decision-tree-like representation. We validate ICCTs against baselines across six domains, showing that ICCTs are capable of learning interpretable policy representations that parity or outperform baselines by up to 33$\%$ in autonomous driving scenarios while achieving a $300$x-$600$x reduction in the number of policy parameters against deep learning baselines.
    Temporal Knowledge Distillation for On-device Audio Classification. (arXiv:2110.14131v2 [cs.SD] UPDATED)
    Improving the performance of on-device audio classification models remains a challenge given the computational limits of the mobile environment. Many studies leverage knowledge distillation to boost predictive performance by transferring the knowledge from large models to on-device models. However, most lack a mechanism to distill the essence of the temporal information, which is crucial to audio classification tasks, or similar architecture is often required. In this paper, we propose a new knowledge distillation method designed to incorporate the temporal knowledge embedded in attention weights of large transformer-based models into on-device models. Our distillation method is applicable to various types of architectures, including the non-attention-based architectures such as CNNs or RNNs, while retaining the original network architecture during inference. Through extensive experiments on both an audio event detection dataset and a noisy keyword spotting dataset, we show that our proposed method improves the predictive performance across diverse on-device architectures.
    Optimal Algorithms for Decentralized Stochastic Variational Inequalities. (arXiv:2202.02771v1 [math.OC])
    Variational inequalities are a formalism that includes games, minimization, saddle point, and equilibrium problems as special cases. Methods for variational inequalities are therefore universal approaches for many applied tasks, including machine learning problems. This work concentrates on the decentralized setting, which is increasingly important but not well understood. In particular, we consider decentralized stochastic (sum-type) variational inequalities over fixed and time-varying networks. We present lower complexity bounds for both communication and local iterations and construct optimal algorithms that match these lower bounds. Our algorithms are the best among the available literature not only in the decentralized stochastic case, but also in the decentralized deterministic and non-distributed stochastic cases. Experimental results confirm the effectiveness of the presented algorithms.
    Evaluation Methods and Measures for Causal Learning Algorithms. (arXiv:2202.02896v1 [cs.LG])
    The convenient access to copious multi-faceted data has encouraged machine learning researchers to reconsider correlation-based learning and embrace the opportunity of causality-based learning, i.e., causal machine learning (causal learning). Recent years have therefore witnessed great effort in developing causal learning algorithms aiming to help AI achieve human-level intelligence. Due to the lack-of ground-truth data, one of the biggest challenges in current causal learning research is algorithm evaluations. This largely impedes the cross-pollination of AI and causal inference, and hinders the two fields to benefit from the advances of the other. To bridge from conventional causal inference (i.e., based on statistical methods) to causal learning with big data (i.e., the intersection of causal inference and machine learning), in this survey, we review commonly-used datasets, evaluation methods, and measures for causal learning using an evaluation pipeline similar to conventional machine learning. We focus on the two fundamental causal-inference tasks and causality-aware machine learning tasks. Limitations of current evaluation procedures are also discussed. We then examine popular causal inference tools/packages and conclude with primary challenges and opportunities for benchmarking causal learning algorithms in the era of big data. The survey seeks to bring to the forefront the urgency of developing publicly available benchmarks and consensus-building standards for causal learning evaluation with observational data. In doing so, we hope to broaden the discussions and facilitate collaboration to advance the innovation and application of causal learning.
    TorchMD-NET: Equivariant Transformers for Neural Network based Molecular Potentials. (arXiv:2202.02541v1 [cs.LG])
    The prediction of quantum mechanical properties is historically plagued by a trade-off between accuracy and speed. Machine learning potentials have previously shown great success in this domain, reaching increasingly better accuracy while maintaining computational efficiency comparable with classical force fields. In this work we propose TorchMD-NET, a novel equivariant transformer (ET) architecture, outperforming state-of-the-art on MD17, ANI-1, and many QM9 targets in both accuracy and computational efficiency. Through an extensive attention weight analysis, we gain valuable insights into the black box predictor and show differences in the learned representation of conformers versus conformations sampled from molecular dynamics or normal modes. Furthermore, we highlight the importance of datasets including off-equilibrium conformations for the evaluation of molecular potentials.
    Communication Efficient Federated Learning via Ordered ADMM in a Fully Decentralized Setting. (arXiv:2202.02580v1 [cs.LG])
    The challenge of communication-efficient distributed optimization has attracted attention in recent years. In this paper, a communication efficient algorithm, called ordering-based alternating direction method of multipliers (OADMM) is devised in a general fully decentralized network setting where a worker can only exchange messages with neighbors. Compared to the classical ADMM, a key feature of OADMM is that transmissions are ordered among workers at each iteration such that a worker with the most informative data broadcasts its local variable to neighbors first, and neighbors who have not transmitted yet can update their local variables based on that received transmission. In OADMM, we prohibit workers from transmitting if their current local variables are not sufficiently different from their previously transmitted value. A variant of OADMM, called SOADMM, is proposed where transmissions are ordered but transmissions are never stopped for each node at each iteration. Numerical results demonstrate that given a targeted accuracy, OADMM can significantly reduce the number of communications compared to existing algorithms including ADMM. We also show numerically that SOADMM can accelerate convergence, resulting in communication savings compared to the classical ADMM.
    PRONTO: Preamble Overhead Reduction with Neural Networks for Coarse Synchronization. (arXiv:2112.10885v2 [cs.LG] UPDATED)
    In IEEE 802.11 WiFi-based waveforms, the receiver performs coarse time and frequency synchronization using the first field of the preamble known as the legacy short training field (L-STF). The L-STF occupies upto 40% of the preamble length and takes upto 32 us of airtime. With the goal of reducing communication overhead, we propose a modified waveform, where the preamble length is reduced by eliminating the L-STF. To decode this modified waveform, we propose a machine learning (ML)-based scheme called PRONTO that performs coarse time and frequency estimations using other preamble fields, specifically the legacy long training field (L-LTF). Our contributions are threefold: (i) We present PRONTO featuring customized convolutional neural networks (CNNs) for packet detection and coarse CFO estimation, along with data augmentation steps for robust training. (ii) We propose a generalized decision flow that makes PRONTO compatible with legacy waveforms that include the standard L-STF. (iii) We validate the outcomes on an over-the-air WiFi dataset from a testbed of software defined radios (SDRs). Our evaluations show that PRONTO can perform packet detection with 100% accuracy, and coarse CFO estimation with errors as small as 3%. We demonstrate that PRONTO provides upto 40% preamble length reduction with no bit error rate (BER) degradation. Finally, we experimentally show the speedup achieved by PRONTO through GPU parallelization over the corresponding CPU-only implementations.
    Redatuming physical systems using symmetric autoencoders. (arXiv:2108.02537v2 [physics.comp-ph] UPDATED)
    This paper considers physical systems described by hidden states and indirectly observed through repeated measurements corrupted by unmodeled nuisance parameters. A network-based representation learns to disentangle the coherent information (relative to the state) from the incoherent nuisance information (relative to the sensing). Instead of physical models, the representation uses symmetry and stochastic regularization to inform an autoencoder architecture called SymAE. It enables redatuming, i.e., creating virtual data instances where the nuisances are uniformized across measurements.
    Wave-Encoded Model-based Deep Learning for Highly Accelerated Imaging with Joint Reconstruction. (arXiv:2202.02814v1 [eess.IV])
    Purpose: To propose a wave-encoded model-based deep learning (wave-MoDL) strategy for highly accelerated 3D imaging and joint multi-contrast image reconstruction, and further extend this to enable rapid quantitative imaging using an interleaved look-locker acquisition sequence with T2 preparation pulse (3D-QALAS). Method: Recently introduced MoDL technique successfully incorporates convolutional neural network (CNN)-based regularizers into physics-based parallel imaging reconstruction using a small number of network parameters. Wave-CAIPI is an emerging parallel imaging method that accelerates the imaging speed by employing sinusoidal gradients in the phase- and slice-encoding directions during the readout to take better advantage of 3D coil sensitivity profiles. In wave-MoDL, we propose to combine the wave-encoding strategy with unrolled network constraints to accelerate the acquisition speed while enforcing wave-encoded data consistency. We further extend wave-MoDL to reconstruct multi-contrast data with controlled aliasing in parallel imaging (CAIPI) sampling patterns to leverage similarity between multiple images to improve the reconstruction quality. Result: Wave-MoDL enables a 47-second MPRAGE acquisition at 1 mm resolution at 16-fold acceleration. For quantitative imaging, wave-MoDL permits a 2-minute acquisition for T1, T2, and proton density mapping at 1 mm resolution at 12-fold acceleration, from which contrast weighted images can be synthesized as well. Conclusion: Wave-MoDL allows rapid MR acquisition and high-fidelity image reconstruction and may facilitate clinical and neuroscientific applications by incorporating unrolled neural networks into wave-CAIPI reconstruction.
    Stochastic Gradient Descent with Dependent Data for Offline Reinforcement Learning. (arXiv:2202.02850v1 [cs.LG])
    In reinforcement learning (RL), offline learning decoupled learning from data collection and is useful in dealing with exploration-exploitation tradeoff and enables data reuse in many applications. In this work, we study two offline learning tasks: policy evaluation and policy learning. For policy evaluation, we formulate it as a stochastic optimization problem and show that it can be solved using approximate stochastic gradient descent (aSGD) with time-dependent data. We show aSGD achieves $\tilde O(1/t)$ convergence when the loss function is strongly convex and the rate is independent of the discount factor $\gamma$. This result can be extended to include algorithms making approximately contractive iterations such as TD(0). The policy evaluation algorithm is then combined with the policy iteration algorithm to learn the optimal policy. To achieve an $\epsilon$ accuracy, the complexity of the algorithm is $\tilde O(\epsilon^{-2}(1-\gamma)^{-5})$, which matches the complexity bound for classic online RL algorithms such as Q-learning.
    Featherweight Assisted Vulnerability Discovery. (arXiv:2202.02679v1 [cs.CR])
    Predicting vulnerable source code helps to focus attention on those parts of the code that need to be examined with more scrutiny. Recent work proposed the use of function names as semantic cues that can be learned by a deep neural network (DNN) to aid in the hunt for vulnerability of functions. Combining identifier splitting, which splits each function name into its constituent words, with a novel frequency-based algorithm, we explore the extent to which the words that make up a function's name can predict potentially vulnerable functions. In contrast to *lightweight* predictions by a DNN that considers only function names, avoiding the use of a DNN provides *featherweight* predictions. The underlying idea is that function names that contain certain "dangerous" words are more likely to accompany vulnerable functions. Of course, this assumes that the frequency-based algorithm can be properly tuned to focus on truly dangerous words. Because it is more transparent than a DNN, the frequency-based algorithm enables us to investigate the inner workings of the DNN. If successful, this investigation into what the DNN does and does not learn will help us train more effective future models. We empirically evaluate our approach on a heterogeneous dataset containing over 73000 functions labeled vulnerable, and over 950000 functions labeled benign. Our analysis shows that words alone account for a significant portion of the DNN's classification ability. We also find that words are of greatest value in the datasets with a more homogeneous vocabulary. Thus, when working within the scope of a given project, where the vocabulary is unavoidably homogeneous, our approach provides a cheaper, potentially complementary, technique to aid in the hunt for source-code vulnerabilities. Finally, this approach has the advantage that it is viable with orders of magnitude less training data.
    AASAE: Augmentation-Augmented Stochastic Autoencoders. (arXiv:2107.12329v2 [cs.LG] UPDATED)
    Recent methods for self-supervised learning can be grouped into two paradigms: contrastive and non-contrastive approaches. Their success can largely be attributed to data augmentation pipelines which generate multiple views of a single input that preserve the underlying semantics. In this work, we introduce augmentation-augmented stochastic autoencoders (AASAE), yet another alternative to self-supervised learning, based on autoencoding. We derive AASAE starting from the conventional variational autoencoder (VAE), by replacing the KL divergence regularization, which is agnostic to the input domain, with data augmentations that explicitly encourage the internal representations to encode domain-specific invariances and equivariances. We empirically evaluate the proposed AASAE on image classification, similar to how recent contrastive and non-contrastive learning algorithms have been evaluated. Our experiments confirm the effectiveness of data augmentation as a replacement for KL divergence regularization. The AASAE outperforms the VAE by 30% on CIFAR-10, 40% on STL-10 and 45% on Imagenet. On CIFAR-10 and STL-10, the results for AASAE are largely comparable to the state-of-the-art algorithms for self-supervised learning.
    Local Permutation Equivariance For Graph Neural Networks. (arXiv:2111.11840v2 [cs.LG] UPDATED)
    In this work we develop a new method, named Sub-graph Permutation Equivariant Networks (SPEN), which provides a framework for building graph neural networks that operate on sub-graphs, while using permutation equivariant update functions that are also equivariant to a novel choice of automorphism groups. Message passing neural networks have been shown to be limited in their expressive power and recent approaches to over come this either lack scalability or require structural information to be encoded into the feature space. The general framework presented here overcomes the scalability issues associated with global permutation equivariance by operating on sub-graphs. In addition, through operating on sub-graphs the expressive power of higher-dimensional global permutation equivariant networks is improved; this is due to fact that two non-distinguishable graphs often contain distinguishable sub-graphs. Furthermore, the proposed framework only requires a choice of $k$-hops for creating ego-network sub-graphs and a choice of representation space to be used for each layer, which makes the method easily applicable across a range of graph based domains. We experimentally validate the method on a range of graph benchmark classification tasks, demonstrating either state-of-the-art results or very competitive results on all benchmarks. Further, we demonstrate that the use of local update functions offers a significant improvement in GPU memory over global methods.
    On Using Transformers for Speech-Separation. (arXiv:2202.02884v1 [eess.AS])
    Transformers have enabled major improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we have proposed SepFormer, which uses self-attention and obtains state-of-the art results on WSJ0-2/3 Mix datasets for speech separation. In this paper, we extend our previous work by providing results on more datasets including LibriMix, and WHAM!, WHAMR! which include noisy and noisy-reverberant conditions. Moreover we provide denoising, and denoising+dereverberation results in the context of speech enhancement, respectively on WHAM! and WHAMR! datasets. We also investigate incorporating recently proposed efficient self-attention mechanisms inside the SepFormer model, and show that by using efficient self-attention mechanisms it is possible to reduce the memory requirements significantly while performing better than the popular convtasnet model on WSJ0-2Mix dataset.
    Weighted Gaussian Process Bandits for Non-stationary Environments. (arXiv:2107.02371v2 [cs.LG] UPDATED)
    In this paper, we consider the Gaussian process (GP) bandit optimization problem in a non-stationary environment. To capture external changes, the black-box function is allowed to be time-varying within a reproducing kernel Hilbert space (RKHS). To this end, we develop WGP-UCB, a novel UCB-type algorithm based on weighted Gaussian process regression. A key challenge is how to cope with infinite-dimensional feature maps. To that end, we leverage kernel approximation techniques to prove a sublinear regret bound, which is the first (frequentist) sublinear regret guarantee on weighted time-varying bandits with general nonlinear rewards. This result generalizes both non-stationary linear bandits and standard GP-UCB algorithms. Further, a novel concentration inequality is achieved for weighted Gaussian process regression with general weights. We also provide universal upper bounds and weight-dependent upper bounds for weighted maximum information gains. These results are of independent interest for applications such as news ranking and adaptive pricing, where weights can be adopted to capture the importance or quality of data. Finally, we conduct experiments to highlight the favorable gains of the proposed algorithm in many cases when compared to existing methods.
    Coordinate Linear Variance Reduction for Generalized Linear Programming. (arXiv:2111.01842v3 [math.OC] UPDATED)
    We study a class of generalized linear programs (GLP) in a large-scale setting, which includes simple, possibly nonsmooth convex regularizer and simple convex set constraints. By reformulating (GLP) as an equivalent convex-concave min-max problem, we show that the linear structure in the problem can be used to design an efficient, scalable first-order algorithm, to which we give the name \emph{Coordinate Linear Variance Reduction} (\textsc{clvr}; pronounced "clever"). \textsc{clvr} yields improved complexity results for (GLP) that depend on the max row norm of the linear constraint matrix in (GLP) rather than the spectral norm. When the regularization terms and constraints are separable, \textsc{clvr} admits an efficient lazy update strategy that makes its complexity bounds scale with the number of nonzero elements of the linear constraint matrix in (GLP) rather than the matrix dimensions. On the other hand, for the special case of linear programs, by exploiting sharpness, we propose a restart scheme for \textsc{clvr} to obtain empirical linear convergence. Then we show that Distributionally Robust Optimization (DRO) problems with ambiguity sets based on both $f$-divergence and Wasserstein metrics can be reformulated as (GLPs) by introducing sparsely connected auxiliary variables. We complement our theoretical guarantees with numerical experiments that verify our algorithm's practical effectiveness, in terms of wall-clock time and number of data passes.
    Scaling-Translation-Equivariant Networks with Decomposed Convolutional Filters. (arXiv:1909.11193v3 [cs.LG] UPDATED)
    Encoding the scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many computer vision tasks especially when dealing with multiscale inputs. We study, in this paper, a scaling-translation-equivariant (ST-equivariant) CNN with joint convolutions across the space and the scaling group, which is shown to be both sufficient and necessary to achieve equivariance for the regular representation of the scaling-translation group ST . To reduce the model complexity and computational burden, we decompose the convolutional filters under two pre-fixed separable bases and truncate the expansion to low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation, a property which is theoretically analyzed and empirically verified. Numerical experiments demonstrate that the proposed scaling-translation-equivariant network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size.
    Learning to Prompt for Vision-Language Models. (arXiv:2109.01134v3 [cs.CV] UPDATED)
    Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to any downstream task via \emph{prompting}, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming -- one needs to spend a significant amount of time on words tuning since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose \emph{Context Optimization (CoOp)}, a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while the entire pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts with a decent margin and is able to gain significant improvements when using more shots, e.g., with 16 shots the average gain is around 15\% (with the highest reaching over 45\%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
    Applications of Machine Learning in Healthcare and Internet of Things (IOT): A Comprehensive Review. (arXiv:2202.02868v1 [cs.LG])
    In recent years, smart healthcare IoT devices have become ubiquitous, but they work in isolated networks due to their policy. Having these devices connected in a network enables us to perform medical distributed data analysis. However, the presence of diverse IoT devices in terms of technology, structure, and network policy, makes it a challenging issue while applying traditional centralized learning algorithms on decentralized data collected from the IoT devices. In this study, we present an extensive review of the state-of-the-art machine learning applications particularly in healthcare, challenging issues in IoT, and corresponding promising solutions. Finally, we highlight some open-ended issues of IoT in healthcare that leaves further research studies and investigation for scientists.
    Sparse Training via Boosting Pruning Plasticity with Neuroregeneration. (arXiv:2106.10404v4 [cs.LG] UPDATED)
    Works on lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) have raised a lot of attention currently on post-training pruning (iterative magnitude pruning), and before-training pruning (pruning at initialization). The former method suffers from an extremely large computation cost and the latter usually struggles with insufficient performance. In comparison, during-training pruning, a class of pruning methods that simultaneously enjoys the training/inference efficiency and the comparable performance, temporarily, has been less explored. To better understand during-training pruning, we quantitatively study the effect of pruning throughout training from the perspective of pruning plasticity (the ability of the pruned networks to recover the original performance). Pruning plasticity can help explain several other empirical observations about neural network pruning in literature. We further find that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., to regenerate the same number of connections as pruned. We design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (\textbf{GraNet}), that advances state of the art. Perhaps most impressively, its sparse-to-sparse version for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods with ResNet-50 on ImageNet without extending the training time. We release all codes in https://github.com/Shiweiliuiiiiiii/GraNet.
    Likelihood Training of Schr\"odinger Bridge using Forward-Backward SDEs Theory. (arXiv:2110.11291v3 [stat.ML] UPDATED)
    Schr\"odinger Bridge (SB) is an entropy-regularized optimal transport problem that has received increasing attention in deep generative modeling for its mathematical flexibility compared to the Scored-based Generative Model (SGM). However, it remains unclear whether the optimization principle of SB relates to the modern training of deep generative models, which often rely on constructing log-likelihood objectives.This raises questions on the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded on Forward-Backward Stochastic Differential Equations Theory - a mathematical methodology appeared in stochastic optimal control that transforms the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SB that, surprisingly, generalizes the ones for SGM as special cases. This leads to a new optimization principle that inherits the same SB optimality yet without losing applications of modern generative training techniques, and we show that the resulting training algorithm achieves comparable results on generating realistic images on MNIST, CelebA, and CIFAR10. Our code is available at https://github.com/ghliu/SB-FBSDE.
    Discovering Personalized Semantics for Soft Attributes in Recommender Systems using Concept Activation Vectors. (arXiv:2202.02830v1 [cs.IR])
    Interactive recommender systems (RSs) allow users to express intent, preferences and contexts in a rich fashion, often using natural language. One challenge in using such feedback is inferring a user's semantic intent from the open-ended terms used to describe an item, and using it to refine recommendation results. Leveraging concept activation vectors (CAVs) [21], we develop a framework to learn a representation that captures the semantics of such attributes and connects them to user preferences and behaviors in RSs. A novel feature of our approach is its ability to distinguish objective and subjective attributes and associate different senses with different users. Using synthetic and real-world datasets, we show that our CAV representation accurately interprets users' subjective semantics, and can improve recommendations via interactive critiquing
    Bayesian Neural Ordinary Differential Equations. (arXiv:2012.07244v4 [cs.LG] UPDATED)
    Recently, Neural Ordinary Differential Equations has emerged as a powerful framework for modeling physical simulations without explicitly defining the ODEs governing the system, but instead learning them via machine learning. However, the question: "Can Bayesian learning frameworks be integrated with Neural ODE's to robustly quantify the uncertainty in the weights of a Neural ODE?" remains unanswered. In an effort to address this question, we primarily evaluate the following categories of inference methods: (a) The No-U-Turn MCMC sampler (NUTS), (b) Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) and (c) Stochastic Langevin Gradient Descent (SGLD). We demonstrate the successful integration of Neural ODEs with the above Bayesian inference frameworks on classical physical systems, as well as on standard machine learning datasets like MNIST, using GPU acceleration. On the MNIST dataset, we achieve a posterior sample accuracy of 98.5% on the test ensemble of 10,000 images. Subsequently, for the first time, we demonstrate the successful integration of variational inference with normalizing flows and Neural ODEs, leading to a powerful Bayesian Neural ODE object. Finally, considering a predator-prey model and an epidemiological system, we demonstrate the probabilistic identification of model specification in partially-described dynamical systems using universal ordinary differential equations. Together, this gives a scientific machine learning tool for probabilistic estimation of epistemic uncertainties.
    Solving Inventory Management Problems with Inventory-dynamics-informed Neural Networks. (arXiv:2201.06126v2 [cs.LG] UPDATED)
    A key challenge in inventory management is to identify policies that optimally replenish inventory from multiple suppliers. To solve such optimization problems, inventory managers need to decide what quantities to order from each supplier, given the on-hand inventory and outstanding orders, so that the expected backlogging, holding, and sourcing costs are jointly minimized. Inventory management problems have been studied extensively for over 60 years, and yet even basic dual sourcing problems, in which orders from an expensive supplier arrive faster than orders from a regular supplier, remain intractable in their general form. In this work, we approach dual sourcing from a neural-network-based optimization lens. By incorporating inventory dynamics into the design of neural networks, we are able to learn near-optimal policies of commonly used instances within a few minutes of CPU time on a regular personal computer. To demonstrate the versatility of inventory-dynamics-informed neural networks, we show that they are able to control inventory dynamics with empirical demand distributions that are challenging to tackle effectively using alternative, state-of-the-art approaches.
    Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction. (arXiv:2009.00606v5 [stat.ML] UPDATED)
    We present a general methodology for using unlabeled data to design semi supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze of the effectiveness of our SSL approach in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled data to determine signal-noise combinations where SSL outperforms both supervised learning and the null model. We then use SSL in an adaptive manner based on estimation of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version is in fact not capable of improving on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. On the other hand, the adaptive model presented in this work, can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. This is shown empirically through extensive simulations, and extended to other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions.
    Learning Features with Parameter-Free Layers. (arXiv:2202.02777v1 [cs.CV])
    Trainable layers such as convolutional building blocks are the standard network design choices by learning parameters to capture the global context through successive spatial operations. When designing an efficient network, trainable layers such as the depthwise convolution is the source of efficiency in the number of parameters and FLOPs, but there was little improvement to the model speed in practice. This paper argues that simple built-in parameter-free operations can be a favorable alternative to the efficient trainable layers replacing spatial operations in a network architecture. We aim to break the stereotype of organizing the spatial operations of building blocks into trainable layers. Extensive experimental analyses based on layer-level studies with fully-trained models and neural architecture searches are provided to investigate whether parameter-free operations such as the max-pool are functional. The studies eventually give us a simple yet effective idea for redesigning network architectures, where the parameter-free operations are heavily used as the main building block without sacrificing the model accuracy as much. Experimental results on the ImageNet dataset demonstrate that the network architectures with parameter-free operations could enjoy the advantages of further efficiency in terms of model speed, the number of the parameters, and FLOPs. Code and ImageNet pretrained models are available at https://github.com/naver-ai/PfLayer.
    A Scalable AutoML Approach Based on Graph Neural Networks. (arXiv:2111.00083v2 [cs.LG] UPDATED)
    AutoML systems build machine learning models automatically by performing a search over valid data transformations and learners, along with hyper-parameter optimization for each learner. Many AutoML systems use meta-learning to guide search for optimal pipelines. In this work, we present a novel meta-learning system called KGpip which, (1) builds a database of datasets and corresponding pipelines by mining thousands of scripts with program analysis, (2) uses dataset embeddings to find similar datasets in the database based on its content instead of metadata-based features, (3) models AutoML pipeline creation as a graph generation problem, to succinctly characterize the diverse pipelines seen for a single dataset. KGpip's meta-learning is a sub-component for AutoML systems. We demonstrate this by integrating KGpip with two AutoML systems. Our comprehensive evaluation using 126 datasets, including those used by the state-of-the-art systems, shows that KGpip significantly outperforms these systems.
    Multi-Task Neural Processes. (arXiv:2110.14953v3 [cs.LG] UPDATED)
    Neural Processes (NPs) consider a task as a function realized from a stochastic process and flexibly adapt to unseen tasks through inference on functions. However, naive NPs can model data from only a single stochastic process and are designed to infer each task independently. Since many real-world data represent a set of correlated tasks from multiple sources (e.g., multiple attributes and multi-sensor data), it is beneficial to infer them jointly and exploit the underlying correlation to improve the predictive performance. To this end, we propose Multi-Task Neural Processes (MTNPs), an extension of NPs designed to jointly infer tasks realized from multiple stochastic processes. We build MTNPs in a hierarchical way such that inter-task correlation is considered by conditioning all per-task latent variables on a single global latent variable. In addition, we further design our MTNPs so that they can address multi-task settings with incomplete data (i.e., not all tasks share the same set of input points), which has high practical demands in various applications. Experiments demonstrate that MTNPs can successfully model multiple tasks jointly by discovering and exploiting their correlations in various real-world data such as time series of weather attributes and pixel-aligned visual modalities.
    HARFE: Hard-Ridge Random Feature Expansion. (arXiv:2202.02877v1 [stat.ML])
    We propose a random feature model for approximating high-dimensional sparse additive functions called the hard-ridge random feature expansion method (HARFE). This method utilizes a hard-thresholding pursuit-based algorithm applied to the sparse ridge regression (SRR) problem to approximate the coefficients with respect to the random feature matrix. The SRR formulation balances between obtaining sparse models that use fewer terms in their representation and ridge-based smoothing that tend to be robust to noise and outliers. In addition, we use a random sparse connectivity pattern in the random feature matrix to match the additive function assumption. We prove that the HARFE method is guaranteed to converge with a given error bound depending on the noise and the parameters of the sparse ridge regression model. Based on numerical results on synthetic data as well as on real datasets, the HARFE approach obtains lower (or comparable) error than other state-of-the-art algorithms.
    Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts. (arXiv:2112.06096v3 [cs.CL] UPDATED)
    Continuously-growing data volumes lead to larger generic models. Specific use-cases are usually left out, since generic models tend to perform poorly in domain-specific cases. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora, for the task of machine translation. The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set. We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on this in-domain data outperform models trained on generic or a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and data size.
    Augmenting Novelty Search with a Surrogate Model to Engineer Meta-Diversity in Ensembles of Classifiers. (arXiv:2201.12896v2 [cs.LG] UPDATED)
    Using Neuroevolution combined with Novelty Search to promote behavioural diversity is capable of constructing high-performing ensembles for classification. However, using gradient descent to train evolved architectures during the search can be computationally prohibitive. Here we propose a method to overcome this limitation by using a surrogate model which estimates the behavioural distance between two neural network architectures required to calculate the sparseness term in Novelty Search. We demonstrate a speedup of 10 times over previous work and significantly improve on previous reported results on three benchmark datasets from Computer Vision -- CIFAR-10, CIFAR-100, and SVHN. This results from the expanded architecture search space facilitated by using a surrogate. Our method represents an improved paradigm for implementing horizontal scaling of learning algorithms by making an explicit search for diversity considerably more tractable for the same bounded resources.
    EASY: Ensemble Augmented-Shot Y-shaped Learning: State-Of-The-Art Few-Shot Classification with Simple Ingredients. (arXiv:2201.09699v2 [cs.LG] UPDATED)
    Few-shot learning aims at leveraging knowledge learned by one or more deep learning models, in order to obtain good classification performance on new problems, where only a few labeled samples per class are available. Recent years have seen a fair number of works in the field, introducing methods with numerous ingredients. A frequent problem, though, is the use of suboptimally trained models to extract knowledge, leading to interrogations on whether proposed approaches bring gains compared to using better initial models without the introduced ingredients. In this work, we propose a simple methodology, that reaches or even beats state of the art performance on multiple standardized benchmarks of the field, while adding almost no hyperparameters or parameters to those used for training the initial deep learning models on the generic dataset. This methodology offers a new baseline on which to propose (and fairly compare) new techniques or adapt existing ones.
    Multi-Image Visual Question Answering. (arXiv:2112.13706v2 [cs.CV] UPDATED)
    While a lot of work has been done on developing models to tackle the problem of Visual Question Answering, the ability of these models to relate the question to the image features still remain less explored. We present an empirical study of different feature extraction methods with different loss functions. We propose New dataset for the task of Visual Question Answering with multiple image inputs having only one ground truth, and benchmark our results on them. Our final model utilising Resnet + RCNN image features and Bert embeddings, inspired from stacked attention network gives 39% word accuracy and 99% image accuracy on CLEVER+TinyImagenet dataset.
    Federated Submodel Averaging. (arXiv:2109.07704v3 [cs.LG] UPDATED)
    We study practical data characteristics underlying federated learning, where non-i.i.d. data from clients have sparse features, and a certain client's local data normally involves only a small part of the full model, called a submodel. Due to data sparsity, the classical federated averaging (FedAvg) algorithm or its variants will be severely slowed down, because when updating the global model, each client's zero update of the full model excluding its submodel is inaccurately aggregated. Therefore, we propose federated submodel averaging (FedSubAvg), ensuring that the expectation of the global update of each model parameter is equal to the average of the local updates of the clients who involve it. We theoretically proved the convergence rate of FedSubAvg by deriving an upper bound under a new metric called the element-wise gradient norm. In particular, this new metric can characterize the convergence of federated optimization over sparse data, while the conventional metric of squared gradient norm used in FedAvg and its variants cannot. We extensively evaluated FedSubAvg over both public and industrial datasets. The evaluation results demonstrate that FedSubAvg significantly outperforms FedAvg and its variants.
    Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning. (arXiv:2202.02693v1 [cs.LG])
    Distributional reinforcement learning (RL) aims to learn a value-network that predicts the full distribution of the returns for a given state, often modeled via a quantile-based critic. This approach has been successfully integrated into common RL methods for continuous control, giving rise to algorithms such as Distributional Soft Actor-Critic (DSAC). In this paper, we introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for single-sample target value estimation, as commonly employed in current practice. The improved distributional estimates further lend themselves to UCB-based exploration. These two ideas are combined to yield our distributional RL algorithm, E2DC (Extra Exploration with Distributional Critics). We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control. We provide further insight into the method via visualization and analysis of the learned distributions and their evolution during training.
    FAST-RIR: Fast neural diffuse room impulse response generator. (arXiv:2110.04057v2 [cs.SD] UPDATED)
    We present a neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment. Our FAST-RIR takes rectangular room dimensions, listener and speaker positions, and reverberation time as inputs and generates specular and diffuse reflections for a given acoustic environment. Our FAST-RIR is capable of generating RIRs for a given input reverberation time with an average error of 0.02s. We evaluate our generated RIRs in automatic speech recognition (ASR) applications using Google Speech API, Microsoft Speech API, and Kaldi tools. We show that our proposed FAST-RIR with batch size 1 is 400 times faster than a state-of-the-art diffuse acoustic simulator (DAS) on a CPU and gives similar performance to DAS in ASR experiments. Our FAST-RIR is 12 times faster than an existing GPU-based RIR generator (gpuRIR). We show that our FAST-RIR outperforms gpuRIR by 2.5% in an AMI far-field ASR benchmark.
    VIS-iTrack: Visual Intention through Gaze Tracking using Low-Cost Webcam. (arXiv:2202.02587v1 [cs.HC])
    Human intention is an internal, mental characterization for acquiring desired information. From interactive interfaces containing either textual or graphical information, intention to perceive desired information is subjective and strongly connected with eye gaze. In this work, we determine such intention by analyzing real-time eye gaze data with a low-cost regular webcam. We extracted unique features (e.g., Fixation Count, Eye Movement Ratio) from the eye gaze data of 31 participants to generate a dataset containing 124 samples of visual intention for perceiving textual or graphical information, labeled as either TEXT or IMAGE, having 48.39% and 51.61% distribution, respectively. Using this dataset, we analyzed 5 classifiers, including Support Vector Machine (SVM) (Accuracy: 92.19%). Using the trained SVM, we investigated the variation of visual intention among 30 participants, distributed in 3 age groups, and found out that young users were more leaned towards graphical contents whereas older adults felt more interested in textual ones. This finding suggests that real-time eye gaze data can be a potential source of identifying visual intention, analyzing which intention aware interactive interfaces can be designed and developed to facilitate human cognition.
    Single-stream CNN with Learnable Architecture for Multi-source Remote Sensing Data. (arXiv:2109.06094v2 [cs.CV] UPDATED)
    In this paper, we propose an efficient and generalizable framework based on deep convolutional neural network (CNN) for multi-source remote sensing data joint classification. While recent methods are mostly based on multi-stream architectures, we use group convolution to construct equivalent network architectures efficiently within a single-stream network. We further adopt and improve dynamic grouping convolution (DGConv) to make group convolution hyperparameters, and thus the overall network architecture, learnable during network training. The proposed method therefore can theoretically adjust any modern CNN models to any multi-source remote sensing data set, and can potentially avoid sub-optimal solutions caused by manually decided architecture hyperparameters. In the experiments, the proposed method is applied to ResNet and UNet, and the adjusted networks are verified on three very diverse benchmark data sets (i.e., Houston2018 data, Berlin data, and MUUFL data). Experimental results demonstrate the effectiveness of the proposed single-stream CNNs, and in particular ResNet18-DGConv improves the state-of-the-art classification overall accuracy (OA) on HS-SAR Berlin data set from $62.23\%$ to $68.21\%$. In the experiments we have two interesting findings. First, using DGConv generally reduces test OA variance. Second, multi-stream is harmful to model performance if imposed to the first few layers, but becomes beneficial if applied to deeper layers. Altogether, the findings imply that multi-stream architecture, instead of being a strictly necessary component in deep learning models for multi-source remote sensing data, essentially plays the role of model regularizer. Our code is publicly available at https://github.com/yyyyangyi/Multi-source-RS-DGConv. We hope our work can inspire novel research in the future.  ( 2 min )
    Short-Term Power Prediction for Renewable Energy Using Hybrid Graph Convolutional Network and Long Short-Term Memory Approach. (arXiv:2111.07958v2 [eess.SY] UPDATED)
    Accurate short-term solar and wind power predictions play an important role in the planning and operation of power systems. However, the short-term power prediction of renewable energy has always been considered a complex regression problem, owing to the fluctuation and intermittence of output powers and the law of dynamic change with time due to local weather conditions, i.e. spatio-temporal correlation. To capture the spatio-temporal features simultaneously, this paper proposes a new graph neural network-based short-term power forecasting approach, which combines the graph convolutional network (GCN) and long short-term memory (LSTM). Specifically, the GCN is employed to learn complex spatial correlations between adjacent renewable energies, and the LSTM is used to learn dynamic changes of power generation curves. The simulation results show that the proposed hybrid approach can model the spatio-temporal correlation of renewable energies, and its performance outperforms popular baselines on real-world datasets.  ( 2 min )
    BALanCe: Deep Bayesian Active Learning via Equivalence Class Annealing. (arXiv:2112.13737v2 [cs.LG] UPDATED)
    Active learning has demonstrated data efficiency in many fields. Existing active learning algorithms, especially in the context of deep Bayesian active models, rely heavily on the quality of uncertainty estimations of the model. However, such uncertainty estimates could be heavily biased, especially with limited and imbalanced training data. In this paper, we propose BALanCe, a Bayesian deep active learning framework that mitigates the effect of such biases. Concretely, BALanCe employs a novel acquisition function which leverages the structure captured by equivalence hypothesis classes and facilitates differentiation among different equivalence classes. Intuitively, each equivalence class consists of instantiations of deep models with similar predictions, and BALanCe adaptively adjusts the size of the equivalence classes as learning progresses. Besides the fully sequential setting, we further propose Batch-BALanCe -- a generalization of the sequential algorithm to the batched setting -- to efficiently select batches of training examples that are jointly effective for model improvement. We show that Batch-BALanCe achieves state-of-the-art performance on several benchmark datasets for active learning, and that both algorithms can effectively handle realistic challenges that often involve multi-class and imbalanced data.  ( 2 min )
    Demystifying the Transferability of Adversarial Attacks in Computer Networks. (arXiv:2110.04488v2 [cs.CR] UPDATED)
    Convolutional Neural Networks (CNNs) models are one of the most frequently used deep learning networks and are extensively used in both academia and industry. Recent studies demonstrated that adversarial attacks against such models can maintain their effectiveness even when used on models other than the one targeted by the attacker. This major property is known as transferability and makes CNNs ill-suited for security applications. In this paper, we provide the first comprehensive study which assesses the robustness of CNN-based models for computer networks against adversarial transferability. Furthermore, we investigate whether the transferability property issue holds in computer networks applications. In our experiments, we first consider five different attacks: the Iterative Fast Gradient Method (I-FGSM), the Jacobian-based Saliency Map (JSMA), the Limited-memory Broyden letcher Goldfarb Shanno BFGS (L-BFGS), the Projected Gradient Descent (PGD), and the DeepFool attack. Then, we perform these attacks against three well-known datasets: the Network-based Detection of IoT (N-BaIoT) dataset and the Domain Generating Algorithms (DGA) dataset, and RIPE Atlas dataset. Our experimental results show clearly that the transferability happens in specific use cases for the I-FGSM, the JSMA, and the LBFGS attack. In such scenarios, the attack success rate on the target network range from 63.00\% to 100\%. Finally, we suggest two shielding strategies to hinder the attack transferability, by considering the Most Powerful Attacks (MPAs), and the mismatch LSTM architecture.  ( 2 min )
    Demystifying How Self-Supervised Features Improve Training from Noisy Labels. (arXiv:2110.09022v2 [cs.LG] UPDATED)
    The advancement of self-supervised learning (SSL) motivates researchers to apply SSL on other tasks such as learning with noisy labels. Recent literature indicates that methods built on SSL features can substantially improve the performance of learning with noisy labels. Nonetheless, the deeper reasons why (and how) SSL features benefit the training from noisy labels are less understood. In this paper, we study why and how self-supervised features help networks resist label noise using both theoretical analyses and numerical experiments. Our results explain when and why fixing the SSL encoder helps converge to a better optimum, and why an unfixed encoder is unstable but tends to achieve a better best-epoch accuracy in more challenging noise settings. Further, we provide insights for how knowledge distilled from SSL features can compromise between a fixed encoder and an unfixed encoder. We hope our work provides a better understanding for learning with noisy labels from the perspective of self-supervised learning and can potentially serve as a guideline for further research. Code is available at github.com/UCSC-REAL/SelfSup_NoisyLabel.  ( 2 min )
    Improving Ensemble Robustness by Collaboratively Promoting and Demoting Adversarial Robustness. (arXiv:2009.09612v2 [cs.CV] UPDATED)
    Ensemble-based adversarial training is a principled approach to achieve robustness against adversarial attacks. An important technique of this approach is to control the transferability of adversarial examples among ensemble members. We propose in this work a simple yet effective strategy to collaborate among committee models of an ensemble model. This is achieved via the secure and insecure sets defined for each model member on a given sample, hence help us to quantify and regularize the transferability. Consequently, our proposed framework provides the flexibility to reduce the adversarial transferability as well as to promote the diversity of ensemble members, which are two crucial factors for better robustness in our ensemble approach. We conduct extensive and comprehensive experiments to demonstrate that our proposed method outperforms the state-of-the-art ensemble baselines, at the same time can detect a wide range of adversarial examples with a nearly perfect accuracy. Our code is available at: https://github.com/tuananhbui89/Crossing-Collaborative-Ensemble.  ( 2 min )
    Resource-Efficient and Delay-Aware Federated Learning Design under Edge Heterogeneity. (arXiv:2112.13926v3 [cs.NI] UPDATED)
    Federated learning (FL) has emerged as a popular technique for distributing machine learning across wireless edge devices. We examine FL under two salient properties of contemporary networks: device-server communication delays and device computation heterogeneity. Our proposed StoFedDelAv algorithm incorporates a local-global model combiner into the FL synchronization step. We theoretically characterize the convergence behavior of StoFedDelAv and obtain the optimal combiner weights, which consider the global model delay and expected local gradient error at each device. We then formulate a network-aware optimization problem which tunes the minibatch sizes of the devices to jointly minimize energy consumption and machine learning training loss, and solve the non-convex problem through a series of convex approximations. Our simulations reveal that StoFedDelAv outperforms the current art in FL, evidenced by the obtained improvements in optimization objective.  ( 2 min )
    Beyond Black Box Densities: Parameter Learning for the Deviated Components. (arXiv:2202.02651v1 [stat.ML])
    As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density being deviated from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviating mixture model} $(1-\lambda^{*})h_0 + \lambda^{*} (\sum_{i = 1}^{k} p_{i}^{*} f(x|\theta_{i}^{*}))$, where $h_0$ is a known density function, while the deviated proportion $\lambda^{*}$ and latent mixing measure $G_{*} = \sum_{i = 1}^{k} p_{i}^{*} \delta_{\theta_i^{*}}$ associated with the mixture distribution are unknown. Via a novel notion of distinguishability between the known density $h_{0}$ and the deviated mixture distribution, we establish rates of convergence for the maximum likelihood estimates of $\lambda^{*}$ and $G^{*}$ under Wasserstein metric. Simulation studies are carried out to illustrate the theory.  ( 2 min )
    Spectrally Adapted Physics-Informed Neural Networks for Solving Unbounded Domain Problems. (arXiv:2202.02710v1 [cs.LG])
    Solving analytically intractable partial differential equations (PDEs) that involve at least one variable defined in an unbounded domain requires efficient numerical methods that accurately resolve the dependence of the PDE on that variable over several orders of magnitude. Unbounded domain problems arise in various application areas and solving such problems is important for understanding multi-scale biological dynamics, resolving physical processes at long time scales and distances, and performing parameter inference in engineering problems. In this work, we combine two classes of numerical methods: (i) physics-informed neural networks (PINNs) and (ii) adaptive spectral methods. The numerical methods that we develop take advantage of the ability of physics-informed neural networks to easily implement high-order numerical schemes to efficiently solve PDEs. We then show how recently introduced adaptive techniques for spectral methods can be integrated into PINN-based PDE solvers to obtain numerical solutions of unbounded domain problems that cannot be efficiently approximated by standard PINNs. Through a number of examples, we demonstrate the advantages of the proposed spectrally adapted PINNs (s-PINNs) over standard PINNs in approximating functions, solving PDEs, and estimating model parameters from noisy observations in unbounded domains.  ( 2 min )
    A Coalition Formation Game Approach for Personalized Federated Learning. (arXiv:2202.02502v1 [cs.AI])
    Facing the challenge of statistical diversity in client local data distribution, personalized federated learning (PFL) has become a growing research hotspot. Although the state-of-the-art methods with model similarity-based pairwise collaboration have achieved promising performance, they neglect the fact that model aggregation is essentially a collaboration process within the coalition, where the complex multiwise influences take place among clients. In this paper, we first apply Shapley value (SV) from coalition game theory into the PFL scenario. To measure the multiwise collaboration among a group of clients on the personalized learning performance, SV takes their marginal contribution to the final result as a metric. We propose a novel personalized algorithm: pFedSV, which can 1. identify each client's optimal collaborator coalition and 2. perform personalized model aggregation based on SV. Extensive experiments on various datasets (MNIST, Fashion-MNIST, and CIFAR-10) are conducted with different Non-IID data settings (Pathological and Dirichlet). The results show that pFedSV can achieve superior personalized accuracy for each client, compared to the state-of-the-art benchmarks.  ( 2 min )
    ROMNet: Renovate the Old Memories. (arXiv:2202.02606v1 [eess.IV])
    Renovating the memories in old photos is an intriguing research topic in computer vision fields. These legacy images often suffer from severe and commingled degradations such as cracks, noise, and color-fading, while lack of large-scale paired old photo datasets makes this restoration task very challenging. In this work, we present a novel reference-based end-to-end learning framework that can jointly repair and colorize the degraded legacy pictures. Specifically, the proposed framework consists of three modules: a restoration sub-network for degradation restoration, a similarity sub-network for color histogram matching and transfer, and a colorization subnet that learns to predict the chroma elements of the images conditioned on chromatic reference signals. The whole system takes advantage of the color histogram priors in a given reference image, which vastly reduces the dependency on large-scale training data. Apart from the proposed method, we also create, to our knowledge, the first public and real-world old photo dataset with paired ground truth for evaluating old photo restoration models, wherein each old photo is paired with a manually restored pristine image by PhotoShop experts. Our extensive experiments conducted on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-arts both quantitatively and qualitatively.  ( 2 min )
    Test Set Sizing Via Random Matrix Theory. (arXiv:2112.05977v2 [stat.ML] UPDATED)
    This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying the integrity metric, i.e. the empirical model error is the actual measurement noise, and thus fairly reflects the value or lack of same of the model. This paper is the first to solve for the training and test size for any model in a way that is truly optimal. The number of data points in the training set is the root of a quartic polynomial Theorem 1 derives which depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise drop out of the calculations. The critical mathematical difficulties were realizing that the problems herein were discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, and evaluating a new integral in the style of Selberg and Aomoto. Mathematical results are supported with thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.  ( 2 min )
    Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data. (arXiv:2202.02842v1 [cs.CL])
    The search for effective and robust generalization metrics has been the focus of recent theoretical and empirical work. In this paper, we discuss the performance of natural language processing (NLP) models, and we evaluate various existing and novel generalization metrics. Compared to prior studies, we (i) focus on NLP instead of computer vision (CV), (ii) focus on generalization metrics that predict test error instead of the generalization gap, (iii) focus on generalization metrics that do not need the access to data, and (iv) focus on the heavy-tail (HT) phenomenon that has received comparatively less attention in the study of deep neural networks (NNs). We extend recent HT-based work which focuses on power law (PL) distributions, and we study exponential (EXP) and exponentially truncated power law (E-TPL) fitting to the empirical spectral densities (ESDs) of weight matrices. Our detailed empirical studies show that (i) \emph{shape metrics}, or the metrics obtained from fitting the shape of the ESDs, perform uniformly better at predicting generalization performance than \emph{scale metrics} commonly studied in the literature, as measured by the \emph{average} rank correlations with the generalization performance for all of our experiments; (ii) among forty generalization metrics studied in our paper, the \RANDDISTANCE metric, a new shape metric invented in this paper that measures the distance between empirical eigenvalues of weight matrices and those of randomly initialized weight matrices, achieves the highest worst-case rank correlation with generalization performance under a variety of training settings; and (iii) among the three HT distributions considered in our paper, the E-TPL fitting of ESDs performs the most robustly.  ( 2 min )
    The influence of labeling techniques in classifying human manipulation movement of different speed. (arXiv:2202.02426v1 [cs.CV])
    In this work, we investigate the influence of labeling methods on the classification of human movements on data recorded using a marker-based motion capture system. The dataset is labeled using two different approaches, one based on video data of the movements, the other based on the movement trajectories recorded using the motion capture system. The dataset is labeled using two different approaches, one based on video data of the movements, the other based on the movement trajectories recorded using the motion capture system. The data was recorded from one participant performing a stacking scenario comprising simple arm movements at three different speeds (slow, normal, fast). Machine learning algorithms that include k-Nearest Neighbor, Random Forest, Extreme Gradient Boosting classifier, Convolutional Neural networks (CNN), Long Short-Term Memory networks (LSTM), and a combination of CNN-LSTM networks are compared on their performance in recognition of these arm movements. The models were trained on actions performed on slow and normal speed movements segments and generalized on actions consisting of fast-paced human movement. It was observed that all the models trained on normal-paced data labeled using trajectories have almost 20% improvement in accuracy on test data in comparison to the models trained on data labeled using videos of the performed experiments.  ( 2 min )
    Hyperparameter-free and Explainable Whole Graph Embedding. (arXiv:2108.02113v3 [cs.LG] UPDATED)
    Graphs can be used to describe complex systems. Recently, whole graph embedding (graph representation learning) can compress a graph into a compact lower-dimension vector while preserving intrinsic properties, earning much attention. However, most graph embedding methods have problems such as tedious parameter tuning or poor explanation. This paper presents a simple and hyperparameter-free whole graph embedding method based on the DHC (Degree, H-index, and Coreness) theorem and Shannon Entropy (E), abbreviated as DHC-E. The DHC-E can provide a trade-off between simplicity and quality for supervised classification learning tasks involving molecular, social, and brain networks. Moreover, it performs well in lower-dimensional graph visualization. Overall, the DHC-E is simple, hyperparameter-free, and explainable for whole graph embedding with promising potential for exploring graph classification and lower-dimensional graph visualization.  ( 2 min )
    CheXstray: Real-time Multi-Modal Data Concordance for Drift Detection in Medical Imaging AI. (arXiv:2202.02833v1 [eess.IV])
    Rapidly expanding Clinical AI applications worldwide have the potential to impact to all areas of medical practice. Medical imaging applications constitute a vast majority of approved clinical AI applications. Though healthcare systems are eager to adopt AI solutions a fundamental question remains: \textit{what happens after the AI model goes into production?} We use the CheXpert and PadChest public datasets to build and test a medical imaging AI drift monitoring workflow that tracks data and model drift without contemporaneous ground truth. We simulate drift in multiple experiments to compare model performance with our novel multi-modal drift metric, which uses DICOM metadata, image appearance representation from a variational autoencoder (VAE), and model output probabilities as input. Through experimentation, we demonstrate a strong proxy for ground truth performance using unsupervised distributional shifts in relevant metadata, predicted probabilities, and VAE latent representation. Our key contributions include (1) proof-of-concept for medical imaging drift detection including use of VAE and domain specific statistical methods (2) a multi-modal methodology for measuring and unifying drift metrics (3) new insights into the challenges and solutions for observing deployed medical imaging AI (4) creation of open-source tools enabling others to easily run their own workflows or scenarios. This work has important implications for addressing the translation gap related to continuous medical imaging AI model monitoring in dynamic healthcare environments.  ( 2 min )
    LiDAR dataset distillation within bayesian active learning framework: Understanding the effect of data augmentation. (arXiv:2202.02661v1 [cs.CV])
    Autonomous driving (AD) datasets have progressively grown in size in the past few years to enable better deep representation learning. Active learning (AL) has re-gained attention recently to address reduction of annotation costs and dataset size. AL has remained relatively unexplored for AD datasets, especially on point cloud data from LiDARs. This paper performs a principled evaluation of AL based dataset distillation on (1/4th) of the large Semantic-KITTI dataset. Further on, the gains in model performance due to data augmentation (DA) are demonstrated across different subsets of the AL loop. We also demonstrate how DA improves the selection of informative samples to annotate. We observe that data augmentation achieves full dataset accuracy using only 60\% of samples from the selected dataset configuration. This provides faster training time and subsequent gains in annotation costs.  ( 2 min )
    Pipe Overflow: Smashing Voice Authentication for Fun and Profit. (arXiv:2202.02751v1 [cs.LG])
    Recent years have seen a surge of popularity of acoustics-enabled personal devices powered by machine learning. Yet, machine learning has proven to be vulnerable to adversarial examples. Large number of modern systems protect themselves against such attacks by targeting the artificiality, i.e., they deploy mechanisms to detect the lack of human involvement in generating the adversarial examples. However, these defenses implicitly assume that humans are incapable of producing meaningful and targeted adversarial examples. In this paper, we show that this base assumption is wrong. In particular, we demonstrate that for tasks like speaker identification, a human is capable of producing analog adversarial examples directly with little cost and supervision: by simply speaking through a tube, an adversary reliably impersonates other speakers in eyes of ML models for speaker identification. Our findings extend to a range of other acoustic-biometric tasks such as liveness, bringing into question their use in security-critical settings in real life, such as phone banking.  ( 2 min )
    Machine Learning Method for Functional Assessment of Retinal Models. (arXiv:2202.02443v1 [eess.IV])
    Challenges in the field of retinal prostheses motivate the development of retinal models to accurately simulate Retinal Ganglion Cells (RGCs) responses. The goal of retinal prostheses is to enable blind individuals to solve complex, reallife visual tasks. In this paper, we introduce the functional assessment (FA) of retinal models, which describes the concept of evaluating the performance of retinal models on visual understanding tasks. We present a machine learning method for FA: we feed traditional machine learning classifiers with RGC responses generated by retinal models, to solve object and digit recognition tasks (CIFAR-10, MNIST, Fashion MNIST, Imagenette). We examined critical FA aspects, including how the performance of FA depends on the task, how to optimally feed RGC responses to the classifiers and how the number of output neurons correlates with the model's accuracy. To increase the number of output neurons, we manipulated input images - by splitting and then feeding them to the retinal model and we found that image splitting does not significantly improve the model's accuracy. We also show that differences in the structure of datasets result in largely divergent performance of the retinal model (MNIST and Fashion MNIST exceeded 80% accuracy, while CIFAR-10 and Imagenette achieved ~40%). Furthermore, retinal models which perform better in standard evaluation, i.e. more accurately predict RGC response, perform better in FA as well. However, unlike standard evaluation, FA results can be straightforwardly interpreted in the context of comparing the quality of visual perception.  ( 2 min )
    One-Shot Neural Architecture Search via Compressive Sensing. (arXiv:1906.02869v2 [cs.LG] UPDATED)
    Neural Architecture Search remains a very challenging meta-learning problem. Several recent techniques based on parameter-sharing idea have focused on reducing the NAS running time by leveraging proxy models, leading to architectures with competitive performance compared to those with hand-crafted designs. In this paper, we propose an iterative technique for NAS, inspired by algorithms for learning low-degree sparse Boolean functions. We validate our approach on the DARTs search space (Liu et al., 2018b) and NAS-Bench-201 (Yang et al., 2020). In addition, we provide theoretical analysis via upper bounds on the number of validation error measurements needed for reliable learning, and include ablation studies to further in-depth understanding of our technique.  ( 2 min )
    Learning a Discrete Set of Optimal Allocation Rules in Queueing Systems with Unknown Service Rates. (arXiv:2202.02419v1 [eess.SY])
    We study learning-based admission control for a classical Erlang-B blocking system with unknown service rate, i.e., an $M/M/k/k$ queueing system. At every job arrival, a dispatcher decides to assign the job to an available server or to block it. Every served job yields a fixed reward for the dispatcher, but it also results in a cost per unit time of service. Our goal is to design a dispatching policy that maximizes the long-term average reward for the dispatcher based on observing the arrival times and the state of the system at each arrival; critically, the dispatcher observes neither the service times nor departure times. We develop our learning-based dispatch scheme as a parametric learning problem a'la self-tuning adaptive control. In our problem, certainty equivalent control switches between an always admit policy (always explore) and a never admit policy (immediately terminate learning), which is distinct from the adaptive control literature. Our learning scheme then uses maximum likelihood estimation followed by certainty equivalent control but with judicious use of the always admit policy so that learning doesn't stall. We prove that for all service rates, the proposed policy asymptotically learns to take the optimal action. Further, we also present finite-time regret guarantees for our scheme. The extreme contrast in the certainty equivalent optimal control policies leads to difficulties in learning that show up in our regret bounds for different parameter regimes. We explore this aspect in our simulations and also follow-up sampling related questions for our continuous-time system.  ( 2 min )
    A Fast Network Exploration Strategy to Profile Low Energy Consumption for Keyword Spotting. (arXiv:2202.02361v1 [cs.LG])
    Keyword Spotting nowadays is an integral part of speech-oriented user interaction targeted for smart devices. To this extent, neural networks are extensively used for their flexibility and high accuracy. However, coming up with a suitable configuration for both accuracy requirements and hardware deployment is a challenge. We propose a regression-based network exploration technique that considers the scaling of the network filters ($s$) and quantization ($q$) of the network layers, leading to a friendly and energy-efficient configuration for FPGA hardware implementation. We experiment with different combinations of $\mathcal{NN}\scriptstyle\langle q,\,s\rangle \displaystyle$ on the FPGA to profile the energy consumption of the deployed network so that the user can choose the most energy-efficient network configuration promptly. Our accelerator design is deployed on the Xilinx AC 701 platform and has at least 2.1$\times$ and 4$\times$ improvements on energy and energy efficiency results, respectively, compared to recent hardware implementations for keyword spotting.  ( 2 min )
    Capacity of Group-invariant Linear Readouts from Equivariant Representations: How Many Objects can be Linearly Classified Under All Possible Views?. (arXiv:2110.07472v4 [cs.LG] UPDATED)
    Equivariance has emerged as a desirable property of representations of objects subject to identity-preserving transformations that constitute a group, such as translations and rotations. However, the expressivity of a representation constrained by group equivariance is still not fully understood. We address this gap by providing a generalization of Cover's Function Counting Theorem that quantifies the number of linearly separable and group-invariant binary dichotomies that can be assigned to equivariant representations of objects. We find that the fraction of separable dichotomies is determined by the dimension of the space that is fixed by the group action. We show how this relation extends to operations such as convolutions, element-wise nonlinearities, and global and local pooling. While other operations do not change the fraction of separable dichotomies, local pooling decreases the fraction, despite being a highly nonlinear operation. Finally, we test our theory on intermediate representations of randomly initialized and fully trained convolutional neural networks and find perfect agreement.  ( 2 min )
    Accounting for shared covariates in semi-parametric Bayesian additive regression trees. (arXiv:2108.07636v3 [stat.ML] UPDATED)
    We propose some extensions to semi-parametric models based on Bayesian additive regression trees (BART). In the semi-parametric BART paradigm, the response variable is approximated by a linear predictor and a BART model, where the linear component is responsible for estimating the main effects and BART accounts for non-specified interactions and non-linearities. Previous semi-parametric models based on BART have assumed that the set of covariates in the linear predictor and the BART model are mutually exclusive in an attempt to avoid bias and poor coverage properties. The main novelty in our approach lies in the way we change the tree-generation moves in BART to deal with bias/confounding between the parametric and non-parametric components, even when they have covariates in common. This allows us to model complex interactions involving the covariates of primary interest, both among themselves and with those in the BART component. Through synthetic and real-world examples, we demonstrate that the performance of our novel semi-parametric BART is competitive when compared to regression models, alternative formulations of semi-parametric BART, and other tree-based methods. The implementation of the proposed method is available at https://github.com/ebprado/CSP-BART.  ( 2 min )
    Rethinking ValueDice: Does It Really Improve Performance?. (arXiv:2202.02468v1 [cs.LG])
    Since the introduction of GAIL, adversarial imitation learning (AIL) methods attract lots of research interests. Among these methods, ValueDice has achieved significant improvements: it beats the classical approach Behavioral Cloning (BC) under the offline setting, and it requires fewer interactions than GAIL under the online setting. Are these improvements benefited from more advanced algorithm designs? We answer this question with the following conclusions. First, we show that ValueDice could reduce to BC under the offline setting. Second, we verify that overfitting exists and regularization matters. Specifically, we demonstrate that with weight decay, BC also nearly matches the expert performance as ValueDice does. The first two claims explain the superior offline performance of ValueDice. Third, we establish that ValueDice does not work at all when the expert trajectory is subsampled. Instead, the mentioned success holds when the expert trajectory is complete, in which ValueDice is closely related to BC that performs well as mentioned. Finally, we discuss the implications of our research for imitation learning studies beyond ValueDice.  ( 2 min )
    Deep-HyROMnet: A deep learning-based operator approximation for hyper-reduction of nonlinear parametrized PDEs. (arXiv:2202.02658v1 [math.NA])
    To speed-up the solution to parametrized differential problems, reduced order models (ROMs) have been developed over the years, including projection-based ROMs such as the reduced-basis (RB) method, deep learning-based ROMs, as well as surrogate models obtained via a machine learning approach. Thanks to its physics-based structure, ensured by the use of a Galerkin projection of the full order model (FOM) onto a linear low-dimensional subspace, RB methods yield approximations that fulfill the physical problem at hand. However, to make the assembling of a ROM independent of the FOM dimension, intrusive and expensive hyper-reduction stages are usually required, such as the discrete empirical interpolation method (DEIM), thus making this strategy less feasible for problems characterized by (high-order polynomial or nonpolynomial) nonlinearities. To overcome this bottleneck, we propose a novel strategy for learning nonlinear ROM operators using deep neural networks (DNNs). The resulting hyper-reduced order model enhanced by deep neural networks, to which we refer to as Deep-HyROMnet, is then a physics-based model, still relying on the RB method approach, however employing a DNN architecture to approximate reduced residual vectors and Jacobian matrices once a Galerkin projection has been performed. Numerical results dealing with fast simulations in nonlinear structural mechanics show that Deep-HyROMnets are orders of magnitude faster than POD-Galerkin-DEIM ROMs, keeping the same level of accuracy.  ( 2 min )
    Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles. (arXiv:2106.15728v3 [cs.LG] UPDATED)
    When a deep learning model is deployed in the wild, it can encounter test data drawn from distributions different from the training data distribution and suffer drop in performance. For safe deployment, it is essential to estimate the accuracy of the pre-trained model on the test data. However, the labels for the test inputs are usually not immediately available in practice, and obtaining them can be expensive. This observation leads to two challenging tasks: (1) unsupervised accuracy estimation, which aims to estimate the accuracy of a pre-trained classifier on a set of unlabeled test inputs; (2) error detection, which aims to identify mis-classified test inputs. In this paper, we propose a principled and practically effective framework that simultaneously addresses the two tasks. The proposed framework iteratively learns an ensemble of models to identify mis-classified data points and performs self-training to improve the ensemble with the identified points. Theoretical analysis demonstrates that our framework enjoys provable guarantees for both accuracy estimation and error detection under mild conditions readily satisfied by practical deep learning models. Along with the framework, we proposed and experimented with two instantiations and achieved state-of-the-art results on 59 tasks. For example, on iWildCam, one instantiation reduces the estimation error for unsupervised accuracy estimation by at least 70% and improves the F1 score for error detection by at least 4.7% compared to existing methods.  ( 2 min )
    Spelunking the Deep: Guaranteed Queries for General Neural Implicit Surfaces. (arXiv:2202.02444v1 [cs.CV])
    Neural implicit representations, which encode a surface as the level set of a neural network applied to spatial coordinates, have proven to be remarkably effective for optimizing, compressing, and generating 3D geometry. Although these representations are easy to fit, it is not clear how to best evaluate geometric queries on the shape, such as intersecting against a ray or finding a closest point. The predominant approach is to encourage the network to have a signed distance property. However, this property typically holds only approximately, leading to robustness issues, and holds only at the conclusion of training, inhibiting the use of queries in loss functions. Instead, this work presents a new approach to perform queries directly on general neural implicit functions for a wide range of existing architectures. Our key tool is the application of range analysis to neural networks, using automatic arithmetic rules to bound the output of a network over a region; we conduct a study of range analysis on neural networks, and identify variants of affine arithmetic which are highly effective. We use the resulting bounds to develop geometric queries including ray casting, intersection testing, constructing spatial hierarchies, fast mesh extraction, closest-point evaluation, evaluating bulk properties, and more. Our queries can be efficiently evaluated on GPUs, and offer concrete accuracy guarantees even on randomly-initialized networks, enabling their use in training objectives and beyond. We also show a preliminary application to inverse rendering.  ( 2 min )
    Automatic Curricula via Expert Demonstrations. (arXiv:2106.09159v3 [cs.LG] UPDATED)
    We propose Automatic Curricula via Expert Demonstrations (ACED), a reinforcement learning (RL) approach that combines the ideas of imitation learning and curriculum learning in order to solve challenging robotic manipulation tasks with sparse reward functions. Curriculum learning solves complicated RL tasks by introducing a sequence of auxiliary tasks with increasing difficulty, yet how to automatically design effective and generalizable curricula remains a challenging research problem. ACED extracts curricula from a small amount of expert demonstration trajectories by dividing demonstrations into sections and initializing training episodes to states sampled from different sections of demonstrations. Through moving the reset states from the end to the beginning of demonstrations as the learning agent improves its performance, ACED not only learns challenging manipulation tasks with unseen initializations and goals, but also discovers novel solutions that are distinct from the demonstrations. In addition, ACED can be naturally combined with other imitation learning methods to utilize expert demonstrations in a more efficient manner, and we show that a combination of ACED with behavior cloning allows pick-and-place tasks to be learned with as few as 1 demonstration and block stacking tasks to be learned with 20 demonstrations.  ( 2 min )
    Equitable Community Resilience: The Case of Winter Storm Uri in Texas. (arXiv:2201.06652v2 [stat.ML] UPDATED)
    Community resilience in the face of natural hazards relies on a community's potential to bounce back. A failure to integrate equity into resilience considerations results in unequal recovery and disproportionate impacts on vulnerable populations, which has long been a concern in the United States. This research investigated aspects of equity related to community resilience in the aftermath of Winter Storm Uri in Texas which led to extended power outages for more than 4 million households. County level outage and recovery data was analyzed to explore potential significant links between various county attributes and their share of the outages during the recovery and restoration phases. Next, satellite imagery was used to examine data at a much higher geographical resolution focusing on census tracts in the city of Houston. The goal was to use computer vision to extract the extent of outages within census tracts and investigate their linkages to census tracts attributes. Results from various statistical procedures revealed statistically significant negative associations between counties' percentage of non-Hispanic whites and median household income with the ratio of outages. Additionally, at census tract level, variables including percentages of linguistically isolated population and public transport users exhibited positive associations with the group of census tracts that were affected by the outage as detected by computer vision analysis. Informed by these results, engineering solutions such as the applicability of grid modernization technologies, together with distributed and renewable energy resources, when controlled for the region's topographical characteristics, are proposed to enhance equitable power grid resiliency in the face of natural hazards.  ( 2 min )
    Gradient boosting machines and careful pre-processing work best: ASHRAE Great Energy Predictor III lessons learned. (arXiv:2202.02898v1 [cs.LG])
    The ASHRAE Great Energy Predictor III (GEPIII) competition was held in late 2019 as one of the largest machine learning competitions ever held focused on building performance. It was hosted on the Kaggle platform and resulted in 39,402 prediction submissions, with the top five teams splitting $25,000 in prize money. This paper outlines lessons learned from participants, mainly from teams who scored in the top 5% of the competition. Various insights were gained from their experience through an online survey, analysis of publicly shared submissions and notebooks, and the documentation of the winning teams. The top-performing solutions mostly used ensembles of Gradient Boosting Machine (GBM) tree-based models, with the LightGBM package being the most popular. The survey participants indicated that the preprocessing and feature extraction phases were the most important aspects of creating the best modeling approach. All the survey respondents used Python as their primary modeling tool, and it was common to use Jupyter-style Notebooks as development environments. These conclusions are essential to help steer the research and practical implementation of building energy meter prediction in the future.  ( 2 min )
    PAC-Bayes Information Bottleneck. (arXiv:2109.14509v3 [cs.LG] UPDATED)
    Understanding the source of the superior generalization ability of NNs remains one of the most important problems in ML research. There have been a series of theoretical works trying to derive non-vacuous bounds for NNs. Recently, the compression of information stored in weights (IIW) is proved to play a key role in NNs generalization based on the PAC-Bayes theorem. However, no solution of IIW has ever been provided, which builds a barrier for further investigation of the IIW's property and its potential in practical deep learning. In this paper, we propose an algorithm for the efficient approximation of IIW. Then, we build an IIW-based information bottleneck on the trade-off between accuracy and information complexity of NNs, namely PIB. From PIB, we can empirically identify the fitting to compressing phase transition during NNs' training and the concrete connection between the IIW compression and the generalization. Besides, we verify that IIW is able to explain NNs in broad cases, e.g., varying batch sizes, over-parameterization, and noisy labels. Moreover, we propose an MCMC-based algorithm to sample from the optimal weight posterior characterized by PIB, which fulfills the potential of IIW in enhancing NNs in practice.  ( 2 min )
    HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks. (arXiv:2102.02515v4 [cs.LG] UPDATED)
    The behaviors of deep neural networks (DNNs) are notoriously resistant to human interpretations. In this paper, we propose Hypergradient Data Relevance Analysis, or HYDRA, which interprets the predictions made by DNNs as effects of their training data. Existing approaches generally estimate data contributions around the final model parameters and ignore how the training data shape the optimization trajectory. By unrolling the hypergradient of test loss w.r.t. the weights of training data, HYDRA assesses the contribution of training data toward test data points throughout the training trajectory. In order to accelerate computation, we remove the Hessian from the calculation and prove that, under moderate conditions, the approximation error is bounded. Corroborating this theoretical claim, empirical results indicate the error is indeed small. In addition, we quantitatively demonstrate that HYDRA outperforms influence functions in accurately estimating data contribution and detecting noisy data labels. The source code is available at https://github.com/cyyever/aaai_hydra_8686.  ( 2 min )
    Deep Learning for Deepfakes Creation and Detection: A Survey. (arXiv:1909.11573v4 [cs.CV] UPDATED)
    Deep learning has been successfully applied to solve various complex problems ranging from big data analytics to computer vision and human-level control. Deep learning advances however have also been employed to create software that can cause threats to privacy, democracy and national security. One of those deep learning-powered applications recently emerged is deepfake. Deepfake algorithms can create fake images and videos that humans cannot distinguish them from authentic ones. The proposal of technologies that can automatically detect and assess the integrity of digital visual media is therefore indispensable. This paper presents a survey of algorithms used to create deepfakes and, more importantly, methods proposed to detect deepfakes in the literature to date. We present extensive discussions on challenges, research trends and directions related to deepfake technologies. By reviewing the background of deepfakes and state-of-the-art deepfake detection methods, this study provides a comprehensive overview of deepfake techniques and facilitates the development of new and more robust methods to deal with the increasingly challenging deepfakes.  ( 2 min )
    Deep Learning for Regularization Prediction in Diffeomorphic Image Registration. (arXiv:2011.14229v3 [eess.IV] UPDATED)
    This paper presents a predictive model for estimating regularization parameters of diffeomorphic image registration. We introduce a novel framework that automatically determines the parameters controlling the smoothness of diffeomorphic transformations. Our method significantly reduces the effort of parameter tuning, which is time and labor-consuming. To achieve the goal, we develop a predictive model based on deep convolutional neural networks (CNN) that learns the mapping between pairwise images and the regularization parameter of image registration. In contrast to previous methods that estimate such parameters in a high-dimensional image space, our model is built in an efficient bandlimited space with much lower dimensions. We demonstrate the effectiveness of our model on both 2D synthetic data and 3D real brain images. Experimental results show that our model not only predicts appropriate regularization parameters for image registration, but also improving the network training in terms of time and memory efficiency.  ( 2 min )
    Conditional GANs with Auxiliary Discriminative Classifier. (arXiv:2107.10060v4 [cs.LG] UPDATED)
    Conditional generative models aim to learn the underlying joint distribution of data and labels, and thus realize conditional generation. Among them, auxiliary classifier generative adversarial networks (AC-GAN) have been widely used, but suffer from the problem of low intra-class diversity on generated samples. The fundamental reason pointed out in this paper is that the classifier of AC-GAN is generator-agnostic, which therefore cannot provide informative guidance to the generator to approximate the target distribution, resulting in minimization of conditional entropy that decreases the intra-class diversity. Motivated by this, we propose a novel conditional GAN with auxiliary \textit{discriminative} classifier (ADC-GAN) to resolve the above problem. Specifically, the proposed auxiliary \textit{discriminative} classifier becomes generator-aware by recognizing the labels of the real data and the generated data \textit{discriminatively}. Our theoretical analysis reveals that the generator can faithfully replicate the target distribution even without the original discriminator, making the proposed ADC-GAN robust to the value of coefficient hyper-parameter and the selection of GAN loss, and being stable during the training process. Extensive experimental results on synthetic and real-world datasets demonstrate the superiority of ADC-GAN on conditional generative modeling compared with state-of-the-art classifier-based and projection-based cGANs.  ( 2 min )
    Active Learning on a Budget: Opposite Strategies Suit High and Low Budgets. (arXiv:2202.02794v1 [cs.LG])
    Investigating active learning, we focus on the relation between the number of labeled examples (budget size), and suitable corresponding querying strategies. Our theoretical analysis shows a behavior reminiscent of phase transition: typical points should best be queried in the low budget regime, while atypical (or uncertain) points are best queried when the budget is large. Combined evidence from our theoretical and empirical studies shows that a similar phenomenon occurs in simple classification models. Accordingly, we propose TypiClust -- a deep active learning strategy suited for low budgets. In a comparative empirical investigation using a variety of architectures and image datasets, we report that in the low budget regime, TypiClust outperforms all other active learning strategies. Using TypiClust in a semi-supervised framework, the performance of competitive semi-supervised methods gets a significant boost, surpassing the state of the art.  ( 2 min )
    Regret Analysis of Distributed Online LQR Control for Unknown LTI Systems. (arXiv:2105.07310v2 [math.OC] UPDATED)
    Online optimization has recently opened avenues to study optimal control for time-varying cost functions that are unknown in advance. Inspired by this line of research, we study the distributed online linear quadratic regulator (LQR) problem for linear time-invariant (LTI) systems with unknown dynamics. Consider a multi-agent network where each agent is modeled as a LTI system. The network has a global time-varying quadratic cost, which may evolve adversarially and is only partially observed by each agent sequentially. The goal of the network is to collectively (i) estimate the unknown dynamics and (ii) compute local control sequences competitive to the best centralized policy in hindsight, which minimizes the sum of network costs over time. This problem is formulated as a regret minimization. We propose a distributed variant of the online LQR algorithm, where agents compute their system estimates during an exploration stage. Each agent then applies distributed online gradient descent on a semi-definite programming (SDP) whose feasible set is based on the agent system estimate. We prove that with high probability the regret bound of our proposed algorithm scales as $O(T^{2/3}\log T)$, implying the consensus of all agents over time. We also provide simulation results verifying our theoretical guarantee.  ( 2 min )
    ADMM-DAD net: a deep unfolding network for analysis compressed sensing. (arXiv:2110.06986v2 [cs.IT] UPDATED)
    In this paper, we propose a new deep unfolding neural network based on the ADMM algorithm for analysis Compressed Sensing. The proposed network jointly learns a redundant analysis operator for sparsification and reconstructs the signal of interest. We compare our proposed network with a state-of-the-art unfolded ISTA decoder, that also learns an orthogonal sparsifier. Moreover, we consider not only image, but also speech datasets as test examples. Computational experiments demonstrate that our proposed network outperforms the state-of-the-art deep unfolding network, consistently for both real-world image and speech datasets.  ( 2 min )
    On the expressivity of bi-Lipschitz normalizing flows. (arXiv:2107.07232v2 [cs.LG] UPDATED)
    An invertible function is bi-Lipschitz if both the function and its inverse have bounded Lipschitz constants. Nowadays, most Normalizing Flows are bi-Lipschitz by design or by training to limit numerical errors (among other things). In this paper, we discuss the expressivity of bi-Lipschitz Normalizing Flows and identify several target distributions that are difficult to approximate using such models. Then, we characterize the expressivity of bi-Lipschitz Normalizing Flows by giving several lower bounds on the Total Variation distance between these particularly unfavorable distributions and their best possible approximation. Finally, we discuss potential remedies which include using more complex latent distributions.  ( 2 min )
    SIGMA: A Structural Inconsistency Reducing Graph Matching Algorithm. (arXiv:2202.02797v1 [cs.LG])
    Graph matching finds the correspondence of nodes across two correlated graphs and lies at the core of many applications. When graph side information is not available, the node correspondence is estimated on the sole basis of network topologies. In this paper, we propose a novel criterion to measure the graph matching accuracy, structural inconsistency (SI), which is defined based on the network topological structure. Specifically, SI incorporates the heat diffusion wavelet to accommodate the multi-hop structure of the graphs. Based on SI, we propose a Structural Inconsistency reducing Graph Matching Algorithm (SIGMA), which improves the alignment scores of node pairs that have low SI values in each iteration. Under suitable assumptions, SIGMA can reduce SI values of true counterparts. Furthermore, we demonstrate that SIGMA can be derived by using a mirror descent method to solve the Gromov-Wasserstein distance with a novel K-hop-structure-based matching costs. Extensive experiments show that our method outperforms state-of-the-art methods.  ( 2 min )
    Federated Dropout -- A Simple Approach for Enabling Federated Learning on Resource Constrained Devices. (arXiv:2109.15258v3 [cs.LG] UPDATED)
    Federated learning (FL) is a popular framework for training an AI model using distributed mobile data in a wireless network. It features data parallelism by distributing the learning task to multiple edge devices while attempting to preserve their local-data privacy. One main challenge confronting practical FL is that resource constrained devices struggle with the computation intensive task of updating of a deep-neural network model. To tackle the challenge, in this paper, a federated dropout (FedDrop) scheme is proposed building on the classic dropout scheme for random model pruning. Specifically, in each iteration of the FL algorithm, several subnets are independently generated from the global model at the server using dropout but with heterogeneous dropout rates (i.e., parameter-pruning probabilities),each of which is adapted to the state of an assigned channel. The subnets are downloaded to associated devices for updating. Thereby, FedDrop reduces both the communication overhead and devices' computation loads compared with the conventional FL while outperforming the latter in the case of overfitting and also the FL scheme with uniform dropout (i.e., identical subnets).  ( 2 min )
    Learning on Arbitrary Graph Topologies via Predictive Coding. (arXiv:2201.13180v2 [cs.LG] UPDATED)
    Training with backpropagation (BP) in standard deep learning consists of two main steps: a forward pass that maps a data point to its prediction, and a backward pass that propagates the error of this prediction back through the network. This process is highly effective when the goal is to minimize a specific objective function. However, it does not allow training on networks with cyclic or backward connections. This is an obstacle to reaching brain-like capabilities, as the highly complex heterarchical structure of the neural connections in the neocortex are potentially fundamental for its effectiveness. In this paper, we show how predictive coding (PC), a theory of information processing in the cortex, can be used to perform inference and learning on arbitrary graph topologies. We experimentally show how this formulation, called PC graphs, can be used to flexibly perform different tasks with the same network by simply stimulating specific neurons, and investigate how the topology of the graph influences the final performance. We conclude by comparing against simple baselines trained~with~BP.  ( 2 min )
    Lossy Gradient Compression: How Much Accuracy Can One Bit Buy?. (arXiv:2202.02812v1 [cs.LG])
    In federated learning (FL), a global model is trained at a Parameter Server (PS) by aggregating model updates obtained from multiple remote learners. Critically, the communication between the remote users and the PS is limited by the available power for transmission, while the transmission from the PS to the remote users can be considered unbounded. This gives rise to the distributed learning scenario in which the updates from the remote learners have to be compressed so as to meet communication rate constraints in the uplink transmission toward the PS. For this problem, one would like to compress the model updates so as to minimize the resulting loss in accuracy. In this paper, we take a rate-distortion approach to answer this question for the distributed training of a deep neural network (DNN). In particular, we define a measure of the compression performance, the \emph{per-bit accuracy}, which addresses the ultimate model accuracy that a bit of communication brings to the centralized model. In order to maximize the per-bit accuracy, we consider modeling the gradient updates at remote learners as a generalized normal distribution. Under this assumption on the model update distribution, we propose a class of distortion measures for the design of quantizer for the compression of the model updates. We argue that this family of distortion measures, which we refer to as "$M$-magnitude weighted $L_2$" norm, capture the practitioner intuition in the choice of gradient compressor. Numerical simulations are provided to validate the proposed approach.  ( 2 min )
    A Novel Deep Parallel Time-series Relation Network for Fault Diagnosis. (arXiv:2112.03405v2 [cs.LG] UPDATED)
    Considering the models that apply the contextual information of time-series data could improve the fault diagnosis performance, some neural network structures such as RNN, LSTM, and GRU were proposed to model the fault diagnosis effectively. However, these models are restricted by their serial computation and hence cannot achieve high diagnostic efficiency. Also the parallel CNN is difficult to implement fault diagnosis in an efficient way because it requires larger convolution kernels or deep structure to achieve long-term feature extraction capabilities. Besides, BERT model applies absolute position embedding to introduce contextual information to the model, which would bring noise to the raw data and therefore cannot be applied to fault diagnosis directly. In order to address the above problems, a fault diagnosis model named deep parallel time-series relation network(DPTRN) has been proposed in this paper. There are mainly three advantages for DPTRN: (1) Our proposed time relationship unit is based on full multilayer perceptron(MLP) structure, therefore, DPTRN performs fault diagnosis in a parallel way and improves computing efficiency significantly. (2) By improving the absolute position embedding, our novel decoupling position embedding unit could be applied on the fault diagnosis directly and learn contextual information. (3) Our proposed DPTRN has obvious advantage in feature interpretability. We confirm the effect of the proposed method on four datasets, and the results show the effectiveness, efficiency and interpretability of the proposed DPTRN model.  ( 2 min )
    Learning a Single Neuron with Bias Using Gradient Descent. (arXiv:2106.01101v2 [cs.LG] UPDATED)
    We theoretically study the fundamental problem of learning a single neuron with a bias term ($\mathbf{x} \mapsto \sigma( + b)$) in the realizable setting with the ReLU activation, using gradient descent. Perhaps surprisingly, we show that this is a significantly different and more challenging problem than the bias-less case (which was the focus of previous works on single neurons), both in terms of the optimization geometry as well as the ability of gradient methods to succeed in some scenarios. We provide a detailed study of this problem, characterizing the critical points of the objective, demonstrating failure cases, and providing positive convergence guarantees under different sets of assumptions. To prove our results, we develop some tools which may be of independent interest, and improve previous results on learning single neurons.  ( 2 min )
    Learning quantum dynamics with latent neural ODEs. (arXiv:2110.10721v2 [quant-ph] UPDATED)
    The core objective of machine-assisted scientific discovery is to learn physical laws from experimental data without prior knowledge of the systems in question. In the area of quantum physics, making progress towards these goals is significantly more challenging due to the curse of dimensionality as well as the counter-intuitive nature of quantum mechanics. Here, we present the QNODE, a latent neural ODE trained on expectation values of closed and open quantum systems dynamics. It can learn to generate such measurement data and extrapolate outside of its training region that satisfies the von Neumann and time-local Lindblad master equations for closed and open quantum systems respectively in an unsupervised means. Furthermore, the QNODE rediscovers quantum mechanical laws such as the Heisenberg's uncertainty principle in a data-driven way, without any constraint or guidance. Additionally, we show that trajectories that are generated from the QNODE that are close in its latent space have similar quantum dynamics while preserving the physics of the training system.  ( 2 min )
    Semi-verified PAC Learning from the Crowd with Pairwise Comparisons. (arXiv:2106.07080v2 [cs.LG] UPDATED)
    We study the problem of crowdsourced PAC learning of threshold functions with pairwise comparisons. This is a challenging problem and only recently have query-efficient algorithms been established in the scenario where the majority of the crowd are perfect. In this work, we investigate the significantly more challenging case that the majority are incorrect, which in general renders learning impossible. We show that under the semi-verified model of Charikar~et~al.~(2017), where we have (limited) access to a trusted oracle who always returns the correct annotation, it is possible to PAC learn the underlying hypothesis class while drastically mitigating the labeling cost via the more easily obtained comparison queries. Orthogonal to recent developments in semi-verified or list-decodable learning that crucially rely on data distributional assumptions, our PAC guarantee holds by exploring the wisdom of the crowd.  ( 2 min )
    EvadeDroid: A Practical Evasion Attack on Machine Learning for Black-box Android Malware Detection. (arXiv:2110.03301v2 [cs.LG] UPDATED)
    Over the last decade, several studies have investigated the weaknesses of Android malware detectors against adversarial examples by proposing novel evasion attacks; however, their practicality in manipulating real-world malware remains arguable. The majority of studies have assumed attackers know the details of the target classifiers used for malware detection, while in reality, malicious actors have limited access to the target classifiers. This paper presents a practical evasion attack, EvadeDroid, to circumvent black-box Android malware detectors. In addition to generating real-world adversarial malware, the proposed evasion attack can also preserve the functionality of the original malware samples. EvadeDroid prepares a collection of functionality-preserving transformations using an n-gram-based similarity method, which are then used to morph malware instances into benign ones via an iterative and incremental manipulation strategy. The proposed manipulation technique is a novel, query-efficient optimization algorithm with the aim of finding and injecting optimal sequences of transformations into malware samples. Our empirical evaluation demonstrates the efficacy of EvadeDroid under hard- and soft-label attacks. Moreover, EvadeDroid is capable to generate practical adversarial examples with only a small number of queries, with evasion rates of $81\%$, $73\%$, $75\%$, and $79\%$ for DREBIN, Sec-SVM, MaMaDroid, and ADE-MA, respectively. Finally, we show that EvadeDroid is able to preserve its stealthiness against five popular commercial antivirus, thus demonstrating its feasibility in the real world.  ( 2 min )
    Faster Rates for Compressed Federated Learning with Client-Variance Reduction. (arXiv:2112.13097v2 [cs.LG] UPDATED)
    Due to the communication bottleneck in distributed and federated learning applications, algorithms using communication compression have attracted significant attention and are widely used in practice. Moreover, the huge number, high heterogeneity and limited availability of clients result in high client-variance. This paper addresses these two issues together by proposing compressed and client-variance reduced methods COFIG and FRECON. We prove an $O(\frac{(1+\omega)^{3/2}\sqrt{N}}{S\epsilon^2}+\frac{(1+\omega)N^{2/3}}{S\epsilon^2})$ bound on the number of communication rounds of COFIG in the nonconvex setting, where $N$ is the total number of clients, $S$ is the number of clients participating in each round, $\epsilon$ is the convergence error, and $\omega$ is the variance parameter associated with the compression operator. In case of FRECON, we prove an $O(\frac{(1+\omega)\sqrt{N}}{S\epsilon^2})$ bound on the number of communication rounds. In the convex setting, COFIG converges within $O(\frac{(1+\omega)\sqrt{N}}{S\epsilon})$ communication rounds, which is also the first convergence result for compression schemes that do not communicate with all the clients in each round. We stress that neither COFIG nor FRECON needs to communicate with all the clients, and they enjoy the first or faster convergence results for convex and nonconvex federated learning in the regimes considered. Experimental results point to an empirical superiority of COFIG and FRECON over existing baselines.  ( 2 min )
    Artificial Intelligence in the Battle against Coronavirus (COVID-19): A Survey and Future Research Directions. (arXiv:2008.07343v3 [cs.CY] UPDATED)
    Artificial intelligence (AI) has been applied widely in our daily lives in a variety of ways with numerous success stories. AI has also contributed to dealing with the coronavirus disease (COVID-19) pandemic, which has been happening around the globe. This paper presents a survey of AI methods being used in various applications in the fight against the COVID-19 outbreak and outlines the crucial role of AI research in this unprecedented battle. We touch on areas where AI plays as an essential component, from medical image processing, data analytics, text mining and natural language processing, the Internet of Things, to computational biology and medicine. A summary of COVID-19 related data sources that are available for research purposes is also presented. Research directions on exploring the potential of AI and enhancing its capability and power in the pandemic battle are thoroughly discussed. We identify 13 groups of problems related to the COVID-19 pandemic and highlight promising AI methods and tools that can be used to address these problems. It is envisaged that this study will provide AI researchers and the wider community with an overview of the current status of AI applications, and motivate researchers to harness AI's potential in the fight against COVID-19.  ( 2 min )
    Machine Learning Aided Holistic Handover Optimization for Emerging Networks. (arXiv:2202.02851v1 [cs.NI])
    In the wake of network densification and multi-band operation in emerging cellular networks, mobility and handover management is becoming a major bottleneck. The problem is further aggravated by the fact that holistic mobility management solutions for different types of handovers, namely inter-frequency and intra-frequency handovers, remain scarce. This paper presents a first mobility management solution that concurrently optimizes inter-frequency related A5 parameters and intra-frequency related A3 parameters. We analyze and optimize five parameters namely A5-time to trigger (TTT), A5-threshold1, A5-threshold2, A3-TTT, and A3-offset to jointly maximize three critical key performance indicators (KPIs): edge user reference signal received power (RSRP), handover success rate (HOSR) and load between frequency bands. In the absence of tractable analytical models due to system level complexity, we leverage machine learning to quantify the KPIs as a function of the mobility parameters. An XGBoost based model has the best performance for edge RSRP and HOSR while random forest outperforms others for load prediction. An analysis of the mobility parameters provides several insights: 1) there exists a strong coupling between A3 and A5 parameters; 2) an optimal set of parameters exists for each KPI; and 3) the optimal parameters vary for different KPIs. We also perform a SHAP based sensitivity to help resolve the parametric conflict between the KPIs. Finally, we formulate a maximization problem, show it is non-convex, and solve it utilizing simulated annealing (SA). Results indicate that ML-based SA-aided solution is more than 14x faster than the brute force approach with a slight loss in optimality.  ( 2 min )
    Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization. (arXiv:2107.01131v2 [stat.ML] UPDATED)
    Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation not only yields a new unified theoretical framework encompassing popular variational MI bounds but also leads to a novel, simple, and powerful contrastive MI estimator named as FLO. Theoretically, we show that the FLO estimator is tight, and it provably converges under stochastic gradient descent. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.  ( 2 min )
    Deep Representation Learning with an Information-theoretic Loss. (arXiv:2111.12950v5 [cs.LG] UPDATED)
    This paper proposes a deep representation learning using an information-theoretic loss with an aim to increase the inter-class distances as well as within-class similarity in the embedded space. Tasks such as anomaly and out-of-distribution detection, in which test samples comes from classes unseen in training, are problematic for deep neural networks. For such tasks, it is not sufficient to merely discriminate between known classes. Our intuition is to represent the known classes in compact and separated embedded regions in order to decrease the possibility of known and unseen classes overlapping in the embedded space. We derive a loss from Information Bottleneck principle, which reflects the inter-class distances as well as the compactness within classes, thus will extend the existing deep data description models. Our empirical study shows that the proposed model improves the segmentation of normal classes in the deep feature space, and subsequently contributes to identifying out-of-distribution samples.  ( 2 min )
    Anticorrelated Noise Injection for Improved Generalization. (arXiv:2202.02831v1 [stat.ML])
    Injecting artificial noise into gradient descent (GD) is commonly employed to improve the performance of machine learning models. Usually, uncorrelated noise is used in such perturbed gradient descent (PGD) methods. It is, however, not known if this is optimal or whether other types of noise could provide better generalization performance. In this paper, we zoom in on the problem of correlating the perturbations of consecutive PGD steps. We consider a variety of objective functions for which we find that GD with anticorrelated perturbations ("Anti-PGD") generalizes significantly better than GD and standard (uncorrelated) PGD. To support these experimental findings, we also derive a theoretical analysis that demonstrates that Anti-PGD moves to wider minima, while GD and PGD remain stuck in suboptimal regions or even diverge. This new connection between anticorrelated noise and generalization opens the field to novel ways to exploit noise for training machine learning models.  ( 2 min )
    Scaling Law for Recommendation Models: Towards General-purpose User Representations. (arXiv:2111.11294v3 [cs.IR] UPDATED)
    Recent advancement of large-scale pretrained models such as BERT, GPT-3, CLIP, and Gopher, has shown astonishing achievements across various task domains. Unlike vision recognition and language models, studies on general-purpose user representation at scale still remain underexplored. Here we explore the possibility of general-purpose user representation learning by training a universal user encoder at large scales. We demonstrate that the scaling law is present in user representation learning areas, where the training error scales as a power-law with the amount of computation. Our Contrastive Learning User Encoder (CLUE), optimizes task-agnostic objectives, and the resulting user embeddings stretch our expectation of what is possible to do in various downstream tasks. CLUE also shows great transferability to other domains and companies, as performances on an online experiment shows significant improvements in Click-Through-Rate (CTR). Furthermore, we also investigate how the model performance is influenced by the scale factors, such as training data size, model capacity, sequence length, and batch size. Finally, we discuss the broader impacts of CLUE in general.  ( 2 min )
    Transformers and the representation of biomedical background knowledge. (arXiv:2202.02432v1 [cs.CL])
    BioBERT and BioMegatron are Transformers models adapted for the biomedical domain based on publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility to support inference in cancer precision medicine - namely, the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs and diseases. We show that these models do indeed encode biological knowledge, although some of this is lost in fine-tuning for specific tasks. Finally, we analyse how the models behave with regard to biases and imbalances in the dataset.  ( 2 min )
    Information Theoretic Evaluation of Privacy-Leakage, Interpretability, and Transferability for a Novel Trustworthy AI Framework. (arXiv:2106.06046v4 [cs.LG] UPDATED)
    In order to develop machine learning and deep learning models that take into account the guidelines and principles of trustworthy AI, a novel information theoretic trustworthy AI framework is introduced. A unified approach to "privacy-preserving interpretable and transferable learning" is considered for studying and optimizing the tradeoffs between privacy, interpretability, and transferability aspects. A variational membership-mapping Bayesian model is used for the analytical approximations of the defined information theoretic measures for privacy-leakage, interpretability, and transferability. The approach consists of approximating the information theoretic measures via maximizing a lower-bound using variational optimization. The study presents a unified information theoretic approach to study different aspects of trustworthy AI in a rigorous analytical manner. The approach is demonstrated through numerous experiments on benchmark datasets and a real-world biomedical application concerned with the detection of mental stress on individuals using heart rate variability analysis.  ( 2 min )
    BEAS: Blockchain Enabled Asynchronous & Secure Federated Machine Learning. (arXiv:2202.02817v1 [cs.CR])
    Federated Learning (FL) enables multiple parties to distributively train a ML model without revealing their private datasets. However, it assumes trust in the centralized aggregator which stores and aggregates model updates. This makes it prone to gradient tampering and privacy leakage by a malicious aggregator. Malicious parties can also introduce backdoors into the joint model by poisoning the training data or model gradients. To address these issues, we present BEAS, the first blockchain-based framework for N-party FL that provides strict privacy guarantees of training data using gradient pruning (showing improved differential privacy compared to existing noise and clipping based techniques). Anomaly detection protocols are used to minimize the risk of data-poisoning attacks, along with gradient pruning that is further used to limit the efficacy of model-poisoning attacks. We also define a novel protocol to prevent premature convergence in heterogeneous learning environments. We perform extensive experiments on multiple datasets with promising results: BEAS successfully prevents privacy leakage from dataset reconstruction attacks, and minimizes the efficacy of poisoning attacks. Moreover, it achieves an accuracy similar to centralized frameworks, and its communication and computation overheads scale linearly with the number of participants.  ( 2 min )
    Opening the Blackbox: Accelerating Neural Differential Equations by Regularizing Internal Solver Heuristics. (arXiv:2105.03918v2 [cs.LG] UPDATED)
    Democratization of machine learning requires architectures that automatically adapt to new problems. Neural Differential Equations (NDEs) have emerged as a popular modeling framework by removing the need for ML practitioners to choose the number of layers in a recurrent model. While we can control the computational cost by choosing the number of layers in standard architectures, in NDEs the number of neural network evaluations for a forward pass can depend on the number of steps of the adaptive ODE solver. But, can we force the NDE to learn the version with the least steps while not increasing the training cost? Current strategies to overcome slow prediction require high order automatic differentiation, leading to significantly higher training time. We describe a novel regularization method that uses the internal cost heuristics of adaptive differential equation solvers combined with discrete adjoint sensitivities to guide the training process towards learning NDEs that are easier to solve. This approach opens up the blackbox numerical analysis behind the differential equation solver's algorithm and directly uses its local error estimates and stiffness heuristics as cheap and accurate cost estimates. We incorporate our method without any change in the underlying NDE framework and show that our method extends beyond Ordinary Differential Equations to accommodate Neural Stochastic Differential Equations. We demonstrate how our approach can halve the prediction time and, unlike other methods which can increase the training time by an order of magnitude, we demonstrate similar reduction in training times. Together this showcases how the knowledge embedded within state-of-the-art equation solvers can be used to enhance machine learning.  ( 2 min )
    Learning Sparse Graphs via Majorization-Minimization for Smooth Node Signals. (arXiv:2202.02815v1 [eess.SP])
    In this letter, we propose an algorithm for learning a sparse weighted graph by estimating its adjacency matrix under the assumption that the observed signals vary smoothly over the nodes of the graph. The proposed algorithm is based on the principle of majorization-minimization (MM), wherein we first obtain a tight surrogate function for the graph learning objective and then solve the resultant surrogate problem which has a simple closed form solution. The proposed algorithm does not require tuning of any hyperparameter and it has the desirable feature of eliminating the inactive variables in the course of the iterations - which can help speeding up the algorithm. The numerical simulations conducted using both synthetic and real world (brain-network) data show that the proposed algorithm converges faster, in terms of the average number of iterations, than several existing methods in the literature.  ( 2 min )
    Spectral Bias and Task-Model Alignment Explain Generalization in Kernel Regression and Infinitely Wide Neural Networks. (arXiv:2006.13198v6 [stat.ML] UPDATED)
    Generalization beyond a training dataset is a main goal of machine learning, but theoretical understanding of generalization remains an open problem for many models. The need for a new theory is exacerbated by recent observations in deep neural networks where overparameterization leads to better performance, contradicting the conventional wisdom from classical statistics. In this paper, we investigate generalization error for kernel regression, which, besides being a popular machine learning method, also includes infinitely overparameterized neural networks trained with gradient descent. We use techniques from statistical mechanics to derive an analytical expression for generalization error applicable to any kernel or data distribution. We present applications of our theory to real and synthetic datasets, and for many kernels including those that arise from training deep neural networks in the infinite-width limit. We elucidate an inductive bias of kernel regression to explain data with "simple functions", which are identified by solving a kernel eigenfunction problem on the data distribution. This notion of simplicity allows us to characterize whether a kernel is compatible with a learning task, facilitating good generalization performance from a small number of training examples. We show that more data may impair generalization when noisy or not expressible by the kernel, leading to non-monotonic learning curves with possibly many peaks. To further understand these phenomena, we turn to the broad class of rotation invariant kernels, which is relevant to training deep neural networks in the infinite-width limit, and present a detailed mathematical analysis of them when data is drawn from a spherically symmetric distribution and the number of input dimensions is large.  ( 3 min )
    Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative. (arXiv:2109.07437v2 [cs.LG] UPDATED)
    In most settings of practical concern, machine learning practitioners know in advance what end-task they wish to boost with auxiliary tasks. However, widely used methods for leveraging auxiliary data like pre-training and its continued-pretraining variant are end-task agnostic: they rarely, if ever, exploit knowledge of the target task. We study replacing end-task agnostic continued training of pre-trained language models with end-task aware training of said models. We argue that for sufficiently important end-tasks, the benefits of leveraging auxiliary data in a task-aware fashion can justify forgoing the traditional approach of obtaining generic, end-task agnostic representations as with (continued) pre-training. On three different low-resource NLP tasks from two domains, we demonstrate that multi-tasking the end-task and auxiliary objectives results in significantly better downstream task performance than the widely-used task-agnostic continued pre-training paradigm of Gururangan et al. (2020). We next introduce an online meta-learning algorithm that learns a set of multi-task weights to better balance among our multiple auxiliary objectives, achieving further improvements on end-task performance and data efficiency.  ( 2 min )
    Towards a Principled Learning Rate Adaptation for Natural Evolution Strategies. (arXiv:2112.10680v2 [cs.NE] UPDATED)
    Natural Evolution Strategies (NES) is a promising framework for black-box continuous optimization problems. NES optimizes the parameters of a probability distribution based on the estimated natural gradient, and one of the key parameters affecting the performance is the learning rate. We argue that from the viewpoint of the natural gradient method, the learning rate should be determined according to the estimation accuracy of the natural gradient. To do so, we propose a new learning rate adaptation mechanism for NES. The proposed mechanism makes it possible to set a high learning rate for problems that are relatively easy to optimize, which results in speeding up the search. On the other hand, in problems that are difficult to optimize (e.g., multimodal functions), the proposed mechanism makes it possible to set a conservative learning rate when the estimation accuracy of the natural gradient seems to be low, which results in the robust and stable search. The experimental evaluations on unimodal and multimodal functions demonstrate that the proposed mechanism works properly depending on a search situation and is effective over the existing method, i.e., using the fixed learning rate.  ( 2 min )
    Learning and Fast Adaptation for Grid Emergency Control via Deep Meta Reinforcement Learning. (arXiv:2101.05317v2 [cs.LG] UPDATED)
    As power systems are undergoing a significant transformation with more uncertainties, less inertia and closer to operation limits, there is increasing risk of large outages. Thus, there is an imperative need to enhance grid emergency control to maintain system reliability and security. Towards this end, great progress has been made in developing deep reinforcement learning (DRL) based grid control solutions in recent years. However, existing DRL-based solutions have two main limitations: 1) they cannot handle well with a wide range of grid operation conditions, system parameters, and contingencies; 2) they generally lack the ability to fast adapt to new grid operation conditions, system parameters, and contingencies, limiting their applicability for real-world applications. In this paper, we mitigate these limitations by developing a novel deep meta reinforcement learning (DMRL) algorithm. The DMRL combines the meta strategy optimization together with DRL, and trains policies modulated by a latent space that can quickly adapt to new scenarios. We test the developed DMRL algorithm on the IEEE 300-bus system. We demonstrate fast adaptation of the meta-trained DRL polices with latent variables to new operating conditions and scenarios using the proposed method and achieve superior performance compared to the state-of-the-art DRL and model predictive control (MPC) methods.  ( 2 min )
    Low-confidence Samples Matter for Domain Adaptation. (arXiv:2202.02802v1 [cs.CV])
    Domain adaptation (DA) aims to transfer knowledge from a label-rich source domain to a related but label-scarce target domain. The conventional DA strategy is to align the feature distributions of the two domains. Recently, increasing researches have focused on self-training or other semi-supervised algorithms to explore the data structure of the target domain. However, the bulk of them depend largely on confident samples in order to build reliable pseudo labels, prototypes or cluster centers. Representing the target data structure in such a way would overlook the huge low-confidence samples, resulting in sub-optimal transferability that is biased towards the samples similar to the source domain. To overcome this issue, we propose a novel contrastive learning method by processing low-confidence samples, which encourages the model to make use of the target data structure through the instance discrimination process. To be specific, we create positive and negative pairs only using low-confidence samples, and then re-represent the original features with the classifier weights rather than directly utilizing them, which can better encode the task-specific semantic information. Furthermore, we combine cross-domain mixup to augment the proposed contrastive loss. Consequently, the domain gap can be well bridged through contrastive learning of intermediate representations across domains. We evaluate the proposed method in both unsupervised and semi-supervised DA settings, and extensive experimental results on benchmarks reveal that our method is effective and achieves state-of-the-art performance. The code can be found in https://github.com/zhyx12/MixLRCo.  ( 2 min )
    Effects of Parametric and Non-Parametric Methods on High Dimensional Sparse Matrix Representations. (arXiv:2202.02894v1 [cs.LG])
    The semantics are derived from textual data that provide representations for Machine Learning algorithms. These representations are interpretable form of high dimensional sparse matrix that are given as an input to the machine learning algorithms. Since learning methods are broadly classified as parametric and non-parametric learning methods, in this paper we provide the effects of these type of algorithms on the high dimensional sparse matrix representations. In order to derive the representations from the text data, we have considered TF-IDF representation with valid reason in the paper. We have formed representations of 50, 100, 500, 1000 and 5000 dimensions respectively over which we have performed classification using Linear Discriminant Analysis and Naive Bayes as parametric learning method, Decision Tree and Support Vector Machines as non-parametric learning method. We have later provided the metrics on every single dimension of the representation and effect of every single algorithm detailed in this paper.  ( 2 min )
    An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning. (arXiv:2107.13530v2 [eess.AS] UPDATED)
    We present a method for transferring pre-trained self-supervised (SSL) speech representations to multiple languages. There is an abundance of unannotated speech, so creating self-supervised representations from raw audio and fine-tuning on small annotated datasets is a promising direction to build speech recognition systems. SSL models generally perform SSL on raw audio in a pre-training phase and then fine-tune on a small fraction of annotated data. Such models have produced state of the art results for ASR. However, these models are very expensive to pre-train. We use an existing wav2vec 2.0 model and tackle the problem of learning new language representations while utilizing existing model knowledge. Crucially we do so without catastrophic forgetting of the existing language representation. We use adapter modules to speed up pre-training a new language task. Our model can decrease pre-training times by 32% when learning a new language task, and learn this new audio-language representation without forgetting previous language representation. We evaluate by applying these language representations to automatic speech recognition.  ( 2 min )
    Estimating the Euclidean Quantum Propagator with Deep Generative Modelling of Feynman paths. (arXiv:2202.02750v1 [quant-ph])
    Feynman path integrals provide an elegant, classically-inspired representation for the quantum propagator and the quantum dynamics, through summing over a huge manifold of all possible paths. From computational and simulational perspectives, the ergodic tracking of the whole path manifold is a hard problem. Machine learning can help, in an efficient manner, to identify the relevant subspace and the intrinsic structure residing at a small fraction of the vast path manifold. In this work, we propose the concept of Feynman path generator, which efficiently generates Feynman paths with fixed endpoints from a (low-dimensional) latent space, by targeting a desired density of paths in the Euclidean space-time. With such path generators, the Euclidean propagator as well as the ground state wave function can be estimated efficiently for a generic potential energy. Our work leads to a fresh approach for calculating the quantum propagator, paves the way toward generative modelling of Feynman paths, and may also provide a future new perspective to understand the quantum-classical correspondence through deep learning.  ( 2 min )
    On the Power of Differentiable Learning versus PAC and SQ Learning. (arXiv:2108.04190v2 [cs.LG] UPDATED)
    We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision $\rho$ of the gradient calculations relative to the minibatch size $b$ (for SGD) and sample size $m$ (for GD). With fine enough precision relative to minibatch size, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Similarly, with fine enough precision relative to the sample size $m$, GD can also simulate any sample-based learning algorithm based on $m$ samples. In particular, with polynomially many bits of precision (i.e. when $\rho$ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size. On the other hand, when $b \rho^2$ is large enough, the power of SGD is equivalent to that of SQ learning.  ( 2 min )
    Learning under Storage and Privacy Constraints. (arXiv:2202.02892v1 [cs.IT])
    Storage-efficient privacy-guaranteed learning is crucial due to enormous amounts of sensitive user data required for increasingly many learning tasks. We propose a framework for reducing the storage cost while at the same time providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when appropriately matching the lossy compression to the distribution of the added noise, the compressed examples converge, in distribution, to that of the noise-free training data. In this sense, the utility of the data for learning is essentially maintained, while reducing storage and privacy leakage by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that our suggested pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level), overall storage of the data is substantially reduced, with no essential loss of the classification accuracy. As an added bonus, our experiments suggest that our method yields a substantial boost to robustness in the face of adversarial test data.  ( 2 min )
    Trusted Approximate Policy Iteration with Bisimulation Metrics. (arXiv:2202.02881v1 [cs.LG])
    Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Due to this property they provide theoretical guarantees in value function approximation. In this work we first prove that bisimulation metrics can be defined via any $p$-Wasserstein metric for $p\geq 1$. Then we describe an approximate policy iteration (API) procedure that uses $\epsilon$-aggregation with $\pi$-bisimulation and prove performance bounds for continuous state spaces. We bound the difference between $\pi$-bisimulation metrics in terms of the change in the policies themselves. Based on these theoretical results, we design an API($\alpha$) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. In addition, we propose a novel trust region approach which circumvents the requirement to explicitly solve a constrained optimization problem. Finally, we provide experimental evidence of improved stability compared to non-conservative alternatives in simulated continuous control.  ( 2 min )
    Predicting the Timing of Camera Movements From the Kinematics of Instruments in Robotic-Assisted Surgery Using Artificial Neural Networks. (arXiv:2109.11192v3 [cs.LG] UPDATED)
    Robotic-assisted surgeries benefit both surgeons and patients, however, surgeons frequently need to adjust the endoscopic camera to achieve good viewpoints. Simultaneously controlling the camera and the surgical instruments is impossible, and consequentially, these camera adjustments repeatedly interrupt the surgery. Autonomous camera control could help overcome this challenge, but most existing systems are reactive, e.g., by having the camera follow the surgical instruments. We propose a predictive approach for anticipating when camera movements will occur using artificial neural networks. We used the kinematic data of the surgical instruments, which were recorded during robotic-assisted surgical training on porcine models. We split the data into segments, and labeled each either as a segment that immediately precedes a camera movement, or one that does not. Due to the large class imbalance, we trained an ensemble of networks, each on a balanced sub-set of the training data. We found that the instruments' kinematic data can be used to predict when camera movements will occur, and evaluated the performance on different segment durations and ensemble sizes. We also studied how much in advance an upcoming camera movement can be predicted, and found that predicting a camera movement 0.25, 0.5, and 1 second before they occurred achieved 98%, 94%, and 84% accuracy relative to the prediction of an imminent camera movement. This indicates that camera movement events can be predicted early enough to leave time for computing and executing an autonomous camera movement and suggests that an autonomous camera controller for RAMIS may one day be feasible.  ( 2 min )
    Kanerva++: extending The Kanerva Machine with differentiable, locally block allocated latent memory. (arXiv:2103.03905v3 [cs.NE] UPDATED)
    Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. We take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block allocated latent memory. In contrast to the Kanerva Machine, we simplify the process of memory writing by treating it as a fully feed forward deterministic process, relying on the stochasticity of the read key distribution to disperse information within the memory. We demonstrate that this allocation scheme improves performance in memory conditional image generation, resulting in new state-of-the-art conditional likelihood values on binarized MNIST (<=41.58 nats/image) , binarized Omniglot (<=66.24 nats/image), as well as presenting competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet32x32.  ( 2 min )
    Riemannian Score-Based Generative Modeling. (arXiv:2202.02763v1 [cs.LG])
    Score-based generative models (SGMs) are a novel class of generative models demonstrating remarkable empirical performance. One uses a diffusion to add progressively Gaussian noise to the data, while the generative model is a "denoising" process obtained by approximating the time-reversal of this "noising" diffusion. However, current SGMs make the underlying assumption that the data is supported on a Euclidean manifold with flat geometry. This prevents the use of these models for applications in robotics, geoscience or protein modeling which rely on distributions defined on Riemannian manifolds. To overcome this issue, we introduce Riemannian Score-based Generative Models (RSGMs) which extend current SGMs to the setting of compact Riemannian manifolds. We illustrate our approach with earth and climate science data and show how RSGMs can be accelerated by solving a Schr\"odinger bridge problem on manifolds.  ( 2 min )
    Out-of-Distribution Generalization in Kernel Regression. (arXiv:2106.02261v3 [stat.ML] UPDATED)
    In real word applications, data generating process for training a machine learning model often differs from what the model encounters in the test stage. Understanding how and whether machine learning models generalize under such distributional shifts have been a theoretical challenge. Here, we study generalization in kernel regression when the training and test distributions are different using methods from statistical physics. Using the replica method, we derive an analytical formula for the out-of-distribution generalization error applicable to any kernel and real datasets. We identify an overlap matrix that quantifies the mismatch between distributions for a given kernel as a key determinant of generalization performance under distribution shift. Using our analytical expressions we elucidate various generalization phenomena including possible improvement in generalization when there is a mismatch. We develop procedures for optimizing training and test distributions for a given data budget to find best and worst case generalizations under the shift. We present applications of our theory to real and synthetic datasets and for many kernels. We compare results of our theory applied to Neural Tangent Kernel with simulations of wide networks and show agreement. We analyze linear regression in further depth.  ( 2 min )
    Importance Weighting Approach in Kernel Bayes' Rule. (arXiv:2202.02474v1 [stat.ML])
    We study a nonparametric approach to Bayesian computation via feature means, where the expectation of prior features is updated to yield expected posterior features, based on regression from kernel or neural net features of the observations. All quantities involved in the Bayesian update are learned from observed data, making the method entirely model-free. The resulting algorithm is a novel instance of a kernel Bayes' rule (KBR). Our approach is based on importance weighting, which results in superior numerical stability to the existing approach to KBR, which requires operator inversion. We show the convergence of the estimator using a novel consistency analysis on the importance weighting estimator in the infinity norm. We evaluate our KBR on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high dimensional image observations. The proposed method yields uniformly better empirical performance than the existing KBR, and competitive performance with other competing methods.  ( 2 min )
    Energy awareness in low precision neural networks. (arXiv:2202.02783v1 [cs.LG])
    Power consumption is a major obstacle in the deployment of deep neural networks (DNNs) on end devices. Existing approaches for reducing power consumption rely on quite general principles, including avoidance of multiplication operations and aggressive quantization of weights and activations. However, these methods do not take into account the precise power consumed by each module in the network, and are therefore not optimal. In this paper we develop accurate power consumption models for all arithmetic operations in the DNN, under various working conditions. We reveal several important factors that have been overlooked to date. Based on our analysis, we present PANN (power-aware neural network), a simple approach for approximating any full-precision network by a low-power fixed-precision variant. Our method can be applied to a pre-trained network, and can also be used during training to achieve improved performance. In contrast to previous methods, PANN incurs only a minor degradation in accuracy w.r.t. the full-precision version of the network, even when working at the power-budget of a 2-bit quantized variant. In addition, our scheme enables to seamlessly traverse the power-accuracy trade-off at deployment time, which is a major advantage over existing quantization methods that are constrained to specific bit widths.  ( 2 min )
    Stratification of carotid atheromatous plaque using interpretable deep learning methods on B-mode ultrasound images. (arXiv:2202.02428v1 [eess.IV])
    Carotid atherosclerosis is the major cause of ischemic stroke resulting in significant rates of mortality and disability annually. Early diagnosis of such cases is of great importance, since it enables clinicians to apply a more effective treatment strategy. This paper introduces an interpretable classification approach of carotid ultrasound images for the risk assessment and stratification of patients with carotid atheromatous plaque. To address the highly imbalanced distribution of patients between the symptomatic and asymptomatic classes (16 vs 58, respectively), an ensemble learning scheme based on a sub-sampling approach was applied along with a two-phase, cost-sensitive strategy of learning, that uses the original and a resampled data set. Convolutional Neural Networks (CNNs) were utilized for building the primary models of the ensemble. A six-layer deep CNN was used to automatically extract features from the images, followed by a classification stage of two fully connected layers. The obtained results (Area Under the ROC Curve (AUC): 73%, sensitivity: 75%, specificity: 70%) indicate that the proposed approach achieved acceptable discrimination performance. Finally, interpretability methods were applied on the model's predictions in order to reveal insights on the model's decision process as well as to enable the identification of novel image biomarkers for the stratification of patients with carotid atheromatous plaque.Clinical Relevance-The integration of interpretability methods with deep learning strategies can facilitate the identification of novel ultrasound image biomarkers for the stratification of patients with carotid atheromatous plaque.  ( 2 min )
    Is High Variance Unavoidable in RL? A Case Study in Continuous Control. (arXiv:2110.11222v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) experiments have notoriously high variance, and minor details can have disproportionately large effects on measured outcomes. This is problematic for creating reproducible research and also serves as an obstacle for real-world applications, where safety and predictability are paramount. In this paper, we investigate causes for this perceived instability. To allow for an in-depth analysis, we focus on a specifically popular setup with high variance -- continuous control from pixels with an actor-critic agent. In this setting, we demonstrate that variance mostly arises early in training as a result of poor "outlier" runs, but that weight initialization and initial exploration are not to blame. We show that one cause for early variance is numerical instability which leads to saturating nonlinearities. We investigate several fixes to this issue and find that one particular method is surprisingly effective and simple -- normalizing penultimate features. Addressing the learning instability allows for larger learning rates, and significantly decreases the variance of outcomes. This demonstrates that the perceived variance in RL is not necessarily inherent to the problem definition and may be addressed through simple architectural modifications.  ( 2 min )
    MPF6D: Masked Pyramid Fusion 6D Pose Estimation. (arXiv:2111.09378v2 [cs.CV] UPDATED)
    Object pose estimation has multiple important applications, such as robotic grasping and augmented reality. We present a new method to estimate the 6D pose of objects that improves upon the accuracy of current proposals and can still be used in real-time. Our method uses RGB-D data as input to segment objects and estimate their pose. It uses a neural network with multiple heads to identify the objects in the scene, generate the appropriate masks and estimate the values of the translation vectors and the quaternion that represents the objects' rotation. These heads leverage a pyramid architecture used during feature extraction and feature fusion. We conduct an empirical evaluation using the two most common datasets in the area, and compare against state-of-the-art approaches, illustrating the capabilities of MPF6D. Our method can be used in real-time with its low inference time and high accuracy.  ( 2 min )
    Linking Across Data Granularity: Fitting Multivariate Hawkes Processes to Partially Interval-Censored Data. (arXiv:2111.02062v2 [cs.LG] UPDATED)
    This work introduces a novel multivariate temporal point process, the Partial Mean Behavior Poisson (PMBP) process, which can be leveraged to fit the multivariate Hawkes process to partially interval-censored data consisting of a mix of event timestamps on a subset of dimensions and interval-censored event counts on the complementary dimensions. First, we define the PMBP process via its conditional intensity and derive the regularity conditions for subcriticality. We show that both the Hawkes process and the MBP process (Rizoiu et al. (2021)) are special cases of the PMBP process. Second, we provide numerical schemes that enable calculating the conditional intensity and sampling event histories of the PMBP process. Third, we demonstrate the applicability of the PMBP process by empirical testing using synthetic and real-world datasets: We test the capability of the PMBP process to recover multivariate Hawkes parameters given sample event histories of the Hawkes process. Next, we evaluate the PMBP process on the Youtube popularity prediction task and show that it outperforms the current state-of-the-art Hawkes Intensity process (Rizoiu et al. (2017b)). Lastly, on a curated dataset of COVID19 daily case counts and COVID19-related news articles for a sample of countries, we show that clustering on the PMBP-fitted parameters enables a categorization of countries with respect to the country-level interaction of cases and news reporting.  ( 2 min )
    On Neural Differential Equations. (arXiv:2202.02435v1 [cs.LG])
    The conjoining of dynamical systems and deep learning has become a topic of great interest. In particular, neural differential equations (NDEs) demonstrate that neural networks and differential equation are two sides of the same coin. Traditional parameterised differential equations are a special case. Many popular neural network architectures, such as residual networks and recurrent networks, are discretisations. NDEs are suitable for tackling generative problems, dynamical systems, and time series (particularly in physics, finance, ...) and are thus of interest to both modern machine learning and traditional mathematical modelling. NDEs offer high-capacity function approximation, strong priors on model space, the ability to handle irregular data, memory efficiency, and a wealth of available theory on both sides. This doctoral thesis provides an in-depth survey of the field. Topics include: neural ordinary differential equations (e.g. for hybrid neural/mechanistic modelling of physical systems); neural controlled differential equations (e.g. for learning functions of irregular time series); and neural stochastic differential equations (e.g. to produce generative models capable of representing complex stochastic dynamics, or sampling from complex high-dimensional distributions). Further topics include: numerical methods for NDEs (e.g. reversible differential equations solvers, backpropagation through differential equations, Brownian reconstruction); symbolic regression for dynamical systems (e.g. via regularised evolution); and deep implicit models (e.g. deep equilibrium models, differentiable optimisation). We anticipate this thesis will be of interest to anyone interested in the marriage of deep learning with dynamical systems, and hope it will provide a useful reference for the current state of the art.  ( 2 min )
    LyaNet: A Lyapunov Framework for Training Neural ODEs. (arXiv:2202.02526v1 [cs.LG])
    We propose a method for training ordinary differential equations by using a control-theoretic Lyapunov condition for stability. Our approach, called LyaNet, is based on a novel Lyapunov loss formulation that encourages the inference dynamics to converge quickly to the correct prediction. Theoretically, we show that minimizing Lyapunov loss guarantees exponential convergence to the correct solution and enables a novel robustness guarantee. We also provide practical algorithms, including one that avoids the cost of backpropagating through a solver or using the adjoint method. Relative to standard Neural ODE training, we empirically find that LyaNet can offer improved prediction performance, faster convergence of inference dynamics, and improved adversarial robustness. Our code available at https://github.com/ivandariojr/LyapunovLearning .  ( 2 min )
    JARVix at SemEval-2022 Task 2: It Takes One to Know One? Idiomaticity Detection using Zero and One Shot Learning. (arXiv:2202.02394v1 [cs.CL])
    Large Language Models have been successful in a wide variety of Natural Language Processing tasks by capturing the compositionality of the text representations. In spite of their great success, these vector representations fail to capture meaning of idiomatic multi-word expressions (MWEs). In this paper, we focus on the detection of idiomatic expressions by using binary classification. We use a dataset consisting of the literal and idiomatic usage of MWEs in English and Portuguese. Thereafter, we perform the classification in two different settings: zero shot and one shot, to determine if a given sentence contains an idiom or not. N shot classification for this task is defined by N number of common idioms between the training and testing sets. In this paper, we train multiple Large Language Models in both the settings and achieve an F1 score (macro) of 0.73 for the zero shot setting and an F1 score (macro) of 0.85 for the one shot setting. An implementation of our work can be found at https://github.com/ashwinpathak20/Idiomaticity_Detection_Using_Few_Shot_Learning .  ( 2 min )
    Adaptive Fine-Tuning of Transformer-Based Language Models for Named Entity Recognition. (arXiv:2202.02617v1 [cs.CL])
    The current standard approach for fine-tuning transformer-based language models includes a fixed number of training epochs and a linear learning rate schedule. In order to obtain a near-optimal model for the given downstream task, a search in optimization hyperparameter space is usually required. In particular, the number of training epochs needs to be adjusted to the dataset size. In this paper, we introduce adaptive fine-tuning, which is an alternative approach that uses early stopping and a custom learning rate schedule to dynamically adjust the number of training epochs to the dataset size. For the example use case of named entity recognition, we show that our approach not only makes hyperparameter search with respect to the number of training epochs redundant, but also leads to improved results in terms of performance, stability and efficiency. This holds true especially for small datasets, where we outperform the state-of-the-art fine-tuning method by a large margin.  ( 2 min )
    Physics Informed Machine Learning of SPH: Machine Learning Lagrangian Turbulence. (arXiv:2110.13311v2 [physics.flu-dyn] UPDATED)
    Smoothed particle hydrodynamics (SPH) is a mesh-free Lagrangian method for obtaining approximate numerical solutions of the equations of fluid dynamics; which has been widely applied to weakly- and strongly compressible turbulence in astrophysics and engineering applications. We present a learn-able hierarchy of parameterized and "physics-explainable" Lagrangian based fluid simulators using both physics based parameters and Neural Networks (NNs) as universal function approximators. This hierarchy of parameterized Lagrangian models gradually introduces more SPH based structure, which we show improves interpretability, generalizability (over larger ranges of time scales and Mach numbers), preservation of physical symmetries (corresponding to conservation of linear and angular momentum), and requires less training data. Our learning algorithm develops a mixed mode approach, mixing forward and reverse mode automatic differentiation with local sensitivity analyses to efficiently perform gradient based optimization. We train this hierarchy on both weakly compressible SPH and DNS data, and show that our physics informed learning method is capable of: (a) solving inverse problems over the physically interpretable parameter space, as well as over the space of NN parameters; (b) learning Lagrangian statistics of turbulence (interpolation); (c) combining Lagrangian trajectory based, probabilistic, and Eulerian field based loss functions; (d) extrapolating beyond training sets into more complex regimes of interest; (e) learning new parameterized smoothing kernels better suited to weakly compressible DNS turbulence data.  ( 2 min )
    Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates. (arXiv:2112.15025v2 [cs.LG] UPDATED)
    We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks with no or very little new data. Specifically, we consider the framework of generalized policy evaluation and improvement, in which the rewards for all tasks of interest are assumed to be expressible as a linear combination of a fixed set of features. We show theoretically that, under certain assumptions, having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance on all possible downstream tasks which are typically more complex than the ones on which the agent was trained. Based on this theoretical analysis, we propose a simple algorithm that iteratively constructs this set of policies. In addition to empirically validating our theoretical results, we compare our approach with recently proposed diverse policy set construction methods and show that, while others fail, our approach is able to build a behavior basis that enables instantaneous transfer to all possible downstream tasks. We also show empirically that having access to a set of independent policies can better bootstrap the learning process on downstream tasks where the new reward function cannot be described as a linear combination of the features. Finally, we demonstrate how this policy set can be useful in a lifelong reinforcement learning setting.  ( 2 min )
    Efficient Logistic Regression with Local Differential Privacy. (arXiv:2202.02650v1 [cs.CR])
    Internet of Things devices are expanding rapidly and generating huge amount of data. There is an increasing need to explore data collected from these devices. Collaborative learning provides a strategic solution for the Internet of Things settings but also raises public concern over data privacy. In recent years, large amount of privacy preserving techniques have been developed based on differential privacy and secure multi-party computation. A major challenge of collaborative learning is to balance disclosure risk and data utility while maintaining high computation efficiency. In this paper, we proposed privacy preserving logistic regression model using matrix encryption approach. The secure scheme achieves local differential privacy and can be implemented for both vertical and horizontal partitioning scenarios. Moreover, cross validation is investigated to generate robust model results without increasing the communication cost. Simulation illustrates the high efficiency of proposed scheme to analyze dataset with millions of records. Experimental evaluations further demonstrate high model accuracy while achieving privacy protection.  ( 2 min )
    Learning Optimal Resource Allocations in Wireless Systems. (arXiv:1807.08088v3 [cs.LG] UPDATED)
    This paper considers the design of optimal resource allocation policies in wireless communication systems which are generically modeled as a functional optimization problem with stochastic constraints. These optimization problems have the structure of a learning problem in which the statistical loss appears as a constraint, motivating the development of learning methodologies to attempt their solution. To handle stochastic constraints, training is undertaken in the dual domain. It is shown that this can be done with small loss of optimality when using near-universal learning parameterizations. In particular, since deep neural networks (DNN) are near-universal their use is advocated and explored. DNNs are trained here with a model-free primal-dual method that simultaneously learns a DNN parametrization of the resource allocation policy and optimizes the primal and dual variables. Numerical simulations demonstrate the strong performance of the proposed approach on a number of common wireless resource allocation problems.  ( 2 min )
    Training Differentially Private Models with Secure Multiparty Computation. (arXiv:2202.02625v1 [cs.CR])
    We address the problem of learning a machine learning model from training data that originates at multiple data owners while providing formal privacy guarantees regarding the protection of each owner's data. Existing solutions based on Differential Privacy (DP) achieve this at the cost of a drop in accuracy. Solutions based on Secure Multiparty Computation (MPC) do not incur such accuracy loss but leak information when the trained model is made publicly available. We propose an MPC solution for training DP models. Our solution relies on an MPC protocol for model training, and an MPC protocol for perturbing the trained model coefficients with Laplace noise in a privacy-preserving manner. The resulting MPC+DP approach achieves higher accuracy than a pure DP approach while providing the same formal privacy guarantees. Our work obtained first place in the iDASH2021 Track III competition on confidential computing for secure genome analysis.  ( 2 min )
    A survey of top-down approaches for human pose estimation. (arXiv:2202.02656v1 [cs.CV])
    Human pose estimation in two-dimensional images videos has been a hot topic in the computer vision problem recently due to its vast benefits and potential applications for improving human life, such as behaviors recognition, motion capture and augmented reality, training robots, and movement tracking. Many state-of-the-art methods implemented with Deep Learning have addressed several challenges and brought tremendous remarkable results in the field of human pose estimation. Approaches are classified into two kinds: the two-step framework (top-down approach) and the part-based framework (bottom-up approach). While the two-step framework first incorporates a person detector and then estimates the pose within each box independently, detecting all body parts in the image and associating parts belonging to distinct persons is conducted in the part-based framework. This paper aims to provide newcomers with an extensive review of deep learning methods-based 2D images for recognizing the pose of people, which only focuses on top-down approaches since 2016. The discussion through this paper presents significant detectors and estimators depending on mathematical background, the challenges and limitations, benchmark datasets, evaluation metrics, and comparison between methods.  ( 2 min )
    Parameter-free Online Linear Optimization with Side Information via Universal Coin Betting. (arXiv:2202.02406v1 [cs.IT])
    A class of parameter-free online linear optimization algorithms is proposed that harnesses the structure of an adversarial sequence by adapting to some side information. These algorithms combine the reduction technique of Orabona and P{\'a}l (2016) for adapting coin betting algorithms for online linear optimization with universal compression techniques in information theory for incorporating sequential side information to coin betting. Concrete examples are studied in which the side information has a tree structure and consists of quantized values of the previous symbols of the adversarial sequence, including fixed-order and variable-order Markov cases. By modifying the context-tree weighting technique of Willems, Shtarkov, and Tjalkens (1995), the proposed algorithm is further refined to achieve the best performance over all adaptive algorithms with tree-structured side information of a given maximum order in a computationally efficient manner.  ( 2 min )
    Adversarially Trained Actor Critic for Offline Reinforcement Learning. (arXiv:2202.02446v1 [cs.LG])
    We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning under insufficient data coverage, based on a two-player Stackelberg game framing of offline RL: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks  ( 2 min )
    Portfolio Optimization on NIFTY Thematic Sector Stocks Using an LSTM Model. (arXiv:2202.02723v1 [q-fin.PM])
    Portfolio optimization has been a broad and intense area of interest for quantitative and statistical finance researchers and financial analysts. It is a challenging task to design a portfolio of stocks to arrive at the optimized values of the return and risk. This paper presents an algorithmic approach for designing optimum risk and eigen portfolios for five thematic sectors of the NSE of India. The prices of the stocks are extracted from the web from Jan 1, 2016, to Dec 31, 2020. Optimum risk and eigen portfolios for each sector are designed based on ten critical stocks from the sector. An LSTM model is designed for predicting future stock prices. Seven months after the portfolios were formed, on Aug 3, 2021, the actual returns of the portfolios are compared with the LSTM-predicted returns. The predicted and the actual returns indicate a very high-level accuracy of the LSTM model.  ( 2 min )
    A Temporal-Difference Approach to Policy Gradient Estimation. (arXiv:2202.02396v1 [cs.LG])
    The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause the convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples.  ( 2 min )
    The impact of feature importance methods on the interpretation of defect classifiers. (arXiv:2202.02389v1 [cs.LG])
    Classifier specific (CS) and classifier agnostic (CA) feature importance methods are widely used (often interchangeably) by prior studies to derive feature importance ranks from a defect classifier. However, different feature importance methods are likely to compute different feature importance ranks even for the same dataset and classifier. Hence such interchangeable use of feature importance methods can lead to conclusion instabilities unless there is a strong agreement among different methods. Therefore, in this paper, we evaluate the agreement between the feature importance ranks associated with the studied classifiers through a case study of 18 software projects and six commonly used classifiers. We find that: 1) The computed feature importance ranks by CA and CS methods do not always strongly agree with each other. 2) The computed feature importance ranks by the studied CA methods exhibit a strong agreement including the features reported at top-1 and top-3 ranks for a given dataset and classifier, while even the commonly used CS methods yield vastly different feature importance ranks. Such findings raise concerns about the stability of conclusions across replicated studies. We further observe that the commonly used defect datasets are rife with feature interactions and these feature interactions impact the computed feature importance ranks of the CS methods (not the CA methods). We demonstrate that removing these feature interactions, even with simple methods like CFS improves agreement between the computed feature importance ranks of CA and CS methods. In light of our findings, we provide guidelines for stakeholders and practitioners when performing model interpretation and directions for future research, e.g., future research is needed to investigate the impact of advanced feature interaction removal methods on computed feature importance ranks of different CS methods.  ( 2 min )
    SEED: Sound Event Early Detection via Evidential Uncertainty. (arXiv:2202.02441v1 [cs.SD])
    Sound Event Early Detection (SEED) is an essential task in recognizing the acoustic environments and soundscapes. However, most of the existing methods focus on the offline sound event detection, which suffers from the over-confidence issue of early-stage event detection and usually yield unreliable results. To solve the problem, we propose a novel Polyphonic Evidential Neural Network (PENet) to model the evidential uncertainty of the class probability with Beta distribution. Specifically, we use a Beta distribution to model the distribution of class probabilities, and the evidential uncertainty enriches uncertainty representation with evidence information, which plays a central role in reliable prediction. To further improve the event detection performance, we design the backtrack inference method that utilizes both the forward and backward audio features of an ongoing event. Experiments on the DESED database show that the proposed method can simultaneously improve 13.0\% and 3.8\% in time delay and detection F1 score compared to the state-of-the-art methods.  ( 2 min )
    Causal Disentanglement for Semantics-Aware Intent Learning in Recommendation. (arXiv:2202.02576v1 [cs.IR])
    Traditional recommendation models trained on observational interaction data have generated large impacts in a wide range of applications, it faces bias problems that cover users' true intent and thus deteriorate the recommendation effectiveness. Existing methods tracks this problem as eliminating bias for the robust recommendation, e.g., by re-weighting training samples or learning disentangled representation. The disentangled representation methods as the state-of-the-art eliminate bias through revealing cause-effect of the bias generation. However, how to design the semantics-aware and unbiased representation for users true intents is largely unexplored. To bridge the gap, we are the first to propose an unbiased and semantics-aware disentanglement learning called CaDSI (Causal Disentanglement for Semantics-Aware Intent Learning) from a causal perspective. Particularly, CaDSI explicitly models the causal relations underlying recommendation task, and thus produces semantics-aware representations via disentangling users true intents aware of specific item context. Moreover, the causal intervention mechanism is designed to eliminate confounding bias stemmed from context information, which further to align the semantics-aware representation with users true intent. Extensive experiments and case studies both validate the robustness and interpretability of our proposed model.  ( 2 min )
    How Effective is Incongruity? Implications for Code-mix Sarcasm Detection. (arXiv:2202.02702v1 [cs.CL])
    The presence of sarcasm in conversational systems and social media like chatbots, Facebook, Twitter, etc. poses several challenges for downstream NLP tasks. This is attributed to the fact that the intended meaning of a sarcastic text is contrary to what is expressed. Further, the use of code-mix language to express sarcasm is increasing day by day. Current NLP techniques for code-mix data have limited success due to the use of different lexicon, syntax, and scarcity of labeled corpora. To solve the joint problem of code-mixing and sarcasm detection, we propose the idea of capturing incongruity through sub-word level embeddings learned via fastText. Empirical results shows that our proposed model achieves F1-score on code-mix Hinglish dataset comparable to pretrained multilingual models while training 10x faster and using a lower memory footprint  ( 2 min )
    BAM: Bayes with Adaptive Memory. (arXiv:2202.02405v1 [cs.LG])
    Online learning via Bayes' theorem allows new data to be continuously integrated into an agent's current beliefs. However, a naive application of Bayesian methods in non-stationary environments leads to slow adaptation and results in state estimates that may converge confidently to the wrong parameter value. A common solution when learning in changing environments is to discard/downweight past data; however, this simple mechanism of "forgetting" fails to account for the fact that many real-world environments involve revisiting similar states. We propose a new framework, Bayes with Adaptive Memory (BAM), that takes advantage of past experience by allowing the agent to choose which past observations to remember and which to forget. We demonstrate that BAM generalizes many popular Bayesian update rules for non-stationary environments. Through a variety of experiments, we demonstrate the ability of BAM to continuously adapt in an ever-changing world.  ( 2 min )
    Pushing the Efficiency-Regret Pareto Frontier for Online Learning of Portfolios and Quantum States. (arXiv:2202.02765v1 [cs.LG])
    We revisit the classical online portfolio selection problem. It is widely assumed that a trade-off between computational complexity and regret is unavoidable, with Cover's Universal Portfolios algorithm, SOFT-BAYES and ADA-BARRONS currently constituting its state-of-the-art Pareto frontier. In this paper, we present the first efficient algorithm, BISONS, that obtains polylogarithmic regret with memory and per-step running time requirements that are polynomial in the dimension, displacing ADA-BARRONS from the Pareto frontier. Additionally, we resolve a COLT 2020 open problem by showing that a certain Follow-The-Regularized-Leader algorithm with log-barrier regularization suffers an exponentially larger dependence on the dimension than previously conjectured. Thus, we rule out this algorithm as a candidate for the Pareto frontier. We also extend our algorithm and analysis to a more general problem than online portfolio selection, viz. online learning of quantum states with log loss. This algorithm, called SCHRODINGER'S BISONS, is the first efficient algorithm with polylogarithmic regret for this more general problem.  ( 2 min )
    Neural Logic Analogy Learning. (arXiv:2202.02436v1 [cs.CL])
    Letter-string analogy is an important analogy learning task which seems to be easy for humans but very challenging for machines. The main idea behind current approaches to solving letter-string analogies is to design heuristic rules for extracting analogy structures and constructing analogy mappings. However, one key problem is that it is difficult to build a comprehensive and exhaustive set of analogy structures which can fully describe the subtlety of analogies. This problem makes current approaches unable to handle complicated letter-string analogy problems. In this paper, we propose Neural logic analogy learning (Noan), which is a dynamic neural architecture driven by differentiable logic reasoning to solve analogy problems. Each analogy problem is converted into logical expressions consisting of logical variables and basic logical operations (AND, OR, and NOT). More specifically, Noan learns the logical variables as vector embeddings and learns each logical operation as a neural module. In this way, the model builds computational graph integrating neural network with logical reasoning to capture the internal logical structure of the input letter strings. The analogy learning problem then becomes a True/False evaluation problem of the logical expressions. Experiments show that our machine learning-based Noan approach outperforms state-of-the-art approaches on standard letter-string analogy benchmark datasets.  ( 2 min )
    ASHA: Assistive Teleoperation via Human-in-the-Loop Reinforcement Learning. (arXiv:2202.02465v1 [cs.RO])
    Building assistive interfaces for controlling robots through arbitrary, high-dimensional, noisy inputs (e.g., webcam images of eye gaze) can be challenging, especially when it involves inferring the user's desired action in the absence of a natural 'default' interface. Reinforcement learning from online user feedback on the system's performance presents a natural solution to this problem, and enables the interface to adapt to individual users. However, this approach tends to require a large amount of human-in-the-loop training data, especially when feedback is sparse. We propose a hierarchical solution that learns efficiently from sparse user feedback: we use offline pre-training to acquire a latent embedding space of useful, high-level robot behaviors, which, in turn, enables the system to focus on using online user feedback to learn a mapping from user inputs to desired high-level behaviors. The key insight is that access to a pre-trained policy enables the system to learn more from sparse rewards than a na\"ive RL algorithm: using the pre-trained policy, the system can make use of successful task executions to relabel, in hindsight, what the user actually meant to do during unsuccessful executions. We evaluate our method primarily through a user study with 12 participants who perform tasks in three simulated robotic manipulation domains using a webcam and their eye gaze: flipping light switches, opening a shelf door to reach objects inside, and rotating a valve. The results show that our method successfully learns to map 128-dimensional gaze features to 7-dimensional joint torques from sparse rewards in under 10 minutes of online training, and seamlessly helps users who employ different gaze strategies, while adapting to distributional shift in webcam inputs, tasks, and environments.  ( 2 min )
    The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training. (arXiv:2202.02643v1 [cs.LG])
    Random pruning is arguably the most naive way to attain sparsity in neural networks, but has been deemed uncompetitive by either post-training pruning or sparse training. In this paper, we focus on sparse training and highlight a perhaps counter-intuitive finding, that random pruning at initialization can be quite powerful for the sparse training of modern neural networks. Without any delicate pruning criteria or carefully pursued sparsity structures, we empirically demonstrate that sparsely training a randomly pruned network from scratch can match the performance of its dense equivalent. There are two key factors that contribute to this revival: (i) the network sizes matter: as the original dense networks grow wider and deeper, the performance of training a randomly pruned sparse network will quickly grow to matching that of its dense equivalent, even at high sparsity ratios; (ii) appropriate layer-wise sparsity ratios can be pre-chosen for sparse training, which shows to be another important performance booster. Simple as it looks, a randomly pruned subnetwork of Wide ResNet-50 can be sparsely trained to outperforming a dense Wide ResNet-50, on ImageNet. We also observed such randomly pruned networks outperform dense counterparts in other favorable aspects, such as out-of-distribution detection, uncertainty estimation, and adversarial robustness. Overall, our results strongly suggest there is larger-than-expected room for sparse training at scale, and the benefits of sparsity might be more universal beyond carefully designed pruning. Our source code can be found at https://github.com/VITA-Group/Random_Pruning.  ( 2 min )
    Lightweight Compositional Embeddings for Incremental Streaming Recommendation. (arXiv:2202.02427v1 [cs.LG])
    Most work in graph-based recommender systems considers a {\em static} setting where all information about test nodes (i.e., users and items) is available upfront at training time. However, this static setting makes little sense for many real-world applications where data comes in continuously as a stream of new edges and nodes, and one has to update model predictions incrementally to reflect the latest state. To fully capitalize on the newly available data in the stream, recent graph-based recommendation models would need to be repeatedly retrained, which is infeasible in practice. In this paper, we study the graph-based streaming recommendation setting and propose a compositional recommendation model -- Lightweight Compositional Embedding (LCE) -- that supports incremental updates under low computational cost. Instead of learning explicit embeddings for the full set of nodes, LCE learns explicit embeddings for only a subset of nodes and represents the other nodes {\em implicitly}, through a composition function based on their interactions in the graph. This provides an effective, yet efficient, means to leverage streaming graph data when one node type (e.g., items) is more amenable to static representation. We conduct an extensive empirical study to compare LCE to a set of competitive baselines on three large-scale user-item recommendation datasets with interactions under a streaming setting. The results demonstrate the superior performance of LCE, showing that it achieves nearly skyline performance with significantly fewer parameters than alternative graph-based models.  ( 2 min )
    Doing Right by Not Doing Wrong in Human-Robot Collaboration. (arXiv:2202.02654v1 [cs.RO])
    As robotic systems become more and more capable of assisting humans in their everyday lives, we must consider the opportunities for these artificial agents to make their human collaborators feel unsafe or to treat them unfairly. Robots can exhibit antisocial behavior causing physical harm to people or reproduce unfair behavior replicating and even amplifying historical and societal biases which are detrimental to humans they interact with. In this paper, we discuss these issues considering sociable robotic manipulation and fair robotic decision making. We propose a novel approach to learning fair and sociable behavior, not by reproducing positive behavior, but rather by avoiding negative behavior. In this study, we highlight the importance of incorporating sociability in robot manipulation, as well as the need to consider fairness in human-robot interactions.  ( 2 min )
    Simulation-to-Reality domain adaptation for offline 3D object annotation on pointclouds with correlation alignment. (arXiv:2202.02666v1 [cs.CV])
    Annotating objects with 3D bounding boxes in LiDAR pointclouds is a costly human driven process in an autonomous driving perception system. In this paper, we present a method to semi-automatically annotate real-world pointclouds collected by deployment vehicles using simulated data. We train a 3D object detector model on labeled simulated data from CARLA jointly with real world pointclouds from our target vehicle. The supervised object detection loss is augmented with a CORAL loss term to reduce the distance between labeled simulated and unlabeled real pointcloud feature representations. The goal here is to learn representations that are invariant to simulated (labeled) and real-world (unlabeled) target domains. We also provide an updated survey on domain adaptation methods for pointclouds.  ( 2 min )
    Exemplar-Based Contrastive Self-Supervised Learning with Few-Shot Class Incremental Learning. (arXiv:2202.02601v1 [cs.LG])
    Humans are capable of learning new concepts from only a few (labeled) exemplars, incrementally and continually. This happens within the context that we can differentiate among the exemplars, and between the exemplars and large amounts of other data (unlabeled and labeled). This suggests, in human learning, supervised learning of concepts based on exemplars takes place within the larger context of contrastive self-supervised learning (CSSL) based on unlabeled and labeled data. We discuss extending CSSL (1) to be based mainly on exemplars and only secondly on data augmentation, and (2) to apply to both unlabeled data (a large amount is available in general) and labeled data (a few exemplars can be obtained with valuable supervised knowledge). A major benefit of the extensions is that exemplar-based CSSL, with supervised finetuning, supports few-shot class incremental learning (CIL). Specifically, we discuss exemplar-based CSSL including: nearest-neighbor CSSL, neighborhood CSSL with supervised pretraining, and exemplar CSSL with supervised finetuning. We further discuss using exemplar-based CSSL to facilitate few-shot learning and, in particular, few-shot CIL.  ( 2 min )
    The Implicit Bias of Gradient Descent on Generalized Gated Linear Networks. (arXiv:2202.02649v1 [stat.ML])
    Understanding the asymptotic behavior of gradient-descent training of deep neural networks is essential for revealing inductive biases and improving network performance. We derive the infinite-time training limit of a mathematically tractable class of deep nonlinear neural networks, gated linear networks (GLNs), and generalize these results to gated networks described by general homogeneous polynomials. We study the implications of our results, focusing first on two-layer GLNs. We then apply our theoretical predictions to GLNs trained on MNIST and show how architectural constraints and the implicit bias of gradient descent affect performance. Finally, we show that our theory captures a substantial portion of the inductive bias of ReLU networks. By making the inductive bias explicit, our framework is poised to inform the development of more efficient, biologically plausible, and robust learning algorithms.  ( 2 min )
    Weisfeiler-Lehman meets Gromov-Wasserstein. (arXiv:2202.02495v1 [cs.LG])
    The Weisfeiler-Lehman (WL) test is a classical procedure for graph isomorphism testing. The WL test has also been widely used both for designing graph kernels and for analyzing graph neural networks. In this paper, we propose the Weisfeiler-Lehman (WL) distance, a notion of distance between labeled measure Markov chains (LMMCs), of which labeled graphs are special cases. The WL distance is polynomial time computable and is also compatible with the WL test in the sense that the former is positive if and only if the WL test can distinguish the two involved graphs. The WL distance captures and compares subtle structures of the underlying LMMCs and, as a consequence of this, it is more discriminating than the distance between graphs used for defining the state-of-the-art Wasserstein Weisfeiler-Lehman graph kernel. Inspired by the structure of the WL distance we identify a neural network architecture on LMMCs which turns out to be universal w.r.t. continuous functions defined on the space of all LMMCs (which includes all graphs) endowed with the WL distance. Finally, the WL distance turns out to be stable w.r.t. a natural variant of the Gromov-Wasserstein (GW) distance for comparing metric Markov chains that we identify. Hence, the WL distance can also be construed as a polynomial time lower bound for the GW distance which is in general NP-hard to compute.  ( 2 min )
    Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation. (arXiv:2202.02628v1 [cs.LG])
    Data poisoning attacks aim at manipulating model behaviors through distorting training data. Previously, an aggregation-based certified defense, Deep Partition Aggregation (DPA), was proposed to mitigate this threat. DPA predicts through an aggregation of base classifiers trained on disjoint subsets of data, thus restricting its sensitivity to dataset distortions. In this work, we propose an improved certified defense against general poisoning attacks, namely Finite Aggregation. In contrast to DPA, which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets and then combines duplicates of them to build larger (but not disjoint) subsets for training base classifiers. This reduces the worst-case impacts of poison samples and thus improves certified robustness bounds. In addition, we offer an alternative view of our method, bridging the designs of deterministic and stochastic aggregation-based certified defenses. Empirically, our proposed Finite Aggregation consistently improves certificates on MNIST, CIFAR-10, and GTSRB, boosting certified fractions by up to 3.05%, 3.87% and 4.77%, respectively, while keeping the same clean accuracies as DPA's, effectively establishing a new state of the art in (pointwise) certified robustness against data poisoning.  ( 2 min )
    Linear Model with Local Differential Privacy. (arXiv:2202.02448v1 [cs.CR])
    Scientific collaborations benefit from collaborative learning of distributed sources, but remain difficult to achieve when data are sensitive. In recent years, privacy preserving techniques have been widely studied to analyze distributed data across different agencies while protecting sensitive information. Secure multiparty computation has been widely studied for privacy protection with high privacy level but intense computation cost. There are also other security techniques sacrificing partial data utility to reduce disclosure risk. A major challenge is to balance data utility and disclosure risk while maintaining high computation efficiency. In this paper, matrix masking technique is applied to encrypt data such that the secure schemes are against malicious adversaries while achieving local differential privacy. The proposed schemes are designed for linear models and can be implemented for both vertical and horizontal partitioning scenarios. Moreover, cross validation is studied to prevent overfitting and select optimal parameters without additional communication cost. Simulation results present the efficiency of proposed schemes to analyze dataset with millions of records and high-dimensional data (n << p).  ( 2 min )
    Memory Defense: More Robust Classification via a Memory-Masking Autoencoder. (arXiv:2202.02595v1 [cs.CV])
    Many deep neural networks are susceptible to minute perturbations of images that have been carefully crafted to cause misclassification. Ideally, a robust classifier would be immune to small variations in input images, and a number of defensive approaches have been created as a result. One method would be to discern a latent representation which could ignore small changes to the input. However, typical autoencoders easily mingle inter-class latent representations when there are strong similarities between classes, making it harder for a decoder to accurately project the image back to the original high-dimensional space. We propose a novel framework, Memory Defense, an augmented classifier with a memory-masking autoencoder to counter this challenge. By masking other classes, the autoencoder learns class-specific independent latent representations. We test the model's robustness against four widely used attacks. Experiments on the Fashion-MNIST & CIFAR-10 datasets demonstrate the superiority of our model. We make available our source code at GitHub repository: https://github.com/eashanadhikarla/MemDefense  ( 2 min )
    A note on the complex and bicomplex valued neural networks. (arXiv:2202.02354v1 [cs.LG])
    In this paper we first write a proof of the perceptron convergence algorithm for the complex multivalued neural networks (CMVNNs). Our primary goal is to formulate and prove the perceptron convergence algorithm for the bicomplex multivalued neural networks (BMVNNs) and other important results in the theory of neural networks based on a bicomplex algebra.  ( 2 min )
    Privacy-preserving Speech Emotion Recognition through Semi-Supervised Federated Learning. (arXiv:2202.02611v1 [cs.LG])
    Speech Emotion Recognition (SER) refers to the recognition of human emotions from natural speech. If done accurately, it can offer a number of benefits in building human-centered context-aware intelligent systems. Existing SER approaches are largely centralized, without considering users' privacy. Federated Learning (FL) is a distributed machine learning paradigm dealing with decentralization of privacy-sensitive personal data. In this paper, we present a privacy-preserving and data-efficient SER approach by utilizing the concept of FL. To the best of our knowledge, this is the first federated SER approach, which utilizes self-training learning in conjunction with federated learning to exploit both labeled and unlabeled on-device data. Our experimental evaluations on the IEMOCAP dataset shows that our federated approach can learn generalizable SER models even under low availability of data labels and highly non-i.i.d. distributions. We show that our approach with as few as 10% labeled data, on average, can improve the recognition rate by 8.67% compared to the fully-supervised federated counterparts.  ( 2 min )
    MarkovGNN: Graph Neural Networks on Markov Diffusion. (arXiv:2202.02470v1 [cs.LG])
    Most real-world networks contain well-defined community structures where nodes are densely connected internally within communities. To learn from these networks, we develop MarkovGNN that captures the formation and evolution of communities directly in different convolutional layers. Unlike most Graph Neural Networks (GNNs) that consider a static graph at every layer, MarkovGNN generates different stochastic matrices using a Markov process and then uses these community-capturing matrices in different layers. MarkovGNN is a general approach that could be used with most existing GNNs. We experimentally show that MarkovGNN outperforms other GNNs for clustering, node classification, and visualization tasks. The source code of MarkovGNN is publicly available at \url{https://github.com/HipGraph/MarkovGNN}.  ( 2 min )
    GraphEye: A Novel Solution for Detecting Vulnerable Functions Based on Graph Attention Network. (arXiv:2202.02501v1 [cs.CR])
    With the continuous extension of the Industrial Internet, cyber incidents caused by software vulnerabilities have been increasing in recent years. However, software vulnerabilities detection is still heavily relying on code review done by experts, and how to automatedly detect software vulnerabilities is an open problem so far. In this paper, we propose a novel solution named GraphEye to identify whether a function of C/C++ code has vulnerabilities, which can greatly alleviate the burden of code auditors. GraphEye is originated from the observation that the code property graph of a non-vulnerable function naturally differs from the code property graph of a vulnerable function with the same functionality. Hence, detecting vulnerable functions is attributed to the graph classification problem.GraphEye is comprised of VecCPG and GcGAT. VecCPG is a vectorization for the code property graph, which is proposed to characterize the key syntax and semantic features of the corresponding source code. GcGAT is a deep learning model based on the graph attention graph, which is proposed to solve the graph classification problem according to VecCPG. Finally, GraphEye is verified by the SARD Stack-based Buffer Overflow, Divide-Zero, Null Pointer Deference, Buffer Error, and Resource Error datasets, the corresponding F1 scores are 95.6%, 95.6%,96.1%,92.6%, and 96.1% respectively, which validate the effectiveness of the proposed solution.  ( 2 min )
    A Graph Neural Network Framework for Grid-Based Simulation. (arXiv:2202.02652v1 [cs.LG])
    Reservoir simulations are computationally expensive in the well control and well placement optimization. Generally, numerous simulation runs (realizations) are needed in order to achieve the optimal well locations. In this paper, we propose a graph neural network (GNN) framework to build a surrogate feed-forward model which replaces simulation runs to accelerate the optimization process. Our GNN framework includes an encoder, a process, and a decoder which takes input from the processed graph data designed and generated from the simulation raw data. We train the GNN model with 6000 samples (equivalent to 40 well configurations) with each containing the previous step state variable and the next step state variable. We test the GNN model with another 6000 samples and after model tuning, both one-step prediction and rollout prediction achieve a close match with the simulation results. Our GNN framework shows great potential in the application of well-related subsurface optimization including oil and gas as well as carbon capture sequestration (CCS).  ( 2 min )
    Robust Anomaly Detection for Time-series Data. (arXiv:2202.02721v1 [cs.LG])
    Time-series anomaly detection plays a vital role in monitoring complex operation conditions. However, the detection accuracy of existing approaches is heavily influenced by pattern distribution, existence of multiple normal patterns, dynamical features representation, and parameter settings. For the purpose of improving the robustness and guaranteeing the accuracy, this research combined the strengths of negative selection, unthresholded recurrence plots, and an extreme learning machine autoencoder and then proposed robust anomaly detection for time-series data (RADTD), which can automatically learn dynamical features in time series and recognize anomalies with low label dependency and high robustness. Yahoo benchmark datasets and three tunneling engineering simulation experiments were used to evaluate the performance of RADTD. The experiments showed that in benchmark datasets RADTD possessed higher accuracy and robustness than recurrence qualification analysis and extreme learning machine autoencoder, respectively, and that RADTD accurately detected the occurrence of tunneling settlement accidents, indicating its remarkable performance in accuracy and robustness.  ( 2 min )
    SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching. (arXiv:2202.02433v1 [cs.LG])
    We propose State Matching Offline DIstribution Correction Estimation (SMODICE), a novel and versatile algorithm for offline imitation learning (IL) via state-occupancy matching. We show that the SMODICE objective admits a simple optimization procedure through an application of Fenchel duality and an analytic solution in tabular MDPs. Without requiring access to expert actions, SMODICE can be effectively applied to three offline IL settings: (i) imitation from observations (IfO), (ii) IfO with dynamics or morphologically mismatched expert, and (iii) example-based reinforcement learning, which we show can be formulated as a state-occupancy matching problem. We extensively evaluate SMODICE on both gridworld environments as well as on high-dimensional offline benchmarks. Our results demonstrate that SMODICE is effective for all three problem settings and significantly outperforms prior state-of-art.  ( 2 min )
    Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations. (arXiv:2202.02514v1 [cs.LG])
    Generating graph-structured data requires learning the underlying distribution of graphs. Yet, this is a challenging problem, and the previous graph generative methods either fail to capture the permutation-invariance property of graphs or cannot sufficiently model the complex dependency between nodes and edges, which is crucial for generating real-world graphs such as molecules. To overcome such limitations, we propose a novel score-based generative model for graphs with a continuous-time framework. Specifically, we propose a new graph diffusion process that models the joint distribution of the nodes and edges through a system of stochastic differential equations (SDEs). Then, we derive novel score matching objectives tailored for the proposed diffusion process to estimate the gradient of the joint log-density with respect to each component, and introduce a new solver for the system of SDEs to efficiently sample from the reverse diffusion process. We validate our graph generation method on diverse datasets, on which it either achieves significantly superior or competitive performance to the baselines. Further analysis shows that our method is able to generate molecules that lie close to the training distribution yet do not violate the chemical valency rule, demonstrating the effectiveness of the system of SDEs in modeling the node-edge relationships.  ( 2 min )
    Transfer Reinforcement Learning for Differing Action Spaces via Q-Network Representations. (arXiv:2202.02442v1 [cs.LG])
    Transfer learning approaches in reinforcement learning aim to assist agents in learning their target domains by leveraging the knowledge learned from other agents that have been trained on similar source domains. For example, recent research focus within this space has been placed on knowledge transfer between tasks that have different transition dynamics and reward functions; however, little focus has been placed on knowledge transfer between tasks that have different action spaces. In this paper, we approach the task of transfer learning between domains that differ in action spaces. We present a reward shaping method based on source embedding similarity that is applicable to domains with both discrete and continuous action spaces. The efficacy of our approach is evaluated on transfer to restricted action spaces in the Acrobot-v1 and Pendulum-v0 domains (Brockman et al. 2016). A comparison with two baselines shows that our method does not outperform these baselines in these continuous action spaces but does show an improvement in these discrete action spaces. We conclude our analysis with future directions for this work.  ( 2 min )
    LotRec: A Recommender for Urban Vacant Lot Conversion. (arXiv:2202.02481v1 [cs.CY])
    Vacant lots are neglected properties in a city that lead to environmental hazards and poor standard of living for the community. Thus, reclaiming vacant lots and putting them to productive use is an important consideration for many cities. Given a large number of vacant lots and resource constraints for conversion, two key questions for a city are (1) whether to convert a vacant lot or not; and (2) what to convert a vacant lot as. We seek to provide computational support to answer these questions. To this end, we identify the determinants of a vacant lot conversion and build a recommender based on those determinants. We evaluate our models on real-world vacant lot datasets from the US cities of Philadelphia,PA and Baltimore, MD. Our results indicate that our recommender yields mean F-measures of (1) 90% in predicting whether a vacant lot should be converted or not within a single city, (2) 91% in predicting what a vacant lot should be converted to, within a single city and, (3) 85% in predicting whether a vacant lot should be converted or not across two cities.  ( 2 min )
    Marius++: Large-Scale Training of Graph Neural Networks on a Single Machine. (arXiv:2202.02365v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as a powerful model for ML over graph-structured data. Yet, scalability remains a major challenge for using GNNs over billion-edge inputs. The creation of mini-batches used for training incurs computational and data movement costs that grow exponentially with the number of GNN layers as state-of-the-art models aggregate information from the multi-hop neighborhood of each input node. In this paper, we focus on scalable training of GNNs with emphasis on resource efficiency. We show that out-of-core pipelined mini-batch training in a single machine outperforms resource-hungry multi-GPU solutions. We introduce Marius++, a system for training GNNs over billion-scale graphs. Marius++ provides disk-optimized training for GNNs and introduces a series of data organization and algorithmic contributions that 1) minimize the memory-footprint and end-to-end time required for training and 2) ensure that models learned with disk-based training exhibit accuracy similar to those fully trained in mixed CPU/GPU settings. We evaluate Marius++ against PyTorch Geometric and Deep Graph Library using seven benchmark (model, data set) settings and find that Marius++ with one GPU can achieve the same level of model accuracy up to 8$\times$ faster than these systems when they are using up to eight GPUs. For these experiments, disk-based training allows Marius++ deployments to be up to 64$\times$ cheaper in monetary cost than those of the competing systems.  ( 2 min )
    Towards Training Reproducible Deep Learning Models. (arXiv:2202.02326v1 [cs.LG])
    Reproducibility is an increasing concern in Artificial Intelligence (AI), particularly in the area of Deep Learning (DL). Being able to reproduce DL models is crucial for AI-based systems, as it is closely tied to various tasks like training, testing, debugging, and auditing. However, DL models are challenging to be reproduced due to issues like randomness in the software (e.g., DL algorithms) and non-determinism in the hardware (e.g., GPU). There are various practices to mitigate some of the aforementioned issues. However, many of them are either too intrusive or can only work for a specific usage context. In this paper, we propose a systematic approach to training reproducible DL models. Our approach includes three main parts: (1) a set of general criteria to thoroughly evaluate the reproducibility of DL models for two different domains, (2) a unified framework which leverages a record-and-replay technique to mitigate software-related randomness and a profile-and-patch technique to control hardware-related non-determinism, and (3) a reproducibility guideline which explains the rationales and the mitigation strategies on conducting a reproducible training process for DL models. Case study results show our approach can successfully reproduce six open source and one commercial DL models.  ( 2 min )
    Verifying Inverse Model Neural Networks. (arXiv:2202.02429v1 [cs.LG])
    Inverse problems exist in a wide variety of physical domains from aerospace engineering to medical imaging. The goal is to infer the underlying state from a set of observations. When the forward model that produced the observations is nonlinear and stochastic, solving the inverse problem is very challenging. Neural networks are an appealing solution for solving inverse problems as they can be trained from noisy data and once trained are computationally efficient to run. However, inverse model neural networks do not have guarantees of correctness built-in, which makes them unreliable for use in safety and accuracy-critical contexts. In this work we introduce a method for verifying the correctness of inverse model neural networks. Our approach is to overapproximate a nonlinear, stochastic forward model with piecewise linear constraints and encode both the overapproximate forward model and the neural network inverse model as a mixed-integer program. We demonstrate this verification procedure on a real-world airplane fuel gauge case study. The ability to verify and consequently trust inverse model neural networks allows their use in a wide variety of contexts, from aerospace to medicine.  ( 2 min )
    Deep Dynamic Effective Connectivity Estimation from Multivariate Time Series. (arXiv:2202.02393v1 [cs.LG])
    Recently, methods that represent data as a graph, such as graph neural networks (GNNs) have been successfully used to learn data representations and structures to solve classification and link prediction problems. The applications of such methods are vast and diverse, but most of the current work relies on the assumption of a static graph. This assumption does not hold for many highly dynamic systems, where the underlying connectivity structure is non-stationary and is mostly unobserved. Using a static model in these situations may result in sub-optimal performance. In contrast, modeling changes in graph structure with time can provide information about the system whose applications go beyond classification. Most work of this type does not learn effective connectivity and focuses on cross-correlation between nodes to generate undirected graphs. An undirected graph is unable to capture direction of an interaction which is vital in many fields, including neuroscience. To bridge this gap, we developed dynamic effective connectivity estimation via neural network training (DECENNT), a novel model to learn an interpretable directed and dynamic graph induced by the downstream classification/prediction task. DECENNT outperforms state-of-the-art (SOTA) methods on five different tasks and infers interpretable task-specific dynamic graphs. The dynamic graphs inferred from functional neuroimaging data align well with the existing literature and provide additional information. Additionally, the temporal attention module of DECENNT identifies time-intervals crucial for predictive downstream task from multivariate time series data.  ( 2 min )
    Emblaze: Illuminating Machine Learning Representations through Interactive Comparison of Embedding Spaces. (arXiv:2202.02641v1 [cs.HC])
    Modern machine learning techniques commonly rely on complex, high-dimensional embedding representations to capture underlying structure in the data and improve performance. In order to characterize model flaws and choose a desirable representation, model builders often need to compare across multiple embedding spaces, a challenging analytical task supported by few existing tools. We first interviewed nine embedding experts in a variety of fields to characterize the diverse challenges they face and techniques they use when analyzing embedding spaces. Informed by these perspectives, we developed a novel system called Emblaze that integrates embedding space comparison within a computational notebook environment. Emblaze uses an animated, interactive scatter plot with a novel Star Trail augmentation to enable visual comparison. It also employs novel neighborhood analysis and clustering procedures to dynamically suggest groups of points with interesting changes between spaces. Through a series of case studies with ML experts, we demonstrate how interactive comparison with Emblaze can help gain new insights into embedding space structure.  ( 2 min )
    Improved Information Theoretic Generalization Bounds for Distributed and Federated Learning. (arXiv:2202.02423v1 [cs.IT])
    We consider information-theoretic bounds on expected generalization error for statistical learning problems in a networked setting. In this setting, there are $K$ nodes, each with its own independent dataset, and the models from each node have to be aggregated into a final centralized model. We consider both simple averaging of the models as well as more complicated multi-round algorithms. We give upper bounds on the expected generalization error for a variety of problems, such as those with Bregman divergence or Lipschitz continuous losses, that demonstrate an improved dependence of $1/K$ on the number of nodes. These "per node" bounds are in terms of the mutual information between the training dataset and the trained weights at each node, and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node.  ( 2 min )
    Energy-Aware Edge Association for Cluster-based Personalized Federated Learning. (arXiv:2202.02727v1 [cs.LG])
    Federated Learning (FL) over wireless network enables data-conscious services by leveraging the ubiquitous intelligence at network edge for privacy-preserving model training. As the proliferation of context-aware services, the diversified personal preferences causes disagreeing conditional distributions among user data, which leads to poor inference performance. In this sense, clustered federated learning is proposed to group user devices with similar preference and provide each cluster with a personalized model. This calls for innovative design in edge association that involves user clustering and also resource management optimization. We formulate an accuracy-cost trade-off optimization problem by jointly considering model accuracy, communication resource allocation and energy consumption. To comply with parameter encryption techniques in FL, we propose an iterative solution procedure which employs deep reinforcement learning based approach at cloud server for edge association. The reward function consists of minimized energy consumption at each base station and the averaged model accuracy of all users. Under our proposed solution, multiple edge base station are fully exploited to realize cost efficient personalized federated learning without any prior knowledge on model parameters. Simulation results show that our proposed strategy outperforms existing strategies in achieving accurate learning at low energy cost.  ( 2 min )
    An Experimental Design Approach for Regret Minimization in Logistic Bandits. (arXiv:2202.02407v1 [stat.ML])
    In this work we consider the problem of regret minimization for logistic bandits. The main challenge of logistic bandits is reducing the dependence on a potentially large problem dependent constant $\kappa$ that can at worst scale exponentially with the norm of the unknown parameter $\theta_{\ast}$. Abeille et al. (2021) have applied self-concordance of the logistic function to remove this worst-case dependence providing regret guarantees like $O(d\log^2(\kappa)\sqrt{\dot\mu T}\log(|\mathcal{X}|))$ where $d$ is the dimensionality, $T$ is the time horizon, and $\dot\mu$ is the variance of the best-arm. This work improves upon this bound in the fixed arm setting by employing an experimental design procedure that achieves a minimax regret of $O(\sqrt{d \dot\mu T\log(|\mathcal{X}|)})$. Our regret bound in fact takes a tighter instance (i.e., gap) dependent regret bound for the first time in logistic bandits. We also propose a new warmup sampling algorithm that can dramatically reduce the lower order term in the regret in general and prove that it can replace the lower order term dependency on $\kappa$ to $\log^2(\kappa)$ for some instances. Finally, we discuss the impact of the bias of the MLE on the logistic bandit problem, providing an example where $d^2$ lower order regret (cf., it is $d$ for linear bandits) may not be improved as long as the MLE is used and how bias-corrected estimators may be used to make it closer to $d$.  ( 2 min )
    Distributed Learning With Sparsified Gradient Differences. (arXiv:2202.02491v1 [cs.LG])
    A very large number of communications are typically required to solve distributed learning tasks, and this critically limits scalability and convergence speed in wireless communications applications. In this paper, we devise a Gradient Descent method with Sparsification and Error Correction (GD-SEC) to improve the communications efficiency in a general worker-server architecture. Motivated by a variety of wireless communications learning scenarios, GD-SEC reduces the number of bits per communication from worker to server with no degradation in the order of the convergence rate. This enables larger-scale model learning without sacrificing convergence or accuracy. At each iteration of GD-SEC, instead of directly transmitting the entire gradient vector, each worker computes the difference between its current gradient and a linear combination of its previously transmitted gradients, and then transmits the sparsified gradient difference to the server. A key feature of GD-SEC is that any given component of the gradient difference vector will not be transmitted if its magnitude is not sufficiently large. An error correction technique is used at each worker to compensate for the error resulting from sparsification. We prove that GD-SEC is guaranteed to converge for strongly convex, convex, and nonconvex optimization problems with the same order of convergence rate as GD. Furthermore, if the objective function is strongly convex, GD-SEC has a fast linear convergence rate. Numerical results not only validate the convergence rate of GD-SEC but also explore the communication bit savings it provides. Given a target accuracy, GD-SEC can significantly reduce the communications load compared to the best existing algorithms without slowing down the optimization process.  ( 2 min )
    Graph Neural Network with Curriculum Learning for Imbalanced Node Classification. (arXiv:2202.02529v1 [cs.LG])
    Graph Neural Network (GNN) is an emerging technique for graph-based learning tasks such as node classification. In this work, we reveal the vulnerability of GNN to the imbalance of node labels. Traditional solutions for imbalanced classification (e.g. resampling) are ineffective in node classification without considering the graph structure. Worse still, they may even bring overfitting or underfitting results due to lack of sufficient prior knowledge. To solve these problems, we propose a novel graph neural network framework with curriculum learning (GNN-CL) consisting of two modules. For one thing, we hope to acquire certain reliable interpolation nodes and edges through the novel graph-based oversampling based on smoothness and homophily. For another, we combine graph classification loss and metric learning loss which adjust the distance between different nodes associated with minority class in feature space. Inspired by curriculum learning, we dynamically adjust the weights of different modules during training process to achieve better ability of generalization and discrimination. The proposed framework is evaluated via several widely used graph datasets, showing that our proposed model consistently outperforms the existing state-of-the-art methods.  ( 2 min )
    No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models. (arXiv:2202.02664v1 [cs.CL])
    Recent research has shown the existence of significant redundancy in large Transformer models. One can prune the redundant parameters without significantly sacrificing the generalization performance. However, we question whether the redundant parameters could have contributed more if they were properly trained. To answer this question, we propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure reflecting this parameter's contribution to the model performance. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent further overfitting. We conduct extensive experiments on natural language understanding, neural machine translation, and image classification to demonstrate the effectiveness of the proposed schedule. Analysis shows that the proposed schedule indeed reduces the redundancy and improves generalization performance.  ( 2 min )
    One-Nearest-Neighbor Search is All You Need for Minimax Optimal Regression and Classification. (arXiv:2202.02464v1 [math.ST])
    Recently, Qiao, Duan, and Cheng~(2019) proposed a distributed nearest-neighbor classification method, in which a massive dataset is split into smaller groups, each processed with a $k$-nearest-neighbor classifier, and the final class label is predicted by a majority vote among these groupwise class labels. This paper shows that the distributed algorithm with $k=1$ over a sufficiently large number of groups attains a minimax optimal error rate up to a multiplicative logarithmic factor under some regularity conditions, for both regression and classification problems. Roughly speaking, distributed 1-nearest-neighbor rules with $M$ groups has a performance comparable to standard $\Theta(M)$-nearest-neighbor rules. In the analysis, alternative rules with a refined aggregation method are proposed and shown to attain exact minimax optimal rates.  ( 2 min )
    Adversarial Detector with Robust Classifier. (arXiv:2202.02503v1 [cs.CV])
    Deep neural network (DNN) models are wellknown to easily misclassify prediction results by using input images with small perturbations, called adversarial examples. In this paper, we propose a novel adversarial detector, which consists of a robust classifier and a plain one, to highly detect adversarial examples. The proposed adversarial detector is carried out in accordance with the logits of plain and robust classifiers. In an experiment, the proposed detector is demonstrated to outperform a state-of-the-art detector without any robust classifier.  ( 2 min )
    Discovering Distribution Shifts using Latent Space Representations. (arXiv:2202.02339v1 [cs.LG])
    Rapid progress in representation learning has led to a proliferation of embedding models, and to associated challenges of model selection and practical application. It is non-trivial to assess a model's generalizability to new, candidate datasets and failure to generalize may lead to poor performance on downstream tasks. Distribution shifts are one cause of reduced generalizability, and are often difficult to detect in practice. In this paper, we use the embedding space geometry to propose a non-parametric framework for detecting distribution shifts, and specify two tests. The first test detects shifts by establishing a robustness boundary, determined by an intelligible performance criterion, for comparing reference and candidate datasets. The second test detects shifts by featurizing and classifying multiple subsamples of two datasets as in-distribution and out-of-distribution. In evaluation, both tests detect model-impacting distribution shifts, in various shift scenarios, for both synthetic and real-world datasets.  ( 2 min )
    Few-shot Learning as Cluster-induced Voronoi Diagrams: A Geometric Approach. (arXiv:2202.02471v1 [cs.LG])
    Few-shot learning (FSL) is the process of rapid generalization from abundant base samples to inadequate novel samples. Despite extensive research in recent years, FSL is still not yet able to generate satisfactory solutions for a wide range of real-world applications. To confront this challenge, we study the FSL problem from a geometric point of view in this paper. One observation is that the widely embraced ProtoNet model is essentially a Voronoi Diagram (VD) in the feature space. We retrofit it by making use of a recent advance in computational geometry called Cluster-induced Voronoi Diagram (CIVD). Starting from the simplest nearest neighbor model, CIVD gradually incorporates cluster-to-point and then cluster-to-cluster relationships for space subdivision, which is used to improve the accuracy and robustness at multiple stages of FSL. Specifically, we use CIVD (1) to integrate parametric and nonparametric few-shot classifiers; (2) to combine feature representation and surrogate representation; (3) and to leverage feature-level, transformation-level, and geometry-level heterogeneities for a better ensemble. Our CIVD-based workflow enables us to achieve new state-of-the-art results on mini-ImageNet, CUB, and tiered-ImagenNet datasets, with ${\sim}2\%{-}5\%$ improvements upon the next best. To summarize, CIVD provides a mathematically elegant and geometrically interpretable framework that compensates for extreme data insufficiency, prevents overfitting, and allows for fast geometric ensemble for thousands of individual VD. These together make FSL stronger.  ( 2 min )
    Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation. (arXiv:2202.02440v1 [cs.CV])
    In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated whenever there is a change in the task type or the goal modality. We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks (e.g., ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch, audio, label). Furthermore, our model enables zero-shot experience learning, whereby it can solve the target tasks without receiving any task-specific interactive training. Our experiments on multiple photorealistic datasets and challenging tasks show that our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.  ( 2 min )
    Frequency comb and machine learning-based breath analysis for COVID-19 classification. (arXiv:2202.02321v1 [physics.med-ph])
    Human breath contains hundreds of volatile molecules that can provide powerful, non-intrusive spectral diagnosis of a diverse set of diseases and physiological/metabolic states. To unleash this tremendous potential for medical science, we present a robust analytical method that simultaneously measures tens of thousands of spectral features in each breath sample, followed by efficient and detail-specific multivariate data analysis for unambiguous binary medical response classification. We combine mid-infrared cavity-enhanced direct frequency comb spectroscopy (CE-DFCS), capable of real-time collection of tens of thousands of distinct molecular features at parts-per-trillion sensitivity, with supervised machine learning, capable of analysis and verification of extremely high-dimensional input data channels. Here, we present the first application of this method to the breath detection of Coronavirus Disease 2019 (COVID-19). Using 170 individual samples at the University of Colorado, we report a cross-validated area under the Receiver-Operating-Characteristics curve of 0.849(4), providing excellent prediction performance. Further, this method detected a significant difference between male and female breath as well as other variables such as smoking and abdominal pain. Together, these highlight the utility of CE-DFCS for rapid, non-invasive detection of diverse biological conditions and disease states. The unique properties of frequency comb spectroscopy thus help establish precise digital spectral fingerprints for building accurate databases and provide means for simultaneous multi-response classifications. The predictive power can be further enhanced with readily scalable comb spectral coverage.  ( 2 min )
    Self-Adaptive Forecasting for Improved Deep Learning on Non-Stationary Time-Series. (arXiv:2202.02403v1 [cs.LG])
    Real-world time-series datasets often violate the assumptions of standard supervised learning for forecasting -- their distributions evolve over time, rendering the conventional training and model selection procedures suboptimal. In this paper, we propose a novel method, Self-Adaptive Forecasting (SAF), to modify the training of time-series forecasting models to improve their performance on forecasting tasks with such non-stationary time-series data. SAF integrates a self-adaptation stage prior to forecasting based on `backcasting', i.e. predicting masked inputs backward in time. This is a form of test-time training that creates a self-supervised learning problem on test samples before performing the prediction task. In this way, our method enables efficient adaptation of encoded representations to evolving distributions, leading to superior generalization. SAF can be integrated with any canonical encoder-decoder based time-series architecture such as recurrent neural networks or attention-based architectures. On synthetic and real-world datasets in domains where time-series data are known to be notoriously non-stationary, such as healthcare and finance, we demonstrate a significant benefit of SAF in improving forecasting accuracy.  ( 2 min )
    Selective Network Linearization for Efficient Private Inference. (arXiv:2202.02340v1 [cs.CR])
    Private inference (PI) enables inference directly on cryptographically secure data. While promising to address many privacy issues, it has seen limited use due to extreme runtimes. Unlike plaintext inference, where latency is dominated by FLOPs, in PI non-linear functions (namely ReLU) are the bottleneck. Thus, practical PI demands novel ReLU-aware optimizations. To reduce PI latency we propose a gradient-based algorithm that selectively linearizes ReLUs while maintaining prediction accuracy. We evaluate our algorithm on several standard PI benchmarks. The results demonstrate up to $4.25\%$ more accuracy (iso-ReLU count at 50K) or $2.2\times$ less latency (iso-accuracy at 70\%) than the current state of the art and advance the Pareto frontier across the latency-accuracy space. To complement empirical results, we present a "no free lunch" theorem that sheds light on how and when network linearization is possible while maintaining prediction accuracy.  ( 2 min )
    OMLT: Optimization & Machine Learning Toolkit. (arXiv:2202.02414v1 [stat.ML])
    The optimization and machine learning toolkit (OMLT) is an open-source software package incorporating neural network and gradient-boosted tree surrogate models, which have been trained using machine learning, into larger optimization problems. We discuss the advances in optimization technology that made OMLT possible and show how OMLT seamlessly integrates with the algebraic modeling language Pyomo. We demonstrate how to use OMLT for solving decision-making problems in both computer science and engineering.  ( 2 min )
    A Survey on Poisoning Attacks Against Supervised Machine Learning. (arXiv:2202.02510v1 [cs.CR])
    With the rise of artificial intelligence and machine learning in modern computing, one of the major concerns regarding such techniques is to provide privacy and security against adversaries. We present this survey paper to cover the most representative papers in poisoning attacks against supervised machine learning models. We first provide a taxonomy to categorize existing studies and then present detailed summaries for selected papers. We summarize and compare the methodology and limitations of existing literature. We conclude this paper with potential improvements and future directions to further exploit and prevent poisoning attacks on supervised models. We propose several unanswered research questions to encourage and inspire researchers for future work.  ( 2 min )
  • Open

    Learning a Single Neuron with Bias Using Gradient Descent. (arXiv:2106.01101v2 [cs.LG] UPDATED)
    We theoretically study the fundamental problem of learning a single neuron with a bias term ($\mathbf{x} \mapsto \sigma( + b)$) in the realizable setting with the ReLU activation, using gradient descent. Perhaps surprisingly, we show that this is a significantly different and more challenging problem than the bias-less case (which was the focus of previous works on single neurons), both in terms of the optimization geometry as well as the ability of gradient methods to succeed in some scenarios. We provide a detailed study of this problem, characterizing the critical points of the objective, demonstrating failure cases, and providing positive convergence guarantees under different sets of assumptions. To prove our results, we develop some tools which may be of independent interest, and improve previous results on learning single neurons.  ( 2 min )
    Spectral Bias and Task-Model Alignment Explain Generalization in Kernel Regression and Infinitely Wide Neural Networks. (arXiv:2006.13198v6 [stat.ML] UPDATED)
    Generalization beyond a training dataset is a main goal of machine learning, but theoretical understanding of generalization remains an open problem for many models. The need for a new theory is exacerbated by recent observations in deep neural networks where overparameterization leads to better performance, contradicting the conventional wisdom from classical statistics. In this paper, we investigate generalization error for kernel regression, which, besides being a popular machine learning method, also includes infinitely overparameterized neural networks trained with gradient descent. We use techniques from statistical mechanics to derive an analytical expression for generalization error applicable to any kernel or data distribution. We present applications of our theory to real and synthetic datasets, and for many kernels including those that arise from training deep neural networks in the infinite-width limit. We elucidate an inductive bias of kernel regression to explain data with "simple functions", which are identified by solving a kernel eigenfunction problem on the data distribution. This notion of simplicity allows us to characterize whether a kernel is compatible with a learning task, facilitating good generalization performance from a small number of training examples. We show that more data may impair generalization when noisy or not expressible by the kernel, leading to non-monotonic learning curves with possibly many peaks. To further understand these phenomena, we turn to the broad class of rotation invariant kernels, which is relevant to training deep neural networks in the infinite-width limit, and present a detailed mathematical analysis of them when data is drawn from a spherically symmetric distribution and the number of input dimensions is large.  ( 3 min )
    Multi-Task Neural Processes. (arXiv:2110.14953v3 [cs.LG] UPDATED)
    Neural Processes (NPs) consider a task as a function realized from a stochastic process and flexibly adapt to unseen tasks through inference on functions. However, naive NPs can model data from only a single stochastic process and are designed to infer each task independently. Since many real-world data represent a set of correlated tasks from multiple sources (e.g., multiple attributes and multi-sensor data), it is beneficial to infer them jointly and exploit the underlying correlation to improve the predictive performance. To this end, we propose Multi-Task Neural Processes (MTNPs), an extension of NPs designed to jointly infer tasks realized from multiple stochastic processes. We build MTNPs in a hierarchical way such that inter-task correlation is considered by conditioning all per-task latent variables on a single global latent variable. In addition, we further design our MTNPs so that they can address multi-task settings with incomplete data (i.e., not all tasks share the same set of input points), which has high practical demands in various applications. Experiments demonstrate that MTNPs can successfully model multiple tasks jointly by discovering and exploiting their correlations in various real-world data such as time series of weather attributes and pixel-aligned visual modalities.  ( 2 min )
    Variance reduced stochastic optimization over directed graphs with row and column stochastic weights. (arXiv:2202.03346v1 [math.OC])
    This paper proposes AB-SAGA, a first-order distributed stochastic optimization method to minimize a finite-sum of smooth and strongly convex functions distributed over an arbitrary directed graph. AB-SAGA removes the uncertainty caused by the stochastic gradients using a node-level variance reduction and subsequently employs network-level gradient tracking to address the data dissimilarity across the nodes. Unlike existing methods that use the nonlinear push-sum correction to cancel the imbalance caused by the directed communication, the consensus updates in AB-SAGA are linear and uses both row and column stochastic weights. We show that for a constant step-size, AB-SAGA converges linearly to the global optimal. We quantify the directed nature of the underlying graph using an explicit directivity constant and characterize the regimes in which AB-SAGA achieves a linear speed-up over its centralized counterpart. Numerical experiments illustrate the convergence of AB-SAGA for strongly convex and nonconvex problems.  ( 2 min )
    Capacity of Group-invariant Linear Readouts from Equivariant Representations: How Many Objects can be Linearly Classified Under All Possible Views?. (arXiv:2110.07472v4 [cs.LG] UPDATED)
    Equivariance has emerged as a desirable property of representations of objects subject to identity-preserving transformations that constitute a group, such as translations and rotations. However, the expressivity of a representation constrained by group equivariance is still not fully understood. We address this gap by providing a generalization of Cover's Function Counting Theorem that quantifies the number of linearly separable and group-invariant binary dichotomies that can be assigned to equivariant representations of objects. We find that the fraction of separable dichotomies is determined by the dimension of the space that is fixed by the group action. We show how this relation extends to operations such as convolutions, element-wise nonlinearities, and global and local pooling. While other operations do not change the fraction of separable dichotomies, local pooling decreases the fraction, despite being a highly nonlinear operation. Finally, we test our theory on intermediate representations of randomly initialized and fully trained convolutional neural networks and find perfect agreement.  ( 2 min )
    Scaling-Translation-Equivariant Networks with Decomposed Convolutional Filters. (arXiv:1909.11193v3 [cs.LG] UPDATED)
    Encoding the scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many computer vision tasks especially when dealing with multiscale inputs. We study, in this paper, a scaling-translation-equivariant (ST-equivariant) CNN with joint convolutions across the space and the scaling group, which is shown to be both sufficient and necessary to achieve equivariance for the regular representation of the scaling-translation group ST . To reduce the model complexity and computational burden, we decompose the convolutional filters under two pre-fixed separable bases and truncate the expansion to low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation, a property which is theoretically analyzed and empirically verified. Numerical experiments demonstrate that the proposed scaling-translation-equivariant network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size.  ( 2 min )
    Efficient Local Planning with Linear Function Approximation. (arXiv:2108.05533v3 [cs.LG] UPDATED)
    We study query and computationally efficient planning algorithms with linear function approximation and a simulator. We assume that the agent only has local access to the simulator, meaning that the agent can only query the simulator at states that have been visited before. This setting is more practical than many prior works on reinforcement learning with a generative model. We propose two algorithms, named confident Monte Carlo least square policy iteration (Confident MC-LSPI) and confident Monte Carlo Politex (Confident MC-Politex) for this setting. Under the assumption that the Q-functions of all policies are linear in known features of the state-action pairs, we show that our algorithms have polynomial query and computational costs in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while these costs are independent of the size of the state space. One technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on $\ell_\infty$-bounded approximate policy iteration to show that our algorithm can learn the optimal policy for the given initial state even only with local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.  ( 2 min )
    Active learning for structural reliability: survey, general framework and benchmark. (arXiv:2106.01713v2 [stat.CO] UPDATED)
    Active learning methods have recently surged in the literature due to their ability to solve complex structural reliability problems within an affordable computational cost. These methods are designed by adaptively building an inexpensive surrogate of the original limit-state function. Examples of such surrogates include Gaussian process models which have been adopted in many contributions, the most popular ones being the efficient global reliability analysis (EGRA) and the active Kriging Monte Carlo simulation (AK-MCS), two milestone contributions in the field. In this paper, we first conduct a survey of the recent literature, showing that most of the proposed methods actually span from modifying one or more aspects of the two aforementioned methods. We then propose a generalized modular framework to build on-the-fly efficient active learning strategies by combining the following four ingredients or modules: surrogate model, reliability estimation algorithm, learning function and stopping criterion. Using this framework, we devise 39 strategies for the solution of $20$ reliability benchmark problems. The results of this extensive benchmark (more than $12,000$ reliability problems solved) are analyzed under various criteria leading to a synthesized set of recommendations for practitioners. These may be refined with a priori knowledge about the feature of the problem to solve, i.e. dimensionality and magnitude of the failure probability. This benchmark has eventually highlighted the importance of using surrogates in conjunction with sophisticated reliability estimation algorithms as a way to enhance the efficiency of the latter.  ( 2 min )
    Penalized Estimation and Forecasting of Multiple Subject Intensive Longitudinal Data. (arXiv:2007.05052v2 [stat.ME] UPDATED)
    Intensive Longitudinal Data (ILD) is increasingly available to social and behavioral scientists. With this increased availability come new opportunities for modeling and predicting complex biological, behavioral, and physiological phenomena. Despite these new opportunities psychological researchers have not taken full advantage of promising opportunities inherent to this data, the potential to forecast psychological processes at the individual level. To address this gap in the literature we present a novel modeling framework that addresses a number of topical challenges and open questions in the psychological literature on modeling dynamic processes. First, how can we model and forecast ILD when the length of individual time series and the number of variables collected are roughly equivalent, or when time series lengths are shorter than what is typically required for time series analyses? Second, how can we best take advantage of the cross-sectional (between-person) information inherent to most ILD scenarios while acknowledging individuals differ both quantitatively (e.g. in parameter magnitude) and qualitatively (e.g. in structural dynamics)? Despite the acknowledged between-person heterogeneity in many psychological processes is it possible to leverage group-level information to support improved forecasting at the individual level? In the remainder of the manuscript, we attempt to address these and other pressing questions relevant to the forecasting of multiple-subject ILD.  ( 2 min )
    Free Component Analysis: Theory, Algorithms & Applications. (arXiv:1905.01713v3 [cs.LG] UPDATED)
    We describe a method for unmixing mixtures of freely independent random variables in a manner analogous to the independent component analysis (ICA) based method for unmixing independent random variables from their additive mixtures. Random matrices play the role of free random variables in this context so the method we develop, which we call Free component analysis (FCA), unmixes matrices from additive mixtures of matrices. Thus, while the mixing model is standard, the novelty and difference in unmixing performance comes from the introduction of a new statistical criteria, derived from free probability theory, that quantify freeness analogous to how kurtosis and entropy quantify independence. We describe the theory, the various algorithms, and compare FCA to vanilla ICA which does not account for spatial or temporal structure. We highlight why the statistical criteria make FCA also vanilla despite its matricial underpinnings and show that FCA performs comparably to, and sometimes better than, (vanilla) ICA in every application, such as image and speech unmixing, where ICA has been known to succeed. Our computational experiments suggest that not-so-random matrices, such as images and short time fourier transform matrix of waveforms are (closer to being) freer "in the wild" than we might have theoretically expected.  ( 2 min )
    On the Power of Differentiable Learning versus PAC and SQ Learning. (arXiv:2108.04190v2 [cs.LG] UPDATED)
    We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends on the precision $\rho$ of the gradient calculations relative to the minibatch size $b$ (for SGD) and sample size $m$ (for GD). With fine enough precision relative to minibatch size, namely when $b \rho$ is small enough, SGD can go beyond SQ learning and simulate any sample-based learning algorithm and thus its learning power is equivalent to that of PAC learning; this extends prior work that achieved this result for $b=1$. Similarly, with fine enough precision relative to the sample size $m$, GD can also simulate any sample-based learning algorithm based on $m$ samples. In particular, with polynomially many bits of precision (i.e. when $\rho$ is exponentially small), SGD and GD can both simulate PAC learning regardless of the mini-batch size. On the other hand, when $b \rho^2$ is large enough, the power of SGD is equivalent to that of SQ learning.  ( 2 min )
    Approximation error of single hidden layer neural networks with fixed weights. (arXiv:2202.03289v1 [cs.LG])
    This paper provides an explicit formula for the approximation error of single hidden layer neural networks with two fixed weights.  ( 2 min )
    Deep Layer-wise Networks Have Closed-Form Weights. (arXiv:2006.08539v5 [stat.ML] UPDATED)
    There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network $\textit{one layer at a time}$ with only a "single forward pass" has been proposed as an alternative to bypass BP; we refer to these networks as "layer-wise" networks. We continue the work on layer-wise networks by answering two outstanding questions. First, $\textit{do they have a closed-form solution?}$ Second, $\textit{how do we know when to stop adding more layers?}$ This work proves that the kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\textit{Neural Indicator Kernel}$.  ( 2 min )
    Kanerva++: extending The Kanerva Machine with differentiable, locally block allocated latent memory. (arXiv:2103.03905v3 [cs.NE] UPDATED)
    Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. We take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block allocated latent memory. In contrast to the Kanerva Machine, we simplify the process of memory writing by treating it as a fully feed forward deterministic process, relying on the stochasticity of the read key distribution to disperse information within the memory. We demonstrate that this allocation scheme improves performance in memory conditional image generation, resulting in new state-of-the-art conditional likelihood values on binarized MNIST (<=41.58 nats/image) , binarized Omniglot (<=66.24 nats/image), as well as presenting competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet32x32.  ( 2 min )
    Out-of-Distribution Generalization in Kernel Regression. (arXiv:2106.02261v3 [stat.ML] UPDATED)
    In real word applications, data generating process for training a machine learning model often differs from what the model encounters in the test stage. Understanding how and whether machine learning models generalize under such distributional shifts have been a theoretical challenge. Here, we study generalization in kernel regression when the training and test distributions are different using methods from statistical physics. Using the replica method, we derive an analytical formula for the out-of-distribution generalization error applicable to any kernel and real datasets. We identify an overlap matrix that quantifies the mismatch between distributions for a given kernel as a key determinant of generalization performance under distribution shift. Using our analytical expressions we elucidate various generalization phenomena including possible improvement in generalization when there is a mismatch. We develop procedures for optimizing training and test distributions for a given data budget to find best and worst case generalizations under the shift. We present applications of our theory to real and synthetic datasets and for many kernels. We compare results of our theory applied to Neural Tangent Kernel with simulations of wide networks and show agreement. We analyze linear regression in further depth.  ( 2 min )
    One-Shot Neural Architecture Search via Compressive Sensing. (arXiv:1906.02869v2 [cs.LG] UPDATED)
    Neural Architecture Search remains a very challenging meta-learning problem. Several recent techniques based on parameter-sharing idea have focused on reducing the NAS running time by leveraging proxy models, leading to architectures with competitive performance compared to those with hand-crafted designs. In this paper, we propose an iterative technique for NAS, inspired by algorithms for learning low-degree sparse Boolean functions. We validate our approach on the DARTs search space (Liu et al., 2018b) and NAS-Bench-201 (Yang et al., 2020). In addition, we provide theoretical analysis via upper bounds on the number of validation error measurements needed for reliable learning, and include ablation studies to further in-depth understanding of our technique.  ( 2 min )
    Semi-Supervised Empirical Risk Minimization: Using unlabeled data to improve prediction. (arXiv:2009.00606v5 [stat.ML] UPDATED)
    We present a general methodology for using unlabeled data to design semi supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze of the effectiveness of our SSL approach in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled data to determine signal-noise combinations where SSL outperforms both supervised learning and the null model. We then use SSL in an adaptive manner based on estimation of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version is in fact not capable of improving on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. On the other hand, the adaptive model presented in this work, can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. This is shown empirically through extensive simulations, and extended to other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions.  ( 2 min )
    Accounting for shared covariates in semi-parametric Bayesian additive regression trees. (arXiv:2108.07636v3 [stat.ML] UPDATED)
    We propose some extensions to semi-parametric models based on Bayesian additive regression trees (BART). In the semi-parametric BART paradigm, the response variable is approximated by a linear predictor and a BART model, where the linear component is responsible for estimating the main effects and BART accounts for non-specified interactions and non-linearities. Previous semi-parametric models based on BART have assumed that the set of covariates in the linear predictor and the BART model are mutually exclusive in an attempt to avoid bias and poor coverage properties. The main novelty in our approach lies in the way we change the tree-generation moves in BART to deal with bias/confounding between the parametric and non-parametric components, even when they have covariates in common. This allows us to model complex interactions involving the covariates of primary interest, both among themselves and with those in the BART component. Through synthetic and real-world examples, we demonstrate that the performance of our novel semi-parametric BART is competitive when compared to regression models, alternative formulations of semi-parametric BART, and other tree-based methods. The implementation of the proposed method is available at https://github.com/ebprado/CSP-BART.  ( 2 min )
    REMAX: Relational Representation for Multi-Agent Exploration. (arXiv:2008.05214v2 [cs.LG] UPDATED)
    Training a multi-agent reinforcement learning (MARL) model with a sparse reward is generally difficult because numerous combinations of interactions among agents induce a certain outcome (i.e., success or failure). Earlier studies have tried to resolve this issue by employing an intrinsic reward to induce interactions that are helpful for learning an effective policy. However, this approach requires extensive prior knowledge for designing an intrinsic reward. To train the MARL model effectively without designing the intrinsic reward, we propose a learning-based exploration strategy to generate the initial states of a game. The proposed method adopts a variational graph autoencoder to represent a game state such that (1) the state can be compactly encoded to a latent representation by considering relationships among agents, and (2) the latent representation can be used as an effective input for a coupled surrogate model to predict an exploration score. The proposed method then finds new latent representations that maximize the exploration scores and decodes these representations to generate initial states from which the MARL model starts training in the game and thus experiences novel and rewardable states. We demonstrate that our method improves the training and performance of the MARL model more than the existing exploration methods.  ( 2 min )
    Technical Reports Compilation: Detecting the Fire Drill anti-pattern using Source Code and issue-tracking data. (arXiv:2104.15090v6 [cs.SE] UPDATED)
    Detecting the presence of project management anti-patterns (AP) currently requires experts on the matter and is an expensive endeavor. Worse, experts may introduce their individual subjectivity or bias. Using the Fire Drill AP, we first introduce a novel way to translate descriptions into detectable AP that are comprised of arbitrary metrics and events such as logged time or maintenance activities, which are mined from the underlying source code or issue-tracking data, thus making the description objective as it becomes data-based. Secondly, we demonstrate a novel method to quantify and score the deviations of real-world projects to data-based AP descriptions. Using nine real-world projects that exhibit a Fire Drill to some degree, we show how to further enhance the translated AP. The ground truth in these projects was extracted from two individual experts and consensus was found between them. Our evaluation spans four kinds of patterns, where the first is purely derived from description, the second type is enhanced by data, and the third kind is derived from data only. The fourth type then is a derivative meta-process pattern. The Fire Drill AP as translated from description only for either, source code- or issue-tracking-based detection, shows weak potential of confidently detecting the presence of the anti-pattern in a project. Enriching the AP with data from real-world projects significantly improves detection. Using patterns derived from data only leads to almost perfect correlations of the scores with the ground truth. Some APs share symptoms with the Fire Drill AP, and we conclude that the presence of similar patterns is most certainly detectable. Furthermore, any pattern that can be characteristically modeled using the proposed approach is potentially well detectable.  ( 3 min )
    Multi-layered characterization of hot stellar systems with confidence. (arXiv:2003.05777v4 [astro-ph.GA] UPDATED)
    Understanding the physical and evolutionary properties of Hot Stellar Systems (HSS) is a major challenge in astronomy. We studied the dataset on 13456 HSS of Misgeld and Hilker (2011) that includes 12763 candidate globular clusters and found multi-layered homogeneous grouping among these stellar systems. Our methods elicited eight homogeneous ellipsoidal groups at the finest sub-group level. Some of these groups have high overlap and were merged through a multi-phased syncytial algorithm motivated from Almod\'ovar-Rivera and Maitra (2020). Five groups were merged in the first phase, resulting in three complex-structured groups. Our algorithm determined further complex structure and permitted another merging phase, revealing two complex-structured groups at the highest level. A nonparametric bootstrap procedure was also used to estimate the confidence of each of our group assignments. Our group assignments generally had high confidences in classification, indicating great degree of certainty of our HSS assignments into complex-structured groups made by clustering. The physical and kinematic properties of the two groups were assessed in terms of mass, effective radius, surface density and mass-luminosity ratio. The first group consisted of older, smaller and less bright HSS while the second group consisted of brighter and younger HSS. Our analysis provides novel insight into the physical and evolutionary properties of HSS and also helps understand physical and evolutionary properties of candidate globular clusters. Further, the candidate globular clusters are seen to have very high chance of being really so rather than dwarfs or dwarf ellipticals that are also indicated to be quite distinct from each other.  ( 2 min )
    Nearest neighbor density functional estimation from inverse Laplace transform. (arXiv:1805.08342v4 [math.ST] UPDATED)
    A new approach to $L_2$-consistent estimation of a general density functional using $k$-nearest neighbor distances is proposed, where the functional under consideration is in the form of the expectation of some function $f$ of the densities at each point. The estimator is designed to be asymptotically unbiased, using the convergence of the normalized volume of a $k$-nearest neighbor ball to a Gamma distribution in the large-sample limit, and naturally involves the inverse Laplace transform of a scaled version of the function $f.$ Some instantiations of the proposed estimator recover existing $k$-nearest neighbor based estimators of Shannon and R\'enyi entropies and Kullback--Leibler and R\'enyi divergences, and discover new consistent estimators for many other functionals such as logarithmic entropies and divergences. The $L_2$-consistency of the proposed estimator is established for a broad class of densities for general functionals, and the convergence rate in mean squared error is established as a function of the sample size for smooth, bounded densities.  ( 2 min )
    Optimal Ratio for Data Splitting. (arXiv:2202.03326v1 [stat.ML])
    It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show that the optimal splitting ratio is $\sqrt{p}:1$, where $p$ is the number of parameters in a linear regression model that explains the data well.  ( 2 min )
    Linear Model with Local Differential Privacy. (arXiv:2202.02448v1 [cs.CR])
    Scientific collaborations benefit from collaborative learning of distributed sources, but remain difficult to achieve when data are sensitive. In recent years, privacy preserving techniques have been widely studied to analyze distributed data across different agencies while protecting sensitive information. Secure multiparty computation has been widely studied for privacy protection with high privacy level but intense computation cost. There are also other security techniques sacrificing partial data utility to reduce disclosure risk. A major challenge is to balance data utility and disclosure risk while maintaining high computation efficiency. In this paper, matrix masking technique is applied to encrypt data such that the secure schemes are against malicious adversaries while achieving local differential privacy. The proposed schemes are designed for linear models and can be implemented for both vertical and horizontal partitioning scenarios. Moreover, cross validation is studied to prevent overfitting and select optimal parameters without additional communication cost. Simulation results present the efficiency of proposed schemes to analyze dataset with millions of records and high-dimensional data (n << p).  ( 2 min )
    SLIDE: a surrogate fairness constraint to ensure fairness consistency. (arXiv:2202.03165v1 [stat.ML])
    As they have a vital effect on social decision makings, AI algorithms should be not only accurate and but also fair. Among various algorithms for fairness AI, learning a prediction model by minimizing the empirical risk (e.g., cross-entropy) subject to a given fairness constraint has received much attention. To avoid computational difficulty, however, a given fairness constraint is replaced by a surrogate fairness constraint as the 0-1 loss is replaced by a convex surrogate loss for classification problems. In this paper, we investigate the validity of existing surrogate fairness constraints and propose a new surrogate fairness constraint called SLIDE, which is computationally feasible and asymptotically valid in the sense that the learned model satisfies the fairness constraint asymptotically and achieves a fast convergence rate. Numerical experiments confirm that the SLIDE works well for various benchmark datasets.  ( 2 min )
    Personalized Public Policy Analysis in Social Sciences using Causal-Graphical Normalizing Flows. (arXiv:2202.03281v1 [cs.LG])
    Structural Equation/Causal Models (SEMs/SCMs) are widely used in epidemiology and social sciences to identify and analyze the average treatment effect (ATE) and conditional ATE (CATE). Traditional causal effect estimation methods such as Inverse Probability Weighting (IPW) and more recently Regression-With-Residuals (RWR) are widely used - as they avoid the challenging task of identifying the SCM parameters - to estimate ATE and CATE. However, much work remains before traditional estimation methods can be used for counterfactual inference, and for the benefit of Personalized Public Policy Analysis (P$^3$A) in the social sciences. While doctors rely on personalized medicine to tailor treatments to patients in laboratory settings (relatively closed systems), P$^3$A draws inspiration from such tailoring but adapts it for open social systems. In this article, we develop a method for counterfactual inference that we name causal-Graphical Normalizing Flow (c-GNF), facilitating P$^3$A. First, we show how c-GNF captures the underlying SCM without making any assumption about functional forms. Second, we propose a novel dequantization trick to deal with discrete variables, which is a limitation of normalizing flows in general. Third, we demonstrate in experiments that c-GNF performs on-par with IPW and RWR in terms of bias and variance for estimating the ATE, when the true functional forms are known, and better when they are unknown. Fourth and most importantly, we conduct counterfactual inference with c-GNFs, demonstrating promising empirical performance. Because IPW and RWR, like other traditional methods, lack the capability of counterfactual inference, c-GNFs will likely play a major role in tailoring personalized treatment, facilitating P$^3$A, optimizing social interventions - in contrast to the current `one-size-fits-all' approach of existing methods.  ( 2 min )
    HARFE: Hard-Ridge Random Feature Expansion. (arXiv:2202.02877v1 [stat.ML])
    We propose a random feature model for approximating high-dimensional sparse additive functions called the hard-ridge random feature expansion method (HARFE). This method utilizes a hard-thresholding pursuit-based algorithm applied to the sparse ridge regression (SRR) problem to approximate the coefficients with respect to the random feature matrix. The SRR formulation balances between obtaining sparse models that use fewer terms in their representation and ridge-based smoothing that tend to be robust to noise and outliers. In addition, we use a random sparse connectivity pattern in the random feature matrix to match the additive function assumption. We prove that the HARFE method is guaranteed to converge with a given error bound depending on the noise and the parameters of the sparse ridge regression model. Based on numerical results on synthetic data as well as on real datasets, the HARFE approach obtains lower (or comparable) error than other state-of-the-art algorithms.  ( 2 min )
    OMLT: Optimization & Machine Learning Toolkit. (arXiv:2202.02414v1 [stat.ML])
    The optimization and machine learning toolkit (OMLT) is an open-source software package incorporating neural network and gradient-boosted tree surrogate models, which have been trained using machine learning, into larger optimization problems. We discuss the advances in optimization technology that made OMLT possible and show how OMLT seamlessly integrates with the algebraic modeling language Pyomo. We demonstrate how to use OMLT for solving decision-making problems in both computer science and engineering.  ( 2 min )
    Almost Optimal Proper Learning and Testing Polynomials. (arXiv:2202.03207v1 [cs.LG])
    We give the first almost optimal polynomial-time proper learning algorithm of Boolean sparse multivariate polynomial under the uniform distribution. For $s$-sparse polynomial over $n$ variables and $\epsilon=1/s^\beta$, $\beta>1$, our algorithm makes $$q_U=\left(\frac{s}{\epsilon}\right)^{\frac{\log \beta}{\beta}+O(\frac{1}{\beta})}+ \tilde O\left(s\right)\left(\log\frac{1}{\epsilon}\right)\log n$$ queries. Notice that our query complexity is sublinear in $1/\epsilon$ and almost linear in $s$. All previous algorithms have query complexity at least quadratic in $s$ and linear in $1/\epsilon$. We then prove the almost tight lower bound $$q_L=\left(\frac{s}{\epsilon}\right)^{\frac{\log \beta}{\beta}+\Omega(\frac{1}{\beta})}+ \Omega\left(s\right)\left(\log\frac{1}{\epsilon}\right)\log n,$$ Applying the reduction in~\cite{Bshouty19b} with the above algorithm, we give the first almost optimal polynomial-time tester for $s$-sparse polynomial. Our tester, for $\beta>3.404$, makes $$\tilde O\left(\frac{s}{\epsilon}\right)$$ queries.  ( 2 min )
    An Experimental Design Approach for Regret Minimization in Logistic Bandits. (arXiv:2202.02407v1 [stat.ML])
    In this work we consider the problem of regret minimization for logistic bandits. The main challenge of logistic bandits is reducing the dependence on a potentially large problem dependent constant $\kappa$ that can at worst scale exponentially with the norm of the unknown parameter $\theta_{\ast}$. Abeille et al. (2021) have applied self-concordance of the logistic function to remove this worst-case dependence providing regret guarantees like $O(d\log^2(\kappa)\sqrt{\dot\mu T}\log(|\mathcal{X}|))$ where $d$ is the dimensionality, $T$ is the time horizon, and $\dot\mu$ is the variance of the best-arm. This work improves upon this bound in the fixed arm setting by employing an experimental design procedure that achieves a minimax regret of $O(\sqrt{d \dot\mu T\log(|\mathcal{X}|)})$. Our regret bound in fact takes a tighter instance (i.e., gap) dependent regret bound for the first time in logistic bandits. We also propose a new warmup sampling algorithm that can dramatically reduce the lower order term in the regret in general and prove that it can replace the lower order term dependency on $\kappa$ to $\log^2(\kappa)$ for some instances. Finally, we discuss the impact of the bias of the MLE on the logistic bandit problem, providing an example where $d^2$ lower order regret (cf., it is $d$ for linear bandits) may not be improved as long as the MLE is used and how bias-corrected estimators may be used to make it closer to $d$.  ( 2 min )
    The Implicit Bias of Gradient Descent on Generalized Gated Linear Networks. (arXiv:2202.02649v1 [stat.ML])
    Understanding the asymptotic behavior of gradient-descent training of deep neural networks is essential for revealing inductive biases and improving network performance. We derive the infinite-time training limit of a mathematically tractable class of deep nonlinear neural networks, gated linear networks (GLNs), and generalize these results to gated networks described by general homogeneous polynomials. We study the implications of our results, focusing first on two-layer GLNs. We then apply our theoretical predictions to GLNs trained on MNIST and show how architectural constraints and the implicit bias of gradient descent affect performance. Finally, we show that our theory captures a substantial portion of the inductive bias of ReLU networks. By making the inductive bias explicit, our framework is poised to inform the development of more efficient, biologically plausible, and robust learning algorithms.  ( 2 min )
    Gaussian Graphical Models as an Ensemble Method for Distributed Gaussian Processes. (arXiv:2202.03287v1 [cs.LG])
    Distributed Gaussian process (DGP) is a popular approach to scale GP to big data which divides the training data into some subsets, performs local inference for each partition, and aggregates the results to acquire global prediction. To combine the local predictions, the conditional independence assumption is used which basically means there is a perfect diversity between the subsets. Although it keeps the aggregation tractable, it is often violated in practice and generally yields poor results. In this paper, we propose a novel approach for aggregating the Gaussian experts' predictions by Gaussian graphical model (GGM) where the target aggregation is defined as an unobserved latent variable and the local predictions are the observed variables. We first estimate the joint distribution of latent and observed variables using the Expectation-Maximization (EM) algorithm. The interaction between experts can be encoded by the precision matrix of the joint distribution and the aggregated predictions are obtained based on the property of conditional Gaussian distribution. Using both synthetic and real datasets, our experimental evaluations illustrate that our new method outperforms other state-of-the-art DGP approaches.  ( 2 min )
    Discovering Distribution Shifts using Latent Space Representations. (arXiv:2202.02339v1 [cs.LG])
    Rapid progress in representation learning has led to a proliferation of embedding models, and to associated challenges of model selection and practical application. It is non-trivial to assess a model's generalizability to new, candidate datasets and failure to generalize may lead to poor performance on downstream tasks. Distribution shifts are one cause of reduced generalizability, and are often difficult to detect in practice. In this paper, we use the embedding space geometry to propose a non-parametric framework for detecting distribution shifts, and specify two tests. The first test detects shifts by establishing a robustness boundary, determined by an intelligible performance criterion, for comparing reference and candidate datasets. The second test detects shifts by featurizing and classifying multiple subsamples of two datasets as in-distribution and out-of-distribution. In evaluation, both tests detect model-impacting distribution shifts, in various shift scenarios, for both synthetic and real-world datasets.  ( 2 min )
    Riemannian Score-Based Generative Modeling. (arXiv:2202.02763v1 [cs.LG])
    Score-based generative models (SGMs) are a novel class of generative models demonstrating remarkable empirical performance. One uses a diffusion to add progressively Gaussian noise to the data, while the generative model is a "denoising" process obtained by approximating the time-reversal of this "noising" diffusion. However, current SGMs make the underlying assumption that the data is supported on a Euclidean manifold with flat geometry. This prevents the use of these models for applications in robotics, geoscience or protein modeling which rely on distributions defined on Riemannian manifolds. To overcome this issue, we introduce Riemannian Score-based Generative Models (RSGMs) which extend current SGMs to the setting of compact Riemannian manifolds. We illustrate our approach with earth and climate science data and show how RSGMs can be accelerated by solving a Schr\"odinger bridge problem on manifolds.  ( 2 min )
    Generalized Causal Tree for Uplift Modeling. (arXiv:2202.02416v1 [stat.ME])
    Uplift modeling is crucial in various applications ranging from marketing and policy-making to personalized recommendations. The main objective is to learn optimal treatment allocations for a heterogeneous population. A primary line of existing work modifies the loss function of the decision tree algorithm to identify cohorts with heterogeneous treatment effects. Another line of work estimates the individual treatment effects separately for the treatment group and the control group using off-the-shelf supervised learning algorithms. The former approach that directly models the heterogeneous treatment effect is known to outperform the latter in practice. However, the existing tree-based methods are mostly limited to a single treatment and a single control use case, except for a handful of extensions to multiple discrete treatments. In this paper, we fill this gap in the literature by proposing a generalization to the tree-based approaches to tackle multiple discrete and continuous-valued treatments. We focus on a generalization of the well-known causal tree algorithm due to its desirable statistical properties, but our generalization technique can be applied to other tree-based approaches as well. We perform extensive experiments to showcase the efficacy of our method when compared to other methods.  ( 2 min )
    Membership Inference Attacks and Defenses in Neural Network Pruning. (arXiv:2202.03335v1 [cs.CR])
    Neural network pruning has been an essential technique to reduce the computation and memory requirements for using deep neural networks for resource-constrained devices. Most existing research focuses primarily on balancing the sparsity and accuracy of a pruned neural network by strategically removing insignificant parameters and retraining the pruned model. Such efforts on reusing training samples pose serious privacy risks due to increased memorization, which, however, has not been investigated yet. In this paper, we conduct the first analysis of privacy risks in neural network pruning. Specifically, we investigate the impacts of neural network pruning on training data privacy, i.e., membership inference attacks. We first explore the impact of neural network pruning on prediction divergence, where the pruning process disproportionately affects the pruned model's behavior for members and non-members. Meanwhile, the influence of divergence even varies among different classes in a fine-grained manner. Enlighten by such divergence, we proposed a self-attention membership inference attack against the pruned neural networks. Extensive experiments are conducted to rigorously evaluate the privacy impacts of different pruning approaches, sparsity levels, and adversary knowledge. The proposed attack shows the higher attack performance on the pruned models when compared with eight existing membership inference attacks. In addition, we propose a new defense mechanism to protect the pruning process by mitigating the prediction divergence based on KL-divergence distance, whose effectiveness has been experimentally demonstrated to effectively mitigate the privacy risks while maintaining the sparsity and accuracy of the pruned models.  ( 2 min )
    A Variational Edge Partition Model for Supervised Graph Representation Learning. (arXiv:2202.03233v1 [stat.ML])
    Graph neural networks (GNNs), which propagate the node features through the edges and learn how to transform the aggregated features under label supervision, have achieved great success in supervised feature extraction for both node-level and graph-level classification tasks. However, GNNs typically treat the graph structure as given and ignore how the edges are formed. This paper introduces a graph generative process to model how the observed edges are generated by aggregating the node interactions over a set of overlapping node communities, each of which contributes to the edges via a logical OR mechanism. Based on this generative model, we partition each edge into the summation of multiple community-specific weighted edges and use them to define community-specific GNNs. A variational inference framework is proposed to jointly learn a GNN based inference network that partitions the edges into different communities, these community-specific GNNs, and a GNN based predictor that combines community-specific GNNs for the end classification task. Extensive evaluations on real-world graph datasets have verified the effectiveness of the proposed method in learning discriminative representations for both node-level and graph-level classification tasks.  ( 2 min )
    Distributionally Robust Fair Principal Components via Geodesic Descents. (arXiv:2202.03071v1 [cs.LG])
    Principal component analysis is a simple yet useful dimensionality reduction technique in modern machine learning pipelines. In consequential domains such as college admission, healthcare and credit approval, it is imperative to take into account emerging criteria such as the fairness and the robustness of the learned projection. In this paper, we propose a distributionally robust optimization problem for principal component analysis which internalizes a fairness criterion in the objective function. The learned projection thus balances the trade-off between the total reconstruction error and the reconstruction error gap between subgroups, taken in the min-max sense over all distributions in a moment-based ambiguity set. The resulting optimization problem over the Stiefel manifold can be efficiently solved by a Riemannian subgradient descent algorithm with a sub-linear convergence rate. Our experimental results on real-world datasets show the merits of our proposed method over state-of-the-art baselines.  ( 2 min )
    Introducing explainable supervised machine learning into interactive feedback loops for statistical production system. (arXiv:2202.03212v1 [cs.LG])
    Statistical production systems cover multiple steps from the collection, aggregation, and integration of data to tasks like data quality assurance and dissemination. While the context of data quality assurance is one of the most promising fields for applying machine learning, the lack of curated and labeled training data is often a limiting factor. The statistical production system for the Centralised Securities Database features an interactive feedback loop between data collected by the European Central Bank and data quality assurance performed by data quality managers at National Central Banks. The quality assurance feedback loop is based on a set of rule-based checks for raising exceptions, upon which the user either confirms the data or corrects an actual error. In this paper we use the information received from this feedback loop to optimize the exceptions presented to the National Central Banks thereby improving the quality of exceptions generated and the time consumed on the system by the users authenticating those exceptions. For this approach we make use of explainable supervised machine learning to (a) identify the types of exceptions and (b) to prioritize which exceptions are more likely to require an intervention or correction by the NCBs. Furthermore, we provide an explainable AI taxonomy aiming to identify the different explainable AI needs that arose during the project.  ( 2 min )
    Structure-Aware Transformer for Graph Representation Learning. (arXiv:2202.03036v1 [stat.ML])
    The Transformer architecture has gained growing attention in graph representation learning recently, as it naturally overcomes several limitations of graph neural networks (GNNs) by avoiding their strict structural inductive biases and instead only encoding the graph structure via positional encoding. Here, we show that the node representations generated by the Transformer with positional encoding do not necessarily capture structural similarity between them. To address this issue, we propose the Structure-Aware Transformer, a class of simple and flexible graph transformers built upon a new self-attention mechanism. This new self-attention incorporates structural information into the original self-attention by extracting a subgraph representation rooted at each node before computing the attention. We propose several methods for automatically generating the subgraph representation and show theoretically that the resulting representations are at least as expressive as the subgraph representations. Empirically, our method achieves state-of-the-art performance on five graph prediction benchmarks. Our structure-aware framework can leverage any existing GNN to extract the subgraph representation, and we show that it systematically improves performance relative to the base GNN model, successfully combining the advantages of GNNs and transformers.  ( 2 min )
    Bayesian Linear Bandits for Large-Scale Recommender Systems. (arXiv:2202.03167v1 [cs.LG])
    Potentially, taking advantage of available side information boosts the performance of recommender systems; nevertheless, with the rise of big data, the side information has often several dimensions. Hence, it is imperative to develop decision-making algorithms that can cope with such a high-dimensional context in real-time. That is especially challenging when the decision-maker has a variety of items to recommend. In this paper, we build upon the linear contextual multi-armed bandit framework to address this problem. We develop a decision-making policy for a linear bandit problem with high-dimensional context vectors and several arms. Our policy employs Thompson sampling and feeds it with reduced context vectors, where the dimensionality reduction follows by random projection. Our proposed recommender system follows this policy to learn online the item preferences of users while keeping its runtime as low as possible. We prove a regret bound that scales as a factor of the reduced dimension instead of the original one. For numerical evaluation, we use our algorithm to build a recommender system and apply it to real-world datasets. The theoretical and numerical results demonstrate the effectiveness of our proposed algorithm compared to the state-of-the-art in terms of computational complexity and regret performance.  ( 2 min )
    NUQ: Nonparametric Uncertainty Quantification for Deterministic Neural Networks. (arXiv:2202.03101v1 [stat.ML])
    This paper proposes a fast and scalable method for uncertainty quantification of machine learning models' predictions. First, we show the principled way to measure the uncertainty of predictions for a classifier based on Nadaraya-Watson's nonparametric estimate of the conditional label distribution. Importantly, the approach allows to disentangle explicitly aleatoric and epistemic uncertainties. The resulting method works directly in the feature space. However, one can apply it to any neural network by considering an embedding of the data induced by the network. We demonstrate the strong performance of the method in uncertainty estimation tasks on a variety of real-world image datasets, such as MNIST, SVHN, CIFAR-100 and several versions of ImageNet.  ( 2 min )
    HERMES: Hybrid Error-corrector Model with inclusion of External Signals for nonstationary fashion time series. (arXiv:2202.03224v1 [eess.SP])
    Developing models and algorithms to draw causal inference for time series is a long standing statistical problem. It is crucial for many applications, in particular for fashion or retail industries, to make optimal inventory decisions and avoid massive wastes. By tracking thousands of fashion trends on social media with state-of-the-art computer vision approaches, we propose a new model for fashion time series forecasting. Our contribution is twofold. We first provide publicly the first fashion dataset gathering 10000 weekly fashion time series. As influence dynamics are the key of emerging trend detection, we associate with each time series an external weak signal representing behaviors of influencers. Secondly, to leverage such a complex and rich dataset, we propose a new hybrid forecasting model. Our approach combines per-time-series parametric models with seasonal components and a global recurrent neural network to include sporadic external signals. This hybrid model provides state-of-the-art results on the proposed fashion dataset, on the weekly time series of the M4 competition \cite{makridakis2018m4}, and illustrates the benefit of the contribution of external weak signals.  ( 2 min )
    Metric-valued regression. (arXiv:2202.03045v1 [cs.LG])
    We propose an efficient algorithm for learning mappings between two metric spaces, $\X$ and $\Y$. Our procedure is strongly Bayes-consistent whenever $\X$ and $\Y$ are topologically separable and $\Y$ is "bounded in expectation" (our term; the separability assumption can be somewhat weakened). At this level of generality, ours is the first such learnability result for unbounded loss in the agnostic setting. Our technique is based on metric medoids (a variant of Fr\'echet means) and presents a significant departure from existing methods, which, as we demonstrate, fail to achieve Bayes-consistency on general instance- and label-space metrics. Our proofs introduce the technique of {\em semi-stable compression}, which may be of independent interest.  ( 2 min )
    On the expressivity of bi-Lipschitz normalizing flows. (arXiv:2107.07232v2 [cs.LG] UPDATED)
    An invertible function is bi-Lipschitz if both the function and its inverse have bounded Lipschitz constants. Nowadays, most Normalizing Flows are bi-Lipschitz by design or by training to limit numerical errors (among other things). In this paper, we discuss the expressivity of bi-Lipschitz Normalizing Flows and identify several target distributions that are difficult to approximate using such models. Then, we characterize the expressivity of bi-Lipschitz Normalizing Flows by giving several lower bounds on the Total Variation distance between these particularly unfavorable distributions and their best possible approximation. Finally, we discuss potential remedies which include using more complex latent distributions.  ( 2 min )
    Anticorrelated Noise Injection for Improved Generalization. (arXiv:2202.02831v1 [stat.ML])
    Injecting artificial noise into gradient descent (GD) is commonly employed to improve the performance of machine learning models. Usually, uncorrelated noise is used in such perturbed gradient descent (PGD) methods. It is, however, not known if this is optimal or whether other types of noise could provide better generalization performance. In this paper, we zoom in on the problem of correlating the perturbations of consecutive PGD steps. We consider a variety of objective functions for which we find that GD with anticorrelated perturbations ("Anti-PGD") generalizes significantly better than GD and standard (uncorrelated) PGD. To support these experimental findings, we also derive a theoretical analysis that demonstrates that Anti-PGD moves to wider minima, while GD and PGD remain stuck in suboptimal regions or even diverge. This new connection between anticorrelated noise and generalization opens the field to novel ways to exploit noise for training machine learning models.  ( 2 min )
    Neural Tangent Kernel Analysis of Deep Narrow Neural Networks. (arXiv:2202.02981v1 [cs.LG])
    The tremendous recent progress in analyzing the training dynamics of overparameterized neural networks has primarily focused on wide networks and therefore does not sufficiently address the role of depth in deep learning. In this work, we present the first trainability guarantee of infinitely deep but narrow neural networks. We study the infinite-depth limit of a multilayer perceptron (MLP) with a specific initialization and establish a trainability guarantee using the NTK theory. We then extend the analysis to an infinitely deep convolutional neural network (CNN) and perform brief experiments  ( 2 min )
    Algorithms that get old : the case of generative algorithms. (arXiv:2202.03008v1 [stat.ML])
    Generative IA networks, like the Variational Auto-Encoders (VAE), and Generative Adversarial Networks (GANs) produce new objects each time when asked to do so. However, this behavior is unlike that of human artists that change their style as times go by and seldom return to the initial point. We investigate a situation where VAEs are requested to sample from a probability measure described by some empirical set. Based on recent works on Radon-Sobolev statistical distances, we propose a numerical paradigm, to be used in conjunction with a generative algorithm, that satisfies the two following requirements: the objects created do not repeat and evolve to fill the entire target probability measure.  ( 2 min )
    Unsupervised physics-informed disentanglement of multimodal data for high-throughput scientific discovery. (arXiv:2202.03242v1 [cs.LG])
    We introduce physics-informed multimodal autoencoders (PIMA) - a variational inference framework for discovering shared information in multimodal scientific datasets representative of high-throughput testing. Individual modalities are embedded into a shared latent space and fused through a product of experts formulation, enabling a Gaussian mixture prior to identify shared features. Sampling from clusters allows cross-modal generative modeling, with a mixture of expert decoder imposing inductive biases encoding prior scientific knowledge and imparting structured disentanglement of the latent space. This approach enables discovery of fingerprints which may be detected in high-dimensional heterogeneous datasets, avoiding traditional bottlenecks related to high-fidelity measurement and characterization. Motivated by accelerated co-design and optimization of materials manufacturing processes, a dataset of lattice metamaterials from metal additive manufacturing demonstrates accurate cross modal inference between images of mesoscale topology and mechanical stress-strain response.  ( 2 min )
    Local Explanation of Dialogue Response Generation. (arXiv:2106.06528v2 [cs.CL] UPDATED)
    In comparison to the interpretation of classification models, the explanation of sequence generation models is also an important problem, however it has seen little attention. In this work, we study model-agnostic explanations of a representative text generation task -- dialogue response generation. Dialog response generation is challenging with its open-ended sentences and multiple acceptable responses. To gain insights into the reasoning process of a generation model, we propose a new method, local explanation of response generation (LERG) that regards the explanations as the mutual interaction of segments in input and output sentences. LERG views the sequence prediction as uncertainty estimation of a human response and then creates explanations by perturbing the input and calculating the certainty change over the human response. We show that LERG adheres to desired properties of explanations for text generation including unbiased approximation, consistency and cause identification. Empirically, our results show that our method consistently improves other widely used methods on proposed automatic- and human- evaluation metrics for this new task by 4.4-12.8%. Our analysis demonstrates that LERG can extract both explicit and implicit relations between input and output segments.  ( 2 min )
    Heterogeneous Overdispersed Count Data Regressions via Double Penalized Estimations. (arXiv:2110.03552v2 [stat.ME] UPDATED)
    This paper studies the non-asymptotic merits of the double $\ell_1$-regularized for heterogeneous overdispersed count data via negative binomial regressions. Under the restricted eigenvalue conditions, we prove the oracle inequalities for Lasso estimators of two partial regression coefficients for the first time, using concentration inequalities of empirical processes. Furthermore, derived from the oracle inequalities, the consistency and convergence rate for the estimators are the theoretical guarantees for further statistical inference. Finally, both simulations and a real data analysis demonstrate that the new methods are effective.  ( 2 min )
    Grassmann Stein Variational Gradient Descent. (arXiv:2202.03297v1 [stat.ML])
    Stein variational gradient descent (SVGD) is a deterministic particle inference algorithm that provides an efficient alternative to Markov chain Monte Carlo. However, SVGD has been found to suffer from variance underestimation when the dimensionality of the target distribution is high. Recent developments have advocated projecting both the score function and the data onto real lines to sidestep this issue, although this can severely overestimate the epistemic (model) uncertainty. In this work, we propose Grassmann Stein variational gradient descent (GSVGD) as an alternative approach, which permits projections onto arbitrary dimensional subspaces. Compared with other variants of SVGD that rely on dimensionality reduction, GSVGD updates the projectors simultaneously for the score function and the data, and the optimal projectors are determined through a coupled Grassmann-valued diffusion process which explores favourable subspaces. Both our theoretical and experimental results suggest that GSVGD enjoys efficient state-space exploration in high-dimensional problems that have an intrinsic low-dimensional structure.  ( 2 min )
    Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization. (arXiv:2107.01131v2 [stat.ML] UPDATED)
    Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation not only yields a new unified theoretical framework encompassing popular variational MI bounds but also leads to a novel, simple, and powerful contrastive MI estimator named as FLO. Theoretically, we show that the FLO estimator is tight, and it provably converges under stochastic gradient descent. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.  ( 2 min )
    Hedging using reinforcement learning: Contextual $k$-Armed Bandit versus $Q$-learning. (arXiv:2007.01623v2 [cs.LG] UPDATED)
    The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. In real markets, continuous replication, such as in the model of Black, Scholes and Merton (BSM), is not only unrealistic but it is also undesirable due to high transaction costs. A variety of methods have been proposed to balance between effective replication and losses in the incomplete market setting. With the rise of Artificial Intelligence (AI), AI-based hedgers have attracted considerable interest, where particular attention was given to Recurrent Neural Network systems and variations of the $Q$-learning algorithm. From a practical point of view, sufficient samples for training such an AI can only be obtained from a simulator of the market environment. Yet if an agent was trained solely on simulated data, the run-time performance will primarily reflect the accuracy of the simulation, which leads to the classical problem of model choice and calibration. In this article, the hedging problem is viewed as an instance of a risk-averse contextual $k$-armed bandit problem, which is motivated by the simplicity and sample-efficiency of the architecture. This allows for realistic online model updates from real-world data. We find that the $k$-armed bandit model naturally fits to the Profit and Loss formulation of hedging, providing for a more accurate and sample efficient approach than $Q$-learning and reducing to the Black-Scholes model in the absence of transaction costs and risks.  ( 2 min )
    Using Partial Monotonicity in Submodular Maximization. (arXiv:2202.03051v1 [cs.LG])
    Over the last two decades, submodular function maximization has been the workhorse of many discrete optimization problems in machine learning applications. Traditionally, the study of submodular functions was based on binary function properties. However, such properties have an inherit weakness, namely, if an algorithm assumes functions that have a particular property, then it provides no guarantee for functions that violate this property, even when the violation is very slight. Therefore, recent works began to consider continuous versions of function properties. Probably the most significant among these (so far) are the submodularity ratio and the curvature, which were studied extensively together and separately. The monotonicity property of set functions plays a central role in submodular maximization. Nevertheless, and despite all the above works, no continuous version of this property has been suggested to date (as far as we know). This is unfortunate since submoduar functions that are almost monotone often arise in machine learning applications. In this work we fill this gap by defining the monotonicity ratio, which is a continues version of the monotonicity property. We then show that for many standard submodular maximization algorithms one can prove new approximation guarantees that depend on the monotonicity ratio; leading to improved approximation ratios for the common machine learning applications of movie recommendation, quadratic programming and image summarization.  ( 2 min )
    A Dimension-Insensitive Algorithm for Stochastic Zeroth-Order Optimization. (arXiv:2104.11283v3 [math.OC] UPDATED)
    This paper concerns a convex, stochastic zeroth-order optimization (S-ZOO) problem. The objective is to minimize the expectation of a cost function whose gradient is not directly accessible. For this problem, traditional optimization algorithms mostly yield query complexities that grow polynomially with dimensionality (the number of decision variables). Consequently, these methods may not perform well in solving massive-dimensional problems arising in many modern applications. Although more recent methods can be provably dimension-insensitive, almost all of them require arguably more stringent conditions such as everywhere sparse or compressible gradient. In this paper, we propose a sparsity-inducing stochastic gradient-free (SI-SGF) algorithm, which provably yields a dimension-free (up to a logarithmic term) query complexity in both convex and strongly convex cases. Such insensitivity to the dimensionality growth is proven, for the first time, to be achievable when neither gradient sparsity nor gradient compressibility is satisfied. Our numerical results demonstrate a consistency between our theoretical prediction and the empirical performance.  ( 2 min )
    SODA: Self-organizing data augmentation in deep neural networks -- Application to biomedical image segmentation tasks. (arXiv:2202.03223v1 [stat.ML])
    In practice, data augmentation is assigned a predefined budget in terms of newly created samples per epoch. When using several types of data augmentation, the budget is usually uniformly distributed over the set of augmentations but one can wonder if this budget should not be allocated to each type in a more efficient way. This paper leverages online learning to allocate on the fly this budget as part of neural network training. This meta-algorithm can be run at almost no extra cost as it exploits gradient based signals to determine which type of data augmentation should be preferred. Experiments suggest that this strategy can save computation time and thus goes in the way of greener machine learning practices.  ( 2 min )
    Learning Optimal Resource Allocations in Wireless Systems. (arXiv:1807.08088v3 [cs.LG] UPDATED)
    This paper considers the design of optimal resource allocation policies in wireless communication systems which are generically modeled as a functional optimization problem with stochastic constraints. These optimization problems have the structure of a learning problem in which the statistical loss appears as a constraint, motivating the development of learning methodologies to attempt their solution. To handle stochastic constraints, training is undertaken in the dual domain. It is shown that this can be done with small loss of optimality when using near-universal learning parameterizations. In particular, since deep neural networks (DNN) are near-universal their use is advocated and explored. DNNs are trained here with a model-free primal-dual method that simultaneously learns a DNN parametrization of the resource allocation policy and optimizes the primal and dual variables. Numerical simulations demonstrate the strong performance of the proposed approach on a number of common wireless resource allocation problems.  ( 2 min )
    Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation. (arXiv:2202.02628v1 [cs.LG])
    Data poisoning attacks aim at manipulating model behaviors through distorting training data. Previously, an aggregation-based certified defense, Deep Partition Aggregation (DPA), was proposed to mitigate this threat. DPA predicts through an aggregation of base classifiers trained on disjoint subsets of data, thus restricting its sensitivity to dataset distortions. In this work, we propose an improved certified defense against general poisoning attacks, namely Finite Aggregation. In contrast to DPA, which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets and then combines duplicates of them to build larger (but not disjoint) subsets for training base classifiers. This reduces the worst-case impacts of poison samples and thus improves certified robustness bounds. In addition, we offer an alternative view of our method, bridging the designs of deterministic and stochastic aggregation-based certified defenses. Empirically, our proposed Finite Aggregation consistently improves certificates on MNIST, CIFAR-10, and GTSRB, boosting certified fractions by up to 3.05%, 3.87% and 4.77%, respectively, while keeping the same clean accuracies as DPA's, effectively establishing a new state of the art in (pointwise) certified robustness against data poisoning.  ( 2 min )
    Efficient Logistic Regression with Local Differential Privacy. (arXiv:2202.02650v1 [cs.CR])
    Internet of Things devices are expanding rapidly and generating huge amount of data. There is an increasing need to explore data collected from these devices. Collaborative learning provides a strategic solution for the Internet of Things settings but also raises public concern over data privacy. In recent years, large amount of privacy preserving techniques have been developed based on differential privacy and secure multi-party computation. A major challenge of collaborative learning is to balance disclosure risk and data utility while maintaining high computation efficiency. In this paper, we proposed privacy preserving logistic regression model using matrix encryption approach. The secure scheme achieves local differential privacy and can be implemented for both vertical and horizontal partitioning scenarios. Moreover, cross validation is investigated to generate robust model results without increasing the communication cost. Simulation illustrates the high efficiency of proposed scheme to analyze dataset with millions of records. Experimental evaluations further demonstrate high model accuracy while achieving privacy protection.  ( 2 min )
    A new similarity measure for covariate shift with applications to nonparametric regression. (arXiv:2202.02837v1 [math.ST])
    We study covariate shift in the context of nonparametric regression. We introduce a new measure of distribution mismatch between the source and target distributions that is based on the integrated ratio of probabilities of balls at a given radius. We use the scaling of this measure with respect to the radius to characterize the minimax rate of estimation over a family of H\"older continuous functions under covariate shift. In comparison to the recently proposed notion of transfer exponent, this measure leads to a sharper rate of convergence and is more fine-grained. We accompany our theory with concrete instances of covariate shift that illustrate this sharp difference.  ( 2 min )
    On Neural Differential Equations. (arXiv:2202.02435v1 [cs.LG])
    The conjoining of dynamical systems and deep learning has become a topic of great interest. In particular, neural differential equations (NDEs) demonstrate that neural networks and differential equation are two sides of the same coin. Traditional parameterised differential equations are a special case. Many popular neural network architectures, such as residual networks and recurrent networks, are discretisations. NDEs are suitable for tackling generative problems, dynamical systems, and time series (particularly in physics, finance, ...) and are thus of interest to both modern machine learning and traditional mathematical modelling. NDEs offer high-capacity function approximation, strong priors on model space, the ability to handle irregular data, memory efficiency, and a wealth of available theory on both sides. This doctoral thesis provides an in-depth survey of the field. Topics include: neural ordinary differential equations (e.g. for hybrid neural/mechanistic modelling of physical systems); neural controlled differential equations (e.g. for learning functions of irregular time series); and neural stochastic differential equations (e.g. to produce generative models capable of representing complex stochastic dynamics, or sampling from complex high-dimensional distributions). Further topics include: numerical methods for NDEs (e.g. reversible differential equations solvers, backpropagation through differential equations, Brownian reconstruction); symbolic regression for dynamical systems (e.g. via regularised evolution); and deep implicit models (e.g. deep equilibrium models, differentiable optimisation). We anticipate this thesis will be of interest to anyone interested in the marriage of deep learning with dynamical systems, and hope it will provide a useful reference for the current state of the art.  ( 2 min )
    Diversify and Disambiguate: Learning From Underspecified Data. (arXiv:2202.03418v1 [cs.LG])
    Many datasets are underspecified, which means there are several equally viable solutions for the data. Underspecified datasets can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus have widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a diverse collection of hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization. We demonstrate the ability of DivDis to find hypotheses that use robust features in image classification and natural language processing problems with underspecification.  ( 2 min )
    Learning fair representation with a parametric integral probability metric. (arXiv:2202.02943v1 [stat.ML])
    As they have a vital effect on social decision-making, AI algorithms should be not only accurate but also fair. Among various algorithms for fairness AI, learning fair representation (LFR), whose goal is to find a fair representation with respect to sensitive variables such as gender and race, has received much attention. For LFR, the adversarial training scheme is popularly employed as is done in the generative adversarial network type algorithms. The choice of a discriminator, however, is done heuristically without justification. In this paper, we propose a new adversarial training scheme for LFR, where the integral probability metric (IPM) with a specific parametric family of discriminators is used. The most notable result of the proposed LFR algorithm is its theoretical guarantee about the fairness of the final prediction model, which has not been considered yet. That is, we derive theoretical relations between the fairness of representation and the fairness of the prediction model built on the top of the representation (i.e., using the representation as the input). Moreover, by numerical experiments, we show that our proposed LFR algorithm is computationally lighter and more stable, and the final prediction model is competitive or superior to other LFR algorithms using more complex discriminators.  ( 2 min )
    Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-Start. (arXiv:2202.03397v1 [stat.ML])
    We analyze a general class of bilevel problems, in which the upper-level problem consists in the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problems include instances of meta-learning, hyperparameter optimization and data poisoning adversarial attacks. Several recent works have proposed algorithms which warm-start the lower-level problem, i.e. they use the previous lower-level approximate solution as a staring point for the lower-level solver. This warm-start procedure allows one to improve the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. We show that without warm-start, it is still possible to achieve order-wise optimal and near-optimal sample complexity for the stochastic and deterministic settings, respectively. In particular, we propose a simple method which uses stochastic fixed point iterations at the lower-level and projected inexact gradient descent at the upper-level, that reaches an $\epsilon$-stationary point using $O(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-1})$ samples for the stochastic and the deterministic setting, respectively. Compared to methods using warm-start, ours is better suited for meta-learning and yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.  ( 2 min )
    BAM: Bayes with Adaptive Memory. (arXiv:2202.02405v1 [cs.LG])
    Online learning via Bayes' theorem allows new data to be continuously integrated into an agent's current beliefs. However, a naive application of Bayesian methods in non-stationary environments leads to slow adaptation and results in state estimates that may converge confidently to the wrong parameter value. A common solution when learning in changing environments is to discard/downweight past data; however, this simple mechanism of "forgetting" fails to account for the fact that many real-world environments involve revisiting similar states. We propose a new framework, Bayes with Adaptive Memory (BAM), that takes advantage of past experience by allowing the agent to choose which past observations to remember and which to forget. We demonstrate that BAM generalizes many popular Bayesian update rules for non-stationary environments. Through a variety of experiments, we demonstrate the ability of BAM to continuously adapt in an ever-changing world.  ( 2 min )
    Pushing the Efficiency-Regret Pareto Frontier for Online Learning of Portfolios and Quantum States. (arXiv:2202.02765v1 [cs.LG])
    We revisit the classical online portfolio selection problem. It is widely assumed that a trade-off between computational complexity and regret is unavoidable, with Cover's Universal Portfolios algorithm, SOFT-BAYES and ADA-BARRONS currently constituting its state-of-the-art Pareto frontier. In this paper, we present the first efficient algorithm, BISONS, that obtains polylogarithmic regret with memory and per-step running time requirements that are polynomial in the dimension, displacing ADA-BARRONS from the Pareto frontier. Additionally, we resolve a COLT 2020 open problem by showing that a certain Follow-The-Regularized-Leader algorithm with log-barrier regularization suffers an exponentially larger dependence on the dimension than previously conjectured. Thus, we rule out this algorithm as a candidate for the Pareto frontier. We also extend our algorithm and analysis to a more general problem than online portfolio selection, viz. online learning of quantum states with log loss. This algorithm, called SCHRODINGER'S BISONS, is the first efficient algorithm with polylogarithmic regret for this more general problem.  ( 2 min )
    Beyond Black Box Densities: Parameter Learning for the Deviated Components. (arXiv:2202.02651v1 [stat.ML])
    As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density being deviated from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviating mixture model} $(1-\lambda^{*})h_0 + \lambda^{*} (\sum_{i = 1}^{k} p_{i}^{*} f(x|\theta_{i}^{*}))$, where $h_0$ is a known density function, while the deviated proportion $\lambda^{*}$ and latent mixing measure $G_{*} = \sum_{i = 1}^{k} p_{i}^{*} \delta_{\theta_i^{*}}$ associated with the mixture distribution are unknown. Via a novel notion of distinguishability between the known density $h_{0}$ and the deviated mixture distribution, we establish rates of convergence for the maximum likelihood estimates of $\lambda^{*}$ and $G^{*}$ under Wasserstein metric. Simulation studies are carried out to illustrate the theory.  ( 2 min )
    Test Set Sizing Via Random Matrix Theory. (arXiv:2112.05977v2 [stat.ML] UPDATED)
    This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying the integrity metric, i.e. the empirical model error is the actual measurement noise, and thus fairly reflects the value or lack of same of the model. This paper is the first to solve for the training and test size for any model in a way that is truly optimal. The number of data points in the training set is the root of a quartic polynomial Theorem 1 derives which depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise drop out of the calculations. The critical mathematical difficulties were realizing that the problems herein were discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, and evaluating a new integral in the style of Selberg and Aomoto. Mathematical results are supported with thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.  ( 2 min )
    Likelihood Training of Schr\"odinger Bridge using Forward-Backward SDEs Theory. (arXiv:2110.11291v3 [stat.ML] UPDATED)
    Schr\"odinger Bridge (SB) is an entropy-regularized optimal transport problem that has received increasing attention in deep generative modeling for its mathematical flexibility compared to the Scored-based Generative Model (SGM). However, it remains unclear whether the optimization principle of SB relates to the modern training of deep generative models, which often rely on constructing log-likelihood objectives.This raises questions on the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded on Forward-Backward Stochastic Differential Equations Theory - a mathematical methodology appeared in stochastic optimal control that transforms the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SB that, surprisingly, generalizes the ones for SGM as special cases. This leads to a new optimization principle that inherits the same SB optimality yet without losing applications of modern generative training techniques, and we show that the resulting training algorithm achieves comparable results on generating realistic images on MNIST, CelebA, and CIFAR10. Our code is available at https://github.com/ghliu/SB-FBSDE.  ( 2 min )
    Learning Curves for SGD on Structured Features. (arXiv:2106.02713v4 [stat.ML] UPDATED)
    The generalization performance of a machine learning algorithm such as a neural network depends in a non-trivial way on the structure of the data distribution. To analyze the influence of data structure on test loss dynamics, we study an exactly solveable model of stochastic gradient descent (SGD) on mean square loss which predicts test loss when training on features with arbitrary covariance structure. We solve the theory exactly for both Gaussian features and arbitrary features and we show that the simpler Gaussian model accurately predicts test loss of nonlinear random-feature models and deep neural networks trained with SGD on real datasets such as MNIST and CIFAR-10. We show that the optimal batch size at a fixed compute budget is typically small and depends on the feature correlation structure, demonstrating the computational benefits of SGD with small batch sizes. Lastly, we extend our theory to the more usual setting of stochastic gradient descent on a fixed subsampled training set, showing that both training and test error can be accurately predicted in our framework on real data.  ( 2 min )
    Importance Weighting Approach in Kernel Bayes' Rule. (arXiv:2202.02474v1 [stat.ML])
    We study a nonparametric approach to Bayesian computation via feature means, where the expectation of prior features is updated to yield expected posterior features, based on regression from kernel or neural net features of the observations. All quantities involved in the Bayesian update are learned from observed data, making the method entirely model-free. The resulting algorithm is a novel instance of a kernel Bayes' rule (KBR). Our approach is based on importance weighting, which results in superior numerical stability to the existing approach to KBR, which requires operator inversion. We show the convergence of the estimator using a novel consistency analysis on the importance weighting estimator in the infinity norm. We evaluate our KBR on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high dimensional image observations. The proposed method yields uniformly better empirical performance than the existing KBR, and competitive performance with other competing methods.  ( 2 min )
    Dependence model assessment and selection with DecoupleNets. (arXiv:2202.03406v1 [stat.ML])
    Neural networks are suggested for learning a map from $d$-dimensional samples with any underlying dependence structure to multivariate uniformity in $d'$ dimensions. This map, termed DecoupleNet, is used for dependence model assessment and selection. If the data-generating dependence model was known, and if it was among the few analytically tractable ones, one such transformation for $d'=d$ is Rosenblatt's transform. DecoupleNets only require an available sample and are applicable to $d'<d$, in particular $d'=2$. This allows for simpler model assessment and selection without loss of information, both numerically and, because $d'=2$, graphically. Through simulation studies based on data from various copulas, the feasibility and validity of this novel approach is demonstrated. Applications to real world data illustrate its usefulness for model assessment and selection.  ( 2 min )

  • Open

    What do you call a neural network with irregular connections?
    Most modern deep neural networks are densely connected, where every unit in one layer is connected to every unit in the next, and so on. However, is there a term for a neural network whose connections aren't as methodical? For instance, a neural network whose connections and hidden units are randomly generated? submitted by /u/Cogitarius [link] [comments]  ( 1 min )
    Research on the use of image processing to optimize the analysis of lung disorders in chest radiographs
    Hi! I am developing a research and I would like your help. The aim of my research is to analyze which data augmentation strategies are effective in radiographs of lung disorders. From these strategies, for example, it is possible to generate more useful images for training intelligent algorithms based on the state of the art of Deep Learning. In this way, we invite you to help us analyze the quality of the strategies proposed throughout the research. It is a quick survey, but extremely valuable for the continuity of my project. Any questions can ask me! Research Link: https://data-augmentation-app-en.herokuapp.com submitted by /u/alyssonmach [link] [comments]  ( 1 min )
    how to avoid regression in a neural net
    when i add a new dataset to train my model what assurence do i get that the old training data is not compromised like if i train my network to recogninse images of a car , how to be sure that the new car image that i put in the training set will not make my model miss some old photo that i have previously training my model with submitted by /u/fehmitn [link] [comments]  ( 1 min )
  • Open

    How will Google pathways change things?
    How exactly will Google pathways change AI and other things? Will it even be a big change? submitted by /u/Particular_Leader_16 [link] [comments]
    Guide on how to make an ai program that predicts a number from different input numbers
    So I have a dataset for some numbers that represents voltages and other data, now in the trainung dataset it will have an output with the line that has a value, now how can I make an ai that takes this data tries it and then all I will change is 1 column where I will edit the voltages only and leave other untouched, I wanted to jnow how to make the ai predict the output now when I edit the voltage for it? I don't need ready thing I want a guide on how to achieve it and if there is any code in github that looks near so I can understand and start working on mine, I am new to ai and need help and guidance. Thanks alot submitted by /u/Separate_Sample3117 [link] [comments]  ( 1 min )
    Question about AI chat bot software
    Was googling some questions on AI chat bots and one link led me here, hope this is the right place to ask. If not and someone could direct me to a more appropriate subreddit, I'd really appreciate that. I'm developing an app for a game I am designing. I'd like part of the app to be an in-game styled AI "assistant", where the user can chat with it, and the bot has knowledge of in-game-world facts and events and 'secret info'. I don't need it to be a conversational bot (How are you? I'm fine, and yourself? Whats your favorite food? etc etc) however I'd like it to have some basic conversational skills (maybe with a way to train it to respond to questions like "Whats your favorite..." with a response like "I'm sorry I cannot help you with that.") and ideally not just give completely canned r…  ( 2 min )
    MIT Researchers Open-Source ‘Dynamo’: A Machine Learning-Based Python Framework For Gaining Insights Into Dynamic Biological Processes
    In partnership with the University of Pittsburgh School of Medicine, researchers at MIT have developed a machine learning framework to define the mathematical equations describing a cell’s pathway from one condition to another. The framework is named “dynamo” and can also determine the underlying mechanisms that drive cell changes. Their research focused on how cells change over time rather than how they migrate through space. Continue Reading Paper: https://www.cell.com/cell/pdf/S0092-8674(21)01577-4.pdf Github: https://github.com/aristoteleo/dynamo-release ​ https://preview.redd.it/skplewn5hog81.png?width=1668&format=png&auto=webp&s=4aa84aa44b8002d30b83dd1fd0607251c7fcef0d submitted by /u/ai-lover [link] [comments]  ( 1 min )
    A.I. Is Being Boosted by Supercomputers
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Best Machine Learning Books for Data Science to read
    Data Science, as you probably know, covers a wide spectrum of domains and Machine Learning is one of them. Data Science basically comprises various fields and techniques, like Statistics and Artificial Intelligence for Data Analysis to draw meaningful insights. If you are interested to learn ML for Data Science then these Best machine Learning books 2022 will help you a lot submitted by /u/sivasiriyapureddy [link] [comments]  ( 1 min )
    Long Short Term Memory Cell Visualized
    submitted by /u/vadhavaniyafaijan [link] [comments]
    using NEAT AI where a character jumps over an obstacle (in a simple game), how to know if it wants to run or walk?
    Hi, So I wanted to try this: https://www.youtube.com/watch?v=CKFCIzPSBjE&ab_channel=CodeBucket and I want to improve it by allowing the character (in this case the dinosaur) to run or walk. Do I just add an additional +1 to the output (like if the output was jump or dont jump --> 1 output, now the output is jump / dont jump --> 1 output, walk or run --> 2nd output)? If yes, how do I know which of the 2 output is telling me to walk or run? like if it's the second one, is it that when output is > 0.50 it is run and 0.50 is walk and < 0.50 is run? and how will the dinosaur know that it is currently walking or running? do I add it in as additional input too? because I dont want it to be running, then AI result output says to run again, since it cannot run anymore as it is currently running. thank you very much. submitted by /u/askingredditsginput [link] [comments]  ( 1 min )
    Hi! I am studying AI Business consulting and I am writing a report and I ned to find good business case examples where ML have been used in supply chain forecasting. If you have any ideas and inspo, please share.
    submitted by /u/AINotBot [link] [comments]  ( 1 min )
    Substrates of Mind: This is getting real... Working on a system to generate visualizations of my internal mental/emotional state
    submitted by /u/general_gengen [link] [comments]  ( 1 min )
    Artificial Nightmares : James Webb Telescope Images 02 || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    [P] The Cedille paper is out! (largest French language model, openly available)
    Enfin ! The CedilleAI paper is out 🥖🇫🇷 https://arxiv.org/abs/2202.03371 Cedille is a finetuned version of GPT-J in French, check out the launch post for previous discussions: https://www.reddit.com/r/MachineLearning/comments/qqzuh0/p_cedille_the_largest_french_language_model_6b/ Cedille juuust beats GPT-3 (davinci) in French language perplexity. This is great news given that Cedille is ~30x smaller than GPT-3. It’s also the first French language model that is competitive at zero-shot/instruction-based tasks. Cedille beats GPT-3 for en→fr translation (WMT14) and other large French models at summarization (OrangeSum). We analysed Cedille’s toxicity using Detoxify, following the approach from the RealToxicityPrompt paper, and it’s generally better behaved than its peers (GPT-2 is even less toxic, but it writes gibberish French, so that’s cheating!) In general, Cedille is a testament to the efficiency of finetuning compared to training from scratch. Our experiments also indicate no major advantage of using dedicated French tokenizers (we are using GPT-2 tokenizer). Cedille is available on HuggingFace and can be tried out on our playground: app.cedille.ai. API keys are available, though the API is still in beta. submitted by /u/MasterScrat [link] [comments]  ( 1 min )
    [D] Paper Discussion Video - Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents (+Author Interview)
    https://youtu.be/OUCwujwE7bA In this video: Paper explanation, followed by first author interview with Wenlong Huang. Large language models contain extraordinary amounts of world knowledge that can be queried in various ways. But their output format is largely uncontrollable. This paper investigates the VirtualHome environment, which expects a particular set of actions, objects, and verbs to be used. Turns out, with proper techniques and only using pre-trained models (no fine-tuning), one can translate unstructured language model outputs into the structured grammar of the environment. This is potentially very useful anywhere where the models' world knowledge needs to be provided in a particular structured format. ​ OUTLINE: 0:00 - Intro & Overview 2:45 - The VirtualHome environment 6:25 - The problem of plan evaluation 8:40 - Contributions of this paper 16:40 - Start of interview 24:00 - How to use language models with environments? 34:00 - What does model size matter? 40:00 - How to fix the large models' outputs? 55:00 - Possible improvements to the translation procedure 59:00 - Why does Codex perform so well? 1:02:15 - Diving into experimental results 1:14:15 - Future outlook ​ Paper: https://arxiv.org/abs/2201.07207 Website: https://wenlong.page/language-planner/ Code: https://github.com/huangwl18/language-planner Wenlong's Twitter: https://twitter.com/wenlong_huang submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [P] Machine Learning Simplified Book
    Hello everyone. My name is Andrew and for several years I've been working on to make the learning path for ML easier. I wrote a manual on machine learning that everyone understands - Machine Learning Simplified Book. The main purpose of my book is to build an intuitive understanding of how algorithms work through basic examples. In order to understand the presented material, it is enough to know basic mathematics and linear algebra. After reading this book, you will know the basics of supervised learning, understand complex mathematical models, understand the entire pipeline of a typical ML project, and also be able to share your knowledge with colleagues from related industries and with technical professionals. And for those who find the theoretical part not enough - I supplemented the book with a repository on GitHub, which has Python implementation of every method and algorithm that I describe in each chapter. You can read the book absolutely free at the link below: -> https://themlsbook.com I would appreciate it if you recommend my book to those who might be interested in this topic, as well as for any feedback provided. Thanks! (attaching one of the pipelines described in the book).; ​ https://preview.redd.it/ryvu8670tog81.png?width=1572&format=png&auto=webp&s=16756a6098b162f6bd6d771be765d34a16f6b0c4 submitted by /u/5x12 [link] [comments]  ( 1 min )
    [D] Accelerated stochastic optimization for convex problems
    Can someone advise accelerated optimization method that uses convexity of the problem, that being able to work with very large input dimension and still being able to work with noisy gradient (input observations are noisy)? There are plenty of papers on the topic, most without PyTorch implementation, and it will be quite time-consuming to try them all. submitted by /u/likan_blk [link] [comments]  ( 1 min )
    Trying to improve performance of dynamic time warping in matching curve shapes [D]
    Hello, so I am new to the practice of dynamic time warping for matching curve shapes from time series data, and wanted to reach out to this community for some potential feedback on whether I am on the right track with my project and how I can improve it. I wanted to reach out to this community since understanding how to implement a curve-shape matching function through dynamic time warping (DTW) seems like it is an essential part for me to understand how I can leverage machine learning for further automated classification of time series data (Even though I realize DTW may not "technically" be a form a machine learning). I have time series data which consists of many lines making up a wide range of curve shapes. These lines start at different time stamps on the x-axis, and so I send all of…  ( 3 min )
    [N] Experiment tracking with DvC and Guild AI
    I'm the author of Guild AI (open source experiment tracking). For some time now Guild users have asked for DvC support. This is now available as a pre-release. If you're using DvC to manage data sets you may be interested in this integration. You can use Guild to run DvC stages as experiments. Guild's approach to running experiments is quite different from DvC's in that the project code is copied to a new run directory and run there, rather than from your project directory. You can then use Guild's tool set to view, compare, and manage runs. Parameters and metrics are also integrated. E.g. you can run hyperparameter optimization without code change along this line: $ guild run dvc.yaml:train dropout=[0.1:0.4] --trials 20 --optimizer gp This will run the DvC train stage 20 times using different values for the dropout parameter (defined in a DvC tracked params file) using Bayesian optimization with Gaussian processes. If you're a Guild AI user, you can now define dependencies on DvC managed files using the dvcfile source type. This functionality is available for early evaluation. Feedback is most welcome. submitted by /u/gar1t [link] [comments]  ( 1 min )
    [D] Best statistical/ ML approach to combine numerically simulated outcome and unstructured data?
    Hello all! I have been assigned to a research project in collaboration with a manufacturing company who are looking to improve the accuracy of their model. Their research question is to improve their Finite Element Analysis model (FEA) accuracy with statistical techniques/Machine learning methods. What they are producing is high-end metal products and they are trying to improve the manufacturing process. Basically they have raw material with specific properties (strength, weights, etc...) that they are heating at very high temperature and pressure which gives then the final product. The problem is that during the heating phase the product exhibits some shrinkage. It is this shrinkage that they are trying to model. Their engineers have already developed an FEA model but it is not precise…  ( 2 min )
    Research on the use of image processing to optimize the analysis of lung disorders in chest radiographs [R]
    Hi! I am developing a research and I would like your help. The aim of my research is to analyze which data augmentation strategies are effective in radiographs of lung disorders. From these strategies, for example, it is possible to generate more useful images for training intelligent algorithms based on the state of the art of Deep Learning. In this way, we invite you to help us analyze the quality of the strategies proposed throughout the research. It is a quick survey, but extremely valuable for the continuity of my project. Any questions can ask me! ​ Research Link: https://data-augmentation-app-en.herokuapp.com submitted by /u/alyssonmach [link] [comments]  ( 1 min )
    [D] How do very large models generalize and do models just become data banks after a point.
    Hi, so I'm not experienced, but in my limited experience larger models tend to generalize very poorly even with big datasets; bias-variance tradeoff and all that. But I was wondering, do very large models like GPT-3 generalize out-of-domain? Or are they just a sophisticated "Lookup table"? submitted by /u/bloodmummy [link] [comments]  ( 1 min )
    [R] New research paper about ML and faces - A Design-Led Exploration of Material Interactions between Machine Learning and Digital Portraiture
    Hi, I recently published a research pictorial about Machine Learning and Digital Portraits, based on some experimentation with RunwayML. Curious to know what folks here think? ML isn't our main area of research (I'm a designer/documentary-maker), but there's links to the work of others (e.g. Jesse Benjamin) who are doing interesting work in this space. Links below if anyone's interested! Video: https://vimeo.com/656078159 Paper: http://designresearch.works/assets/papers/iasdr2021-faces-green-lindley-mason-coulton.pdf submitted by /u/cin3hack3r [link] [comments]  ( 1 min )
    [R] NEW Robustness Benchmark for 3D Point Cloud Recognition, ModelNet40-C. "Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions"
    Paper link: https://arxiv.org/abs/2201.12296 Github page: https://github.com/jiachens/ModelNet40-C Project page: https://sites.google.com/umich.edu/modelnet40c The authors create ModelNet40-C, a comprehensive dataset containing 185,100 point clouds from 40 classes, 15 corruption types, and 5 severity levels. The authors also provide a detailed taxonomy of the constructed corruption types. ModelNet40-C is a comprehensive dataset for benchmarking the corruption robustness of 3D point cloud classification. Sampled Visualization of ModelNet40-C from 15 Corruption Types. Their benchmark and analysis have unveiled that despite the remarkable progress of 3D point cloud recognition models, their robustness against common corruption still requires significant considerations. There is a ~3x performance gap between ModelNet40 and ModelNet40-C. Performance Gap between Models on ModelNet40 and ModelNet40-C. The confusion matrix averaged from six models also cross-validates that ModelNet40-C still preserve the semantics of the original point clouds. Confusion Matrix of Model Performance on ModelNet40-C. The authors also conducted in-depth analysis and summarized many useful insights. For example, they show that PointNet achieves strong performance on density corruptions, but fails on transformation corruptions due to its global feature-focusing design. Besides, Transformer-based design is more robust against corruption than others with proper training recipes. Model Performance on ModelNet40-C with Data Augmentation Strategies. Another Sampled Visualization of ModelNet40-C. submitted by /u/jiachens [link] [comments]  ( 1 min )
    [D] "Censorship" in offline reinforcement learning?
    I've got what may be a strange question about offline reinforcement learning - not an area I'm super familiar with. I'm working on a problem where each "trace" in my dataset (i.e., a sequence of state-actions for a given patient) can either terminate (e.g., if the patient dies) or cease being observed (i.e., we don't know what happened next, but the patient did not actually die). Is there any work on how to treat this problem when using value-based learning (like deep Q-learning approaches)? Is there risk of poor estimation if not every "trace" terminates? submitted by /u/Neither_Bee3128 [link] [comments]  ( 1 min )
    [N] Gym now has a documentation website
    5 months after taking over maintenance, I'm happy to announce that Gym now has a proper documentation website for the first time in it's life: https://www.gymlibrary.ml/ It's still somewhat in beta and active development is going on, and if you'd like to create PRs or issues the repo for the website is here: https://github.com/Farama-Foundation/gym-docs submitted by /u/jkterry1 [link] [comments]  ( 1 min )
    [R] PhD thesis: On Neural Differential Equations!
    arXiv link here TL;DR: I've written a "textbook" for neural differential equations (NDEs). Includes ordinary/stochastic/controlled/rough diffeqs, for learning physics, time series, generative problems etc. [+ Unpublished material on generalised adjoint methods, symbolic regression, universal approximation, ...] Hello everyone! I've been posting on this subreddit for a while now, mostly about either tech stacks (JAX vs PyTorch etc.) -- or about "neural differential equations", and more generally the places where physics meets machine learning. If you're interested, then I wanted to share that my doctoral thesis is now available online! Rather than the usual staple-papers-together approach, I decided to go a little further and write a 231-page kind-of-a-textbook. [If you're curious how t…  ( 8 min )
    [D] Is it fair to compare AlphaCode to human programmers?
    DeepMind's recent AlphaCode paper has triggered some bloated headlines that compare AI to human programmers. To be fair, DeepMind itself doesn't make such claims. Their blog post says it is "the first time an AI code generation system has reached a competitive level of performance in programming competitions." Surely, DeepMind has done a terrific job at putting together the right pieces that transform natural language problem descriptions into working programs. And it has been able to compete in Codeforces at an impressive level. But here's the catch: Does "competitive level of performance in programming competitions" necessarily mean that AlphaCode can be compared to a human programmer? I think there's a fundamental problem with such benchmarks and their presentation in the media. Codin…  ( 7 min )
    [P] Python library for Variational Autoencoder benchmarking
    Hi all, I am putting together some unified implementations of common (Variational) Autoencoder models so we can train them in the same setting and make reproducible comparisons. In particular, they can use flexible and custom neural network architectures. Github link: https://github.com/clementchadebec/benchmark_VAE submitted by /u/cchad-8 [link] [comments]  ( 1 min )
    [R] Paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. Shocking performance in text-to-image synthesis and open-domain tasks.
    Paper link: https://arxiv.org/pdf/2202.03052.pdf Github page: https://github.com/OFA-Sys/OFA The authors propose OFA, a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.) to a simple sequence-to-sequence learning framework. ​ Quite interesting results in text-to-image generation. Besides generating realistic images, it can generate things really counterfactual! ​ https://preview.redd.it/h8kp50pkilg81.png?width=1393&format=png&auto=webp&s=f5e5a289ea32ec3b628cff2eb18876f8e605f5f9 ​ https://preview.redd.it/k1r12z5uilg81.png?width=1395&format=png&auto=webp&s=2f87c61cb69daaaf54559dedfbe039f7dbe1f869 ​ Also, the authors show some cases of open-domain and unseen task performance. Really interesting to see that the model can deal with anime pictures. ​ https://preview.redd.it/xlfak3c2jlg81.png?width=1005&format=png&auto=webp&s=313fb481b94cd14240aa2f33758fca976fe1c9e3 ​ https://preview.redd.it/jykocvj3jlg81.png?width=940&format=png&auto=webp&s=2c81ccbb6e812e23f28749c292cd706b905a0762 submitted by /u/BeeBeeBooBoo-KZ [link] [comments]  ( 1 min )
    [D] Can Akakai information criteria (AIC) be used to find a Parsimonious deep learning Model?
    I am using LSTM with multiple inputs. I want to find a parsimonious model. Akaike's information criterion (AIC) compares the quality of a set of statistical models to each other. Though LSTM is not a statistical model, can AIC be used to find a parsimonious lstm model? submitted by /u/SnooPoems4937 [link] [comments]  ( 1 min )
    [P] Feature Selection for tabular data (Zoofs, python package)
    Hi everyone, elated to share the latest version of our library for tabular feature selection using a variety of nature-inspired wrapper algorithms. Please check it out and give us a star if you like it. We are actively looking for contributors. https://github.com/jaswinder9051998/zoofs submitted by /u/Jasssinghhira [link] [comments]  ( 1 min )
    [D] How to compare two curves representing model performance (y) over noise applied to the data (x)?
    Hi experts, ​ I have four curves representing my models (different architectures and/or training) performances . On the x there is the noise applied to the test images, from 0 to 100 sd. (see image) ​ ​ https://preview.redd.it/q2tcoevljig81.png?width=378&format=png&auto=webp&s=84066026db23e4e536ccc7628663c96b3c093b69 How can I statistically compare these curves in a meaningful way? what I came up so far are the Kolmogorov-Smirnov test, or a paired sample test, but I wonder if there is a more 'ML' way of doing this. submitted by /u/Chronosandkairos_ [link] [comments]  ( 2 min )
    [P] Are there better methods of recommendation that collaborative filtering? If so, where can I read up/learn about these methods?
    Hi, im trying to make a recommendation system that recommends anime to a person based off a database of user data, and im trying to work out the best way to go about it. Could someone give me a helping hand please? Thanks submitted by /u/tvvvs [link] [comments]  ( 1 min )
    [D] Gradient descent to minimize the sum of arc length distances on a unit sphere
    I have to write a batch gradient descent algorithm to find the point on the unit sphere (in any number of dimensions) which minimizes the sum of arc length distances to each of a set of given points also on the sphere. This is what I have so far : import math import numpy as np # https://math.stackexchange.com/questions/231221/great-arc-distance-between-two-points-on-a-unit-sphere # arc_length = radius* math.acos((math.cos(theta_1)*math.cos(theta_2) + math.sin(theta_1)*math.sin(theta_2)*math.cos(phi_1-phi_2))) phi = np.random.uniform(math.radians(0),math.radians(360),(50,1)) # generating random phi theta = np.random.uniform(math.radians(0),math.radians(180),(50,1)) # generating ramdom theta radius = 1 epochs = 10000 learning_rate = 0.001 arc_len = 0 theta_2 = 0 phi_2 = 0 for i in range (…  ( 1 min )
  • Open

    "Adversarially Trained Actor Critic for Offline Reinforcement Learning", Cheng et al 2022 {MS}
    submitted by /u/gwern [link] [comments]
    Sharing a tutorial on Monte Carlo control implementation and application to solve the Frozen Lake problem
    submitted by /u/lebrokholic [link] [comments]  ( 1 min )
    generative adversarial imitation learning algorithm applied in a time-series data
    Hi everyone, excuse me for a such question that could be silly ! my question is if we can apply the generative adversarial imitation learning algorithm on a time-series data to make predictions. so the objective will be to mimic the temporal behavior of an agent. submitted by /u/Adventurous-Car-2927 [link] [comments]  ( 1 min )
    Designing Societally Beneficial Reinforcement Learning Systems
    submitted by /u/robotphilanthropist [link] [comments]
    Gym now has a (beta) official documentation website!
    submitted by /u/jkterry1 [link] [comments]  ( 1 min )
    Multiple Actions: DDPG
    Hello, I have a question on handling multiple actions in deep reinforcement learning (I am working on DDPG). What I mean by multiple actions is having two continuous vectors to optimize; each has its own scale and boundaries. Currently, I am defining one action space in my environment (Box space), where it contains a vector (the first half has specific boundaries for action 1, and the second half is for action 2). Then, at the output of the actor, I rescale the first half of the action vector based on the limits of action 1 and rescale the second half based on action 2. Is this the right way to do that? Or am I missing something here? The results are not optimized, and I am trying to determine if the problem is from the formulation itself. submitted by /u/alicefaisal [link] [comments]  ( 1 min )
    How does one train a value network?
    The way I want to go about making a self play agent to play simple board games like Connect 4 is to make a tree of all the possible actions the agent can make at a specific state and use a value network to calculate how good each move is and then choose the branch of the tree that has the highest sum of values but how do I train such a network? submitted by /u/Bobingstern [link] [comments]  ( 2 min )
    How to "fit" the output of the Critic to the dimension of the reward?
    In the A2C algorithm the advantage function is the driving factor. Just a short reminder, it is calculated like: A = R + factor * V(s_t+1) - V(s_t) Now, for this to work the dimension of the Reward R and the Output of the Critic V(s_t) must match, is there any recommended way of achieving this and maybe even doing the reverse, rescaling the reward so it matches V(s_t) ? submitted by /u/luzariuSsuckSs [link] [comments]  ( 2 min )
    Can you do the following when bootstrapping?
    Lets say i do n-stpe bootstrapping with n=5. Then i would calculate the discounted reward every 5 steps and keep the first and the last (that is to say the (first+5)th state). And all states and rewards inbetween are thrown away. Now is it also possible to perform the same on the second state, the third state and so on aswell? So to bootstrap from the 2nd to the 6th state for example even though i already 'used' the second state in a previous bootstrap. submitted by /u/dirtInfestor1 [link] [comments]  ( 1 min )
    How to modify the inclination angle of the pole in cartpole-v1 initially?
    I want to change the initial pole angle in cartpole environment before training starts. Then I want to take it to 0 degrees of pole inclination angle and keep it steady while maximizing the reward. submitted by /u/abhi_ranjan_ [link] [comments]  ( 1 min )
    AlphaZero for Knapsacks?
    Heyyo. I’ve been trying to use RL to solve the knapsack problem as the simplest case for something “grand” that I’m building towards (a system to automatically play daily fantasy sports). I have an integer program that can solve these problems optimally and I’m comparing solutions. I’ve found that my RL bot will reliably learn how to grab the best items from 20 possible items, but after that, performance drops dramatically. I need to be able to handle ~300 items at a time. Maybe I’m not tuning the algorithms correctly? Is it weird to try AlphaGo for this problem? The basic formulation that I’m considering is to give two bots the same items and the winner is the one that picks the best knapsack (most value while under the weight limit or the least overweight). I’m not really sure what to expect, but I wanted a sanity check before I spend a bunch of time on this haha. Thanks for reading! submitted by /u/MaceGrim [link] [comments]  ( 1 min )
  • Open

    Why Data Management is Today’s Most Important Business Discipline
    Why do I think Data Management is the most important business discipline for the 21st century (and why my “Big Data MBA” book and course are more relevant than ever)?  Here is my self-serving logic: Artificial Intelligence (AI) is the most powerful business discipline of our generation (with AI’s ability to continuously learn and adapt… Read More »Why Data Management is Today’s Most Important Business Discipline The post Why Data Management is Today’s Most Important Business Discipline appeared first on Data Science Central.  ( 5 min )
    Job trends in the tech world for 2022
    By 2025 new technologies will have destroyed 85 million jobs. However, they also will have created 97 million new ones instead, according to the World Economic Forum report. Thus, there is naturally a high demand for IT specialists in the labor market. Information technology has grown significantly, and now every industry needs them. Why IT… Read More »Job trends in the tech world for 2022 The post Job trends in the tech world for 2022 appeared first on Data Science Central.  ( 4 min )
  • Open

    Lyapunov exponents
    Chaotic systems are unpredictable. Or rather chaotic systems are not deterministically predictable in the long run. You can make predictions if you weaken one of these requirements. You can make deterministic predictions in the short run, or statistical predictions in the long run. Lyapunov exponents are a way to measure how quickly the short run […] Lyapunov exponents first appeared on John D. Cook.  ( 2 min )
  • Open

    Robot See, Robot Do
    Posted by Kevin Zakka, Student Researcher and Andy Zeng, Research Scientist, Robotics at Google People learn to do things by watching others — from mimicking new dance moves, to watching YouTube cooking videos. We’d like robots to do the same, i.e., to learn new skills by watching people do things during training. Today, however, the predominant paradigm for teaching robots is to remote control them using specialized hardware for teleoperation and then train them to imitate pre-recorded demonstrations. This limits both who can provide the demonstrations (programmers & roboticists) and where they can be provided (lab settings). If robots could instead self-learn new tasks by watching humans, this capability could allow them to be deployed in more unstructured settings like the home, and ma…  ( 9 min )
    Unlocking the Full Potential of Datacenter ML Accelerators with Platform-Aware Neural Architecture Search
    Posted by Sheng Li, Staff Software Engineer and Norman P. Jouppi, Google Fellow, Google Research Continuing advances in the design and implementation of datacenter (DC) accelerators for machine learning (ML), such as TPUs and GPUs, have been critical for powering modern ML models and applications at scale. These improved accelerators exhibit peak performance (e.g., FLOPs) that is orders of magnitude better than traditional computing systems. However, there is a fast-widening gap between the available peak performance offered by state-of-the-art hardware and the actual achieved performance when ML models run on that hardware. One approach to address this gap is to design hardware-specific ML models that optimize both performance (e.g., throughput and latency) and model quality. Recen…  ( 7 min )
  • Open

    Implement MLOps using AWS pre-trained AI Services with AWS Organizations
    The AWS Machine Learning Operations (MLOps) framework is an iterative and repetitive process for evolving AI models over time. Like DevOps, practitioners gain efficiencies promoting their artifacts through various environments (such as quality assurance, integration, and production) for quality control. In parallel, customers rapidly adopt multi-account strategies through AWS Organizations and AWS Control Tower to […]  ( 9 min )
  • Open

    Burgers, Fries and a Side of AI: Startup Offers Taste of Drive-Thru Convenience
    Eating into open hours and menus, a labor shortage has gobbled up fast-food services employees, but some restaurants are trying out a new staff member to bring back the drive-thru good times: AI. Toronto startup HuEx is in pilot tests with a conversational AI assistant for drive-thrus to help support service at several popular Canadian Read article > The post Burgers, Fries and a Side of AI: Startup Offers Taste of Drive-Thru Convenience appeared first on The Official NVIDIA Blog.  ( 2 min )
    Meet the Omnivore: Developer Sleighs Complex Manufacturing Workflows With Digital Twin of Santa’s Workshop
    Don’t be fooled by the candy canes, hot cocoa and CEO’s jolly demeanor. Santa’s workshop is the very model of a 21st-century enterprise: pioneering mass customization and perfecting a worldwide distribution system able to meet almost bottomless global demand. The post Meet the Omnivore: Developer Sleighs Complex Manufacturing Workflows With Digital Twin of Santa’s Workshop appeared first on The Official NVIDIA Blog.  ( 4 min )
  • Open

    Learning Inception Attention for Image Synthesis and Image Recognition. (arXiv:2112.14804v2 [cs.CV] UPDATED)
    Image synthesis and image recognition have witnessed remarkable progress, but often at the expense of computationally expensive training and inference. Learning lightweight yet expressive deep model has emerged as an important and interesting direction. Inspired by the well-known split-transform-aggregate design heuristic in the Inception building block, this paper proposes a Skip-Layer Inception Module (SLIM) that facilitates efficient learning of image synthesis models, and a same-layer variant (dubbed as SLIM too) as a stronger alternative to the well-known ResNeXts for image recognition. In SLIM, the input feature map is first split into a number of groups (e.g., 4).Each group is then transformed to a latent style vector(via channel-wise attention) and a latent spatial mask (via spatial attention). The learned latent masks and latent style vectors are aggregated to modulate the target feature map. For generative learning, SLIM is built on a recently proposed lightweight Generative Adversarial Networks (i.e., FastGANs) which present a skip-layer excitation(SLE) module. For few-shot image synthesis tasks, the proposed SLIM achieves better performance than the SLE work and other related methods. For one-shot image synthesis tasks, it shows stronger capability of preserving images structures than prior arts such as the SinGANs. For image classification tasks, the proposed SLIM is used as a drop-in replacement for convolution layers in ResNets (resulting in ResNeXt-like models) and achieves better accuracy in theImageNet-1000 dataset, with significantly smaller model complexity  ( 2 min )
    Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning. (arXiv:2111.08066v2 [cs.LG] UPDATED)
    Offline reinforcement learning-learning a policy from a batch of data-is known to be hard for general MDPs. In this work, we explore a restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact a part of the state (an endogenous component) with limited impact on the remaining part of the state (an exogenous component). We propose an algorithm that exploits the AIR property, and provide theoretical guarantees for the outputted policy. Finally, we demonstrate that our algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in simulated and real world environments where the regularity holds.  ( 2 min )
    Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Matters. (arXiv:2110.04156v2 [cs.LG] UPDATED)
    In this work, we argue for the importance of an online evaluation budget for a reliable comparison of deep offline RL algorithms. First, we delineate that the online evaluation budget is problem-dependent, where some problems allow for less but others for more. And second, we demonstrate that the preference between algorithms is budget-dependent across a diverse range of decision-making domains such as Robotics, Finance, and Energy Management. Following the points above, we suggest reporting the performance of deep offline RL algorithms under varying online evaluation budgets. To facilitate this, we propose to use a reporting tool from the NLP field, Expected Validation Performance. This technique makes it possible to reliably estimate expected maximum performance under different budgets while not requiring any additional computation beyond hyperparameter search. By employing this tool, we also show that Behavioral Cloning is often more favorable to offline RL algorithms when working within a limited budget.  ( 2 min )
    Accelerated Proximal Alternating Gradient-Descent-Ascent for Nonconvex Minimax Machine Learning. (arXiv:2112.11663v4 [cs.LG] UPDATED)
    Alternating gradient-descent-ascent (AltGDA) is an optimization algorithm that has been widely used for model training in various machine learning applications, which aim to solve a nonconvex minimax optimization problem. However, the existing studies show that it suffers from a high computation complexity in nonconvex minimax optimization. In this paper, we develop a single-loop and fast AltGDA-type algorithm that leverages proximal gradient updates and momentum acceleration to solve regularized nonconvex minimax optimization problems. By identifying the intrinsic Lyapunov function of this algorithm, we prove that it converges to a critical point of the nonconvex minimax optimization problem and achieves a computation complexity $\mathcal{O}(\kappa^{1.5}\epsilon^{-2})$, where $\epsilon$ is the desired level of accuracy and $\kappa$ is the problem's condition number. Such a computation complexity improves the state-of-the-art complexities of single-loop GDA and AltGDA algorithms (see the summary of comparison in Table 1). We demonstrate the effectiveness of our algorithm via an experiment on adversarial deep learning.  ( 2 min )
    Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization. (arXiv:2006.08505v4 [cs.AI] UPDATED)
    A fascinating aspect of nature lies in its ability to produce a large and diverse collection of organisms that are all high-performing in their niche. By contrast, most AI algorithms focus on finding a single efficient solution to a given problem. Aiming for diversity in addition to performance is a convenient way to deal with the exploration-exploitation trade-off that plays a central role in learning. It also allows for increased robustness when the returned collection contains several working solutions to the considered problem, making it well-suited for real applications such as robotics. Quality-Diversity (QD) methods are evolutionary algorithms designed for this purpose. This paper proposes a novel algorithm, QDPG, which combines the strength of Policy Gradient algorithms and Quality Diversity approaches to produce a collection of diverse and high-performing neural policies in continuous control environments. The main contribution of this work is the introduction of a Diversity Policy Gradient (DPG) that exploits information at the time-step level to drive policies towards more diversity in a sample-efficient manner. Specifically, QDPG selects neural controllers from a MAP-Elites grid and uses two gradient-based mutation operators to improve both quality and diversity. Our results demonstrate that QDPG is significantly more sample-efficient than its evolutionary competitors.  ( 2 min )
    Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation. (arXiv:2110.10461v2 [cs.LG] UPDATED)
    Machine learning training methods depend plentifully and intricately on hyperparameters, motivating automated strategies for their optimisation. Many existing algorithms restart training for each new hyperparameter choice, at considerable computational cost. Some hypergradient-based one-pass methods exist, but these either cannot be applied to arbitrary optimiser hyperparameters (such as learning rates and momenta) or take several times longer to train than their base models. We extend these existing methods to develop an approximate hypergradient-based hyperparameter optimiser which is applicable to any continuous hyperparameter appearing in a differentiable model weight update, yet requires only one training episode, with no restarts. We also provide a motivating argument for convergence to the true hypergradient, and perform tractable gradient-based optimisation of independent learning rates for each model parameter. Our method performs competitively from varied random hyperparameter initialisations on several UCI datasets and Fashion-MNIST (using a one-layer MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), in time only 2-3x greater than vanilla training.  ( 2 min )
    Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training. (arXiv:2110.04267v2 [cs.LG] UPDATED)
    Transformer-based architectures have been the subject of research aimed at understanding their overparameterization and the non-uniform importance of their layers. Applying these approaches to Automatic Speech Recognition, we demonstrate that the state-of-the-art Conformer models generally have multiple ambient layers. We study the stability of these layers across runs and model sizes, propose that group normalization may be used without disrupting their formation, and examine their correlation with model weight updates in each layer. Finally, we apply these findings to Federated Learning in order to improve the training procedure, by targeting Federated Dropout to layers by importance. This allows us to reduce the model size optimized by clients without quality degradation, and shows potential for future exploration.  ( 2 min )
    Whole Brain Vessel Graphs: A Dataset and Benchmark for Graph Learning and Neuroscience (VesselGraph). (arXiv:2108.13233v2 [cs.LG] UPDATED)
    Biological neural networks define the brain function and intelligence of humans and other mammals, and form ultra-large, spatial, structured graphs. Their neuronal organization is closely interconnected with the spatial organization of the brain's microvasculature, which supplies oxygen to the neurons and builds a complementary spatial graph. This vasculature (or the vessel structure) plays an important role in neuroscience; for example, the organization of (and changes to) vessel structure can represent early signs of various pathologies, e.g. Alzheimer's disease or stroke. Recently, advances in tissue clearing have enabled whole brain imaging and segmentation of the entirety of the mouse brain's vasculature. Building on these advances in imaging, we are presenting an extendable dataset of whole-brain vessel graphs based on specific imaging protocols. Specifically, we extract vascular graphs using a refined graph extraction scheme leveraging the volume rendering engine Voreen and provide them in an accessible and adaptable form through the OGB and PyTorch Geometric dataloaders. Moreover, we benchmark numerous state-of-the-art graph learning algorithms on the biologically relevant tasks of vessel prediction and vessel classification using the introduced vessel graph dataset. Our work paves a path towards advancing graph learning research into the field of neuroscience. Complementarily, the presented dataset raises challenging graph learning research questions for the machine learning community, in terms of incorporating biological priors into learning algorithms, or in scaling these algorithms to handle sparse,spatial graphs with millions of nodes and edges. All datasets and code are available for download at https://github.com/jocpae/VesselGraph .  ( 2 min )
    Learning from Small Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v3 [cs.LG] UPDATED)
    Motivated by the problem of learning with small sample size, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have neither previously been experimentally tested for their performance nor studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.  ( 2 min )
    A Survey on Deep Reinforcement Learning for Data Processing and Analytics. (arXiv:2108.04526v3 [cs.LG] UPDATED)
    Data processing and analytics are fundamental and pervasive. Algorithms play a vital role in data processing and analytics where many algorithm designs have incorporated heuristics and general rules from human knowledge and experience to improve their effectiveness. Recently, reinforcement learning, deep reinforcement learning (DRL) in particular, is increasingly explored and exploited in many areas because it can learn better strategies in complicated environments it is interacting with than statically designed algorithms. Motivated by this trend, we provide a comprehensive review of recent works focusing on utilizing DRL to improve data processing and analytics. First, we present an introduction to key concepts, theories, and methods in DRL. Next, we discuss DRL deployment on database systems, facilitating data processing and analytics in various aspects, including data organization, scheduling, tuning, and indexing. Then, we survey the application of DRL in data processing and analytics, ranging from data preparation, natural language processing to healthcare, fintech, etc. Finally, we discuss important open challenges and future research directions of using DRL in data processing and analytics.  ( 2 min )
    Policy Optimization for Constrained MDPs with Provable Fast Global Convergence. (arXiv:2111.00552v2 [cs.LG] UPDATED)
    We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate for both the optimality gap and the constraint violation. We propose a new algorithm called policy mirror descent-primal dual (PMD-PD) algorithm that can provably achieve a faster $\mathcal{O}(\log(T)/T)$ convergence rate for both the optimality gap and the constraint violation. For the primal (policy) update, the PMD-PD algorithm utilizes a modified value function and performs natural policy gradient steps, which is equivalent to a mirror descent step with appropriate regularization. For the dual update, the PMD-PD algorithm uses modified Lagrange multipliers to ensure a faster convergence rate. We also present two extensions of this approach to the settings with zero constraint violation and sample-based estimation. Experimental results demonstrate the faster convergence rate and the better performance of the PMD-PD algorithm compared with existing policy gradient-based algorithms.  ( 2 min )
    Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning. (arXiv:2201.09765v2 [cs.LG] UPDATED)
    Standard model-free reinforcement learning algorithms optimize a policy that generates the action to be taken in the current time step in order to maximize expected future return. While flexible, it faces difficulties arising from the inefficient exploration due to its single step nature. In this work, we present Generative Planning method (GPM), which can generate actions not only for the current step, but also for a number of future steps (thus termed as generative planning). This brings several benefits to GPM. Firstly, since GPM is trained by maximizing value, the plans generated from it can be regarded as intentional action sequences for reaching high value regions. GPM can therefore leverage its generated multi-step plans for temporally coordinated exploration towards high value regions, which is potentially more effective than a sequence of actions generated by perturbing each action at single step level, whose consistent movement decays exponentially with the number of exploration steps. Secondly, starting from a crude initial plan generator, GPM can refine it to be adaptive to the task, which, in return, benefits future explorations. This is potentially more effective than commonly used action-repeat strategy, which is non-adaptive in its form of plans. Additionally, since the multi-step plan can be interpreted as the intent of the agent from now to a span of time period into the future, it offers a more informative and intuitive signal for interpretation. Experiments are conducted on several benchmark environments and the results demonstrated its effectiveness compared with several baseline methods.  ( 2 min )
    Generalized Decision Transformer for Offline Hindsight Information Matching. (arXiv:2111.10364v3 [cs.LG] UPDATED)
    How to extract as much learning signal from each trajectory data has been a key problem in reinforcement learning (RL), where sample inefficiency has posed serious challenges for practical applications. Recent works have shown that using expressive policy function approximators and conditioning on future trajectory information -- such as future states in hindsight experience replay or returns-to-go in Decision Transformer (DT) -- enables efficient learning of multi-task policies, where at times online RL is fully replaced by offline behavioral cloning, e.g. sequence modeling. We demonstrate that all these approaches are doing hindsight information matching (HIM) -- training policies that can output the rest of trajectory that matches some statistics of future state information. We present Generalized Decision Transformer (GDT) for solving any HIM problem, and show how different choices for the feature function and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of the future. For evaluating CDT and BDT, we define offline multi-task state-marginal matching (SMM) and imitation learning (IL) as two generic HIM problems, propose a Wasserstein distance loss as a metric for both, and empirically study them on MuJoCo continuous control benchmarks. CDT, which simply replaces anti-causal summation with anti-causal binning in DT, enables the first effective offline multi-task SMM algorithm that generalizes well to unseen and even synthetic multi-modal state-feature distributions. BDT, which uses an anti-causal second transformer as the aggregator, can learn to model any statistics of the future and outperforms DT variants in offline multi-task IL. Our generalized formulations from HIM and GDT greatly expand the role of powerful sequence modeling architectures in modern RL.  ( 3 min )
    Environmental Sound Extraction Using Onomatopoeia. (arXiv:2112.00209v3 [cs.SD] UPDATED)
    Onomatopoeia, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of sound such as duration, pitch, and timbre. We propose an environmental-sound-extraction method using onomatopoeia to specify the target sound to be extracted. With this method, we estimate a time-frequency mask from an input mixture spectrogram and onomatopoeia by using U-Net architecture then extract the corresponding target sound by masking the spectrogram. Experimental results indicate that the proposed method can extract only the target sound corresponding to onomatopoeia and performs better than conventional methods that use sound-event classes to specify the target sound.  ( 2 min )
    GhostShiftAddNet: More Features from Energy-Efficient Operations. (arXiv:2109.09495v3 [cs.LG] UPDATED)
    Deep convolutional neural networks (CNNs) are computationally and memory intensive. In CNNs, intensive multiplication can have resource implications that may challenge the ability for effective deployment of inference on resource-constrained edge devices. This paper proposes GhostShiftAddNet, where the motivation is to implement a hardware-efficient deep network: a multiplication-free CNN with fewer redundant features. We introduce a new bottleneck block, GhostSA, that converts all multiplications in the block to cheap operations. The bottleneck uses an appropriate number of bit-shift filters to process intrinsic feature maps, then applies a series of transformations that consist of bit-wise shifts with addition operations to generate more feature maps that fully learn to capture information underlying intrinsic features. We schedule the number of bit-shift and addition operations for different hardware platforms. We conduct extensive experiments and ablation studies with desktop and embedded (Jetson Nano) devices for implementation and measurements. We demonstrate the proposed GhostSA block can replace bottleneck blocks in the backbone of state-of-the-art networks architectures and gives improved performance on image classification benchmarks. Further, our GhostShiftAddNet can achieve higher classification accuracy with fewer FLOPs and parameters (reduced by up to 3x) than GhostNet. When compared to GhostNet, inference latency on the Jetson Nano is improved by 1.3x and 2x on the GPU and CPU respectively. Code is available open-source on \url{https://github.com/JIABI/GhostShiftAddNet}.  ( 2 min )
    Spectroscopy Approaches for Food Safety Applications: Improving Data Efficiency Using Active Learning and Semi-Supervised Learning. (arXiv:2110.03765v4 [cs.LG] UPDATED)
    The past decade witnesses a rapid development in the measurement and monitoring technologies for food science. Among these technologies, spectroscopy has been widely used for the analysis of food quality, safety, and nutritional properties. Due to the complexity of food systems and the lack of comprehensive predictive models, rapid and simple measurements to predict complex properties in food systems are largely missing. Machine Learning (ML) has shown great potential to improve classification and prediction of these properties. However, the barriers to collect large datasets for ML applications still persists. In this paper, we explore different approaches of data annotation and model training to improve data efficiency for ML applications. Specifically, we leverage Active Learning (AL) and Semi-Supervised Learning (SSL) and investigate four approaches: baseline passive learning, AL, SSL, and a hybrid of AL and SSL. To evaluate these approaches, we collect two spectroscopy datasets: predicting plasma dosage and detecting foodborne pathogen. Our experimental results show that, compared to the de facto passive learning approach, AL and SSL methods reduce the number of labeled samples by 50% and 25% for each ML application, respectively.  ( 2 min )
    Learn to Predict Equilibria via Fixed Point Networks. (arXiv:2106.00906v2 [cs.LG] UPDATED)
    Systems of competing agents can often be modeled as games. Assuming rationality, the most likely outcomes are given by an equilibrium, e.g. a Nash equilibrium. In many practical settings, games are influenced by context, i.e. additional data beyond the control of any agent (e.g. weather for traffic and fiscal policy for market economies). Often only game equilibria are observed, while the players' true cost functions are unknown. This work introduces Nash Fixed Point Networks (N-FPNs), a class of implicit neural networks that learn to predict the equilibria given only the context. The N-FPN design fuses data-driven modeling with provided constraints on the actions available to agents. N-FPNs are compatible with the recently introduced Jacobian-Free Backpropagation technique for training implicit networks, making them significantly faster to train than prior models. N-FPNs can exploit novel constraint decoupling to avoid costly projections. Provided numerical examples show the efficacy of N-FPNs on atomic and non-atomic games (e.g. traffic routing)  ( 2 min )
    BernNet: Learning Arbitrary Graph Spectral Filters via Bernstein Approximation. (arXiv:2106.10994v3 [cs.LG] UPDATED)
    Many representative graph neural networks, e.g., GPR-GNN and ChebNet, approximate graph convolutions with graph spectral filters. However, existing work either applies predefined filter weights or learns them without necessary constraints, which may lead to oversimplified or ill-posed filters. To overcome these issues, we propose BernNet, a novel graph neural network with theoretical support that provides a simple but effective scheme for designing and learning arbitrary graph spectral filters. In particular, for any filter over the normalized Laplacian spectrum of a graph, our BernNet estimates it by an order-$K$ Bernstein polynomial approximation and designs its spectral property by setting the coefficients of the Bernstein basis. Moreover, we can learn the coefficients (and the corresponding filter weights) based on observed graphs and their associated signals and thus achieve the BernNet specialized for the data. Our experiments demonstrate that BernNet can learn arbitrary spectral filters, including complicated band-rejection and comb filters, and it achieves superior performance in real-world graph modeling tasks. Code is available at https://github.com/ivam-he/BernNet.  ( 2 min )
    CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals. (arXiv:2106.13695v3 [cs.LG] UPDATED)
    Data augmentation is a key element of deep learning pipelines, as it informs the network during training about transformations of the input data that keep the label unchanged. Manually finding adequate augmentation methods and parameters for a given pipeline is however rapidly cumbersome. In particular, while intuition can guide this decision for images, the design and choice of augmentation policies remains unclear for more complex types of data, such as neuroscience signals. Besides, class-dependent augmentation strategies have been surprisingly unexplored in the literature, although it is quite intuitive: changing the color of a car image does not change the object class to be predicted, but doing the same to the picture of an orange does. This paper investigates gradient-based automatic data augmentation algorithms amenable to class-wise policies with exponentially larger search spaces. Motivated by supervised learning applications using EEG signals for which good augmentation policies are mostly unknown, we propose a new differentiable relaxation of the problem. In the class-agnostic setting, results show that our new relaxation leads to optimal performance with faster training than competing gradient-based methods, while also outperforming gradient-free methods in the class-wise setting. This work proposes also novel differentiable augmentation operations relevant for sleep stage classification.  ( 2 min )
    Shape-consistent Generative Adversarial Networks for multi-modal Medical segmentation maps. (arXiv:2201.09693v2 [eess.IV] UPDATED)
    Image translation across domains for unpaired datasets has gained interest and great improvement lately. In medical imaging, there are multiple imaging modalities, with very different characteristics. Our goal is to use cross-modality adaptation between CT and MRI whole cardiac scans for semantic segmentation. We present a segmentation network using synthesised cardiac volumes for extremely limited datasets. Our solution is based on a 3D cross-modality generative adversarial network to share information between modalities and generate synthesized data using unpaired datasets. Our network utilizes semantic segmentation to improve generator shape consistency, thus creating more realistic synthesised volumes to be used when re-training the segmentation network. We show that improved segmentation can be achieved on small datasets when using spatial augmentations to improve a generative adversarial network. These augmentations improve the generator capabilities, thus enhancing the performance of the Segmentor. Using only 16 CT and 16 MRI cardiovascular volumes, improved results are shown over other segmentation methods while using the suggested architecture.  ( 2 min )
    On Model Selection Consistency of Lasso for High-Dimensional Ising Models on Tree-like Graphs. (arXiv:2110.08500v2 [stat.ML] UPDATED)
    We consider the problem of high-dimensional Ising model selection using neighborhood-based least absolute shrinkage and selection operator (Lasso). It is rigorously proved that under some mild coherence conditions on the population covariance matrix of the Ising model, consistent model selection can be achieved with sample sizes $n=\Omega{(d^3\log{p})}$ for any tree-like graph in the paramagnetic phase, where $p$ is the number of variables and $d$ is the maximum node degree. The obtained sufficient conditions for consistent model selection with Lasso are the same in the scaling of the sample complexity as that of $\ell_1$-regularized logistic regression.  ( 2 min )
    ENERO: Efficient Real-Time WAN Routing Optimization with Deep Reinforcement Learning. (arXiv:2109.10883v2 [cs.NI] UPDATED)
    Wide Area Networks (WAN) are a key infrastructure in today's society. During the last years, WANs have seen a considerable increase in network's traffic and network applications, imposing new requirements on existing network technologies (e.g., low latency and high throughput). Consequently, Internet Service Providers (ISP) are under pressure to ensure the customer's Quality of Service and fulfill Service Level Agreements. Network operators leverage Traffic Engineering (TE) techniques to efficiently manage network's resources. However, WAN's traffic can drastically change during time and the connectivity can be affected due to external factors (e.g., link failures). Therefore, TE solutions must be able to adapt to dynamic scenarios in real-time. In this paper we propose Enero, an efficient real-time TE solution based on a two-stage optimization process. In the first one, Enero leverages Deep Reinforcement Learning (DRL) to optimize the routing configuration by generating a long-term TE strategy. To enable efficient operation over dynamic network scenarios (e.g., when link failures occur), we integrated a Graph Neural Network into the DRL agent. In the second stage, Enero uses a Local Search algorithm to improve DRL's solution without adding computational overhead to the optimization process. The experimental results indicate that Enero is able to operate in real-world dynamic network topologies in 4.5 seconds on average for topologies up to 100 edges.  ( 2 min )
    FRL: Federated Rank Learning. (arXiv:2110.04350v2 [cs.LG] UPDATED)
    Federated learning (FL) allows mutually untrusted clients to collaboratively train a common machine learning model without sharing their private/proprietary training data among each other. FL is unfortunately susceptible to poisoning by malicious clients who aim to hamper the accuracy of the commonly trained model through sending malicious model updates during FL's training process. We argue that the key factor to the success of poisoning attacks against existing FL systems is the large space of model updates available to the clients, allowing malicious clients to search for the most poisonous model updates, e.g., by solving an optimization problem. To address this, we propose Federated Rank Learning (FRL). FRL reduces the space of client updates from model parameter updates (a continuous space of float numbers) in standard FL to the space of parameter rankings (a discrete space of integer values). To be able to train the global model using parameter ranks (instead of parameter weights), FRL leverage ideas from recent supermasks training mechanisms. Specifically, FRL clients rank the parameters of a randomly initialized neural network (provided by the server) based on their local training data. The FRL server uses a voting mechanism to aggregate the parameter rankings submitted by clients in each training epoch to generate the global ranking of the next training epoch. Intuitively, our voting-based aggregation mechanism prevents poisoning clients from making significant adversarial modifications to the global model, as each client will have a single vote! We demonstrate the robustness of FRL to poisoning through analytical proofs and experimentation. We also show FRL's high communication efficiency. Our experiments demonstrate the superiority of FRL in real-world FL settings.  ( 2 min )
    Project CGX: Algorithmic and System Support for Scalable Deep Learning on a Budget. (arXiv:2111.08617v3 [cs.DC] UPDATED)
    The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient inter-GPU communication, in particular via bandwidth overprovisioning. This support comes at a price: there is an order of magnitude cost difference between "cloud-grade" servers with such support, relative to their "consumer-grade" counterparts, although server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we investigate whether the expensive hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for communication compression. We show that this framework is able to remove communication bottlenecks from consumer-grade multi-GPU systems, in the absence of hardware support: when training modern models and tasks to full accuracy, CGX provides self-speedups of 2-3X for an 8-GPU commodity node, enabling it to surpass the throughput of a much more expensive NVIDIA DGX-1 server. In the multi-node setting, CGX enables significant additional speedups by identifying and solving the novel adaptive compression problem, in which we can automatically set compression levels in a layer-wise fashion, balancing speedup and accuracy recovery.  ( 2 min )
    Simple yet Sharp Sensitivity Analysis for Unmeasured Confounding. (arXiv:2104.13020v5 [stat.ME] UPDATED)
    We present a method for assessing the sensitivity of the true causal effect to unmeasured confounding. The method requires the analyst to set two intuitive parameters. Otherwise, the method is assumption-free. The method returns an interval that contains the true causal effect, and whose bounds are arbitrarily sharp, i.e. practically attainable. We show experimentally that our bounds can be tighter than those obtained by the method of Ding and VanderWeele (2016a) which, moreover, requires to set one more parameter than our method. Finally, we extend our method to bound the natural direct and indirect effects when there are measured mediators and unmeasured exposure-outcome confounding.  ( 2 min )
    A Meta-Learning Control Algorithm with Provable Finite-Time Guarantees. (arXiv:2008.13265v6 [cs.LG] UPDATED)
    In this work we provide provable regret guarantees for an online meta-learning control algorithm in an iterative control setting, where in each iteration the system to be controlled is a linear deterministic system that is different and unknown, the cost for the controller in an iteration is a general additive cost function and the control input is required to be constrained, which if violated incurs an additional cost. We prove (i) that the algorithm achieves a regret for the controller cost and constraint violation that are $O(T^{3/4})$ for an episode of duration $T$ with respect to the best policy that satisfies the control input control constraints and (ii) that the average of the regret for the controller cost and constraint violation with respect to the same policy vary as $O((1+\log(N)/N)T^{3/4})$ with the number of iterations $N$, showing that the worst regret for the learning within an iteration continuously improves with experience of more iterations.  ( 2 min )
    Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation. (arXiv:2105.14829v2 [cs.RO] UPDATED)
    Despite the success of reinforcement learning methods, they have yet to have their breakthrough moment when applied to a broad range of robotic manipulation tasks. This is partly due to the fact that reinforcement learning algorithms are notoriously difficult and time consuming to train, which is exacerbated when training from images rather than full-state inputs. As humans perform manipulation tasks, our eyes closely monitor every step of the process with our gaze focusing sequentially on the objects being manipulated. With this in mind, we present our Attention-driven Robotic Manipulation (ARM) algorithm, which is a general manipulation algorithm that can be applied to a range of sparse-rewarded tasks, given only a small number of demonstrations. ARM splits the complex task of manipulation into a 3 stage pipeline: (1) a Q-attention agent extracts relevant pixel locations from RGB and point cloud inputs, (2) a next-best pose agent that accepts crops from the Q-attention agent and outputs poses, and (3) a control agent that takes the goal pose and outputs joint actions. We show that current learning algorithms fail on a range of RLBench tasks, whilst ARM is successful.  ( 2 min )
    Learned Optimizers for Analytic Continuation. (arXiv:2107.13265v2 [cs.LG] UPDATED)
    Traditional maximum entropy and sparsity-based algorithms for analytic continuation often suffer from the ill-posed kernel matrix or demand tremendous computation time for parameter tuning. Here we propose a neural network method by convex optimization and replace the ill-posed inverse problem by a sequence of well-conditioned surrogate problems. After training, the learned optimizers are able to give a solution of high quality with low time cost and achieve higher parameter efficiency than heuristic fully-connected networks. The output can also be used as a neural default model to improve the maximum entropy for better performance. Our methods may be easily extended to other high-dimensional inverse problems via large-scale pretraining.  ( 2 min )
    Management of Resource at the Network Edge for Federated Learning. (arXiv:2107.03428v2 [cs.NI] UPDATED)
    Federated learning has been explored as a promising solution for training at the edge, where end devices collaborate to train models without sharing data with other entities. Since the execution of these learning models occurs at the edge, where resources are limited, new solutions must be developed. In this paper, we describe the recent work on resource management at the edge, and explore the challenges and future directions to allow the execution of federated learning at the edge. Some of the problems of this management, such as discovery of resources, deployment, load balancing, migration, and energy efficiency will be discussed in the paper.  ( 2 min )
    Seeing Differently, Acting Similarly: Imitation Learning with Heterogeneous Observations. (arXiv:2106.09256v2 [cs.LG] UPDATED)
    In many real-world imitation learning tasks, the demonstrator and the learner have to act under different but full observation spaces. This situation generates significant obstacles for existing imitation learning approaches to work. Previous related works need to assume the coexistence of two observation spaces in the demonstrations or that all along the learning process. While in reality, the expert usually provides the demonstration with their observations only, and the observation coexistence will be limited due to the high cost. So in this work, we model the observation mismatch in the imitation learning problem with the above two challenges as a two-phase learning process, namely Heterogeneously Observable Imitation Learning (HOIL). We analyze the underlying learning issues with these challenges, i.e., the dynamics mismatch and the support mismatch, and further propose the Importance Weighting with REjection (IWRE) algorithm based on the techniques of importance-weighting and learning with rejection for querying to solve these issues across the observation spaces. Experimental results show that IWRE can successfully solve the difficult HOIL tasks, including the challenging task of transforming the vision-based demonstrations to random access memory (RAM)-based policies under the Atari domain.  ( 2 min )
    EdiBERT, a generative model for image editing. (arXiv:2111.15264v2 [cs.CV] UPDATED)
    Advances in computer vision are pushing the limits of im-age manipulation, with generative models sampling detailed images on various tasks. However, a specialized model is often developed and trained for each specific task, even though many image edition tasks share similarities. In denoising, inpainting, or image compositing, one always aims at generating a realistic image from a low-quality one. In this paper, we aim at making a step towards a unified approach for image editing. To do so, we propose EdiBERT, a bi-directional transformer trained in the discrete latent space built by a vector-quantized auto-encoder. We argue that such a bidirectional model is suited for image manipulation since any patch can be re-sampled conditionally to the whole image. Using this unique and straightforward training objective, we show that the resulting model matches state-of-the-art performances on a wide variety of tasks: image denoising, image completion, and image composition.
    Self-adaptive deep neural network: Numerical approximation to functions and PDEs. (arXiv:2109.02839v2 [math.NA] UPDATED)
    Designing an optimal deep neural network for a given task is important and challenging in many machine learning applications. To address this issue, we introduce a self-adaptive algorithm: the adaptive network enhancement (ANE) method, written as loops of the form train, estimate and enhance. Starting with a small two-layer neural network (NN), the step train is to solve the optimization problem at the current NN; the step estimate is to compute a posteriori estimator/indicators using the solution at the current NN; the step enhance is to add new neurons to the current NN. Novel network enhancement strategies based on the computed estimator/indicators are developed in this paper to determine how many new neurons and when a new layer should be added to the current NN. The ANE method provides a natural process for obtaining a good initialization in training the current NN; in addition, we introduce an advanced procedure on how to initialize newly added neurons for a better approximation. We demonstrate that the ANE method can automatically design a nearly minimal NN for learning functions exhibiting sharp transitional layers as well as discontinuous solutions of hyperbolic partial differential equations.
    Quantifying the Burden of Exploration and the Unfairness of Free Riding. (arXiv:1810.08743v5 [cs.LG] UPDATED)
    We consider the multi-armed bandit setting with a twist. Rather than having just one decision maker deciding which arm to pull in each round, we have $n$ different decision makers (agents). In the simple stochastic setting, we show that a "free-riding" agent observing another "self-reliant" agent can achieve just $O(1)$ regret, as opposed to the regret lower bound of $\Omega (\log t)$ when one decision maker is playing in isolation. This result holds whenever the self-reliant agent's strategy satisfies either one of two assumptions: (1) each arm is pulled at least $\gamma \ln t$ times in expectation for a constant $\gamma$ that we compute, or (2) the self-reliant agent achieves $o(t)$ realized regret with high probability. Both of these assumptions are satisfied by standard zero-regret algorithms. Under the second assumption, we further show that the free rider only needs to observe the number of times each arm is pulled by the self-reliant agent, and not the rewards realized. In the linear contextual setting, each arm has a distribution over parameter vectors, each agent has a context vector, and the reward realized when an agent pulls an arm is the inner product of that agent's context vector with a parameter vector sampled from the pulled arm's distribution. We show that the free rider can achieve $O(1)$ regret in this setting whenever the free rider's context is a small (in $L_2$-norm) linear combination of other agents' contexts and all other agents pull each arm $\Omega (\log t)$ times with high probability. Again, this condition on the self-reliant players is satisfied by standard zero-regret algorithms like UCB. We also prove a number of lower bounds.
    A Topological Data Analysis Based Classifier. (arXiv:2111.05214v3 [cs.LG] UPDATED)
    Topological Data Analysis (TDA) is an emergent field that aims to discover topological information hidden in a dataset. TDA tools have been commonly used to create filters and topological descriptors to improve Machine Learning (ML) methods. This paper proposes an algorithm that applies TDA directly to multi-class classification problems, without any further ML stage, showing advantages for imbalanced datasets. The proposed algorithm builds a filtered simplicial complex on the dataset. Persistent Homology (PH) is applied to guide the selection of a sub-complex where unlabeled points obtain the label with the majority of votes from labeled neighboring points. We select 8 datasets with different dimensions, degrees of class overlap and imbalanced samples per class. On average, the proposed TDABC method was better than KNN and weighted-KNN. It behaves competitively with Local SVM and Random Forest baseline classifiers in balanced datasets, and it outperforms all baseline methods classifying entangled and minority classes.
    A Robust Blockchain Readiness Index Model. (arXiv:2101.09162v2 [cs.CR] UPDATED)
    As the blockchain ecosystem gets more mature many businesses, investors, and entrepreneurs are seeking opportunities on working with blockchain systems and cryptocurrencies. A critical challenge for these actors is to identify the most suitable environment to start or evolve their businesses. In general, the question is to identify which countries are offering the most suitable conditions to host their blockchain-based activities and implement their innovative projects. The Blockchain Readiness Index (BRI) provides a numerical metric (referred to as the blockchain readiness score) in measuring the maturity/readiness levels of a country in adopting blockchain and cryptocurrencies. In doing so, BRI leverages on techniques from information retrieval to algorithmically derive an index ranking for a set of countries. The index considers a range of indicators organized under five pillars: Government Regulation, Research, Technology, Industry, and User Engagement. In this paper, we further extent BRI with the capability of deriving the index - at the country level - even in the presence of missing information for the indicators. In doing so, we are proposing two weighting schemes namely, linear and sigmoid weighting for refining the initial estimates for the indicator values. A classification framework was employed to evaluate the effectiveness of the developed techniques which yielded to a significant classification accuracy.
    MedGCN: Medication recommendation and lab test imputation via graph convolutional networks. (arXiv:1904.00326v3 [cs.LG] UPDATED)
    Laboratory testing and medication prescription are two of the most important routines in daily clinical practice. Developing an artificial intelligence system that can automatically make lab test imputations and medication recommendations can save costs on potentially redundant lab tests and inform physicians of a more effective prescription. We present an intelligent medical system (named MedGCN) that can automatically recommend the patients' medications based on their incomplete lab tests, and can even accurately estimate the lab values that have not been taken. In our system, we integrate the complex relations between multiple types of medical entities with their inherent features in a heterogeneous graph. Then we model the graph to learn a distributed representation for each entity in the graph based on graph convolutional networks (GCN). By the propagation of graph convolutional networks, the entity representations can incorporate multiple types of medical information that can benefit multiple medical tasks. Moreover, we introduce a cross regularization strategy to reduce overfitting for multi-task training by the interaction between the multiple tasks. In this study, we construct a graph to associate 4 types of medical entities, i.e., patients, encounters, lab tests, and medications, and applied a graph neural network to learn node embeddings for medication recommendation and lab test imputation. we validate our MedGCN model on two real-world datasets: NMEDW and MIMIC-III. The experimental results on both datasets demonstrate that our model can outperform the state-of-the-art in both tasks. We believe that our innovative system can provide a promising and reliable way to assist physicians to make medication prescriptions and to save costs on potentially redundant lab tests.
    Modeling Adversarial Noise for Adversarial Training. (arXiv:2109.09901v4 [cs.LG] UPDATED)
    Deep neural networks have been demonstrated to be vulnerable to adversarial noise, promoting the development of defense against adversarial attacks. Motivated by the fact that adversarial noise contains well-generalizing features and that the relationship between adversarial data and natural data can help infer natural data and make reliable predictions, in this paper, we study to model adversarial noise by learning the transition relationship between adversarial labels (i.e. the flipped labels used to generate adversarial data) and natural labels (i.e. the ground truth labels of the natural data). Specifically, we introduce an instance-dependent transition matrix to relate adversarial labels and natural labels, which can be seamlessly embedded with the target model (enabling us to model stronger adaptive adversarial noise). Empirical evaluations demonstrate that our method could effectively improve adversarial accuracy.
    Manifold Fitting in Ambient Space. (arXiv:1909.13492v2 [stat.ML] UPDATED)
    Modern sample points in many applications no longer comprise real vectors in a real vector space but sample points of much more complex structures, which may be represented as points in a space with a certain underlying geometric structure, namely a manifold. Manifold learning is an emerging field for learning the underlying structure. The study of manifold learning can be split into two main branches: dimension reduction and manifold fitting. With the aim of combining statistics and geometry, we address the problem of manifold fitting in the ambient space. Inspired by the relation between the eigenvalues of the Laplace-Beltrami operator and the geometry of a manifold, we aim to find a small set of points that preserve the geometry of the underlying manifold. From this relationship, we extend the idea of subsampling to sample points in high-dimensional space and employ the Moving Least Squares (MLS) approach to approximate the underlying manifold. We analyze the two core steps in our proposed method theoretically and also provide the bounds for the MLS approach. Our simulation results and theoretical analysis demonstrate the superiority of our method in estimating the underlying manifold.
    The FEDHC Bayesian network learning algorithm. (arXiv:2012.00113v5 [stat.ML] UPDATED)
    The paper proposes a new hybrid Bayesian network learning algorithm, termed Forward Early Dropping Hill Climbing (FEDHC), devised to work with either continuous or categorical variables. Specifically for the case of continuous data, a robust to outliers version of FEDHC, that can be adopted by other BN learning algorithms, is proposed. Further, the paper manifests that the only implementation of MMHC in the statistical software \textit{R}, is prohibitively expensive and a new implementation is offered. The FEDHC is tested via Monte Carlo simulations that distinctly show it is computationally efficient, and produces Bayesian networks of similar to, or of higher accuracy than MMHC and PCHC. Finally, an application of FEDHC, PCHC and MMHC algorithms to real data, from the field of economics, is demonstrated using the statistical software \textit{R}.
    Graph Convolution for Semi-Supervised Classification: Improved Linear Separability and Out-of-Distribution Generalization. (arXiv:2102.06966v4 [cs.LG] UPDATED)
    Recently there has been increased interest in semi-supervised classification in the presence of graphical information. A new class of learning models has emerged that relies, at its most basic level, on classifying the data after first applying a graph convolution. To understand the merits of this approach, we study the classification of a mixture of Gaussians, where the data corresponds to the node attributes of a stochastic block model. We show that graph convolution extends the regime in which the data is linearly separable by a factor of roughly $1/\sqrt{D}$, where $D$ is the expected degree of a node, as compared to the mixture model data on its own. Furthermore, we find that the linear classifier obtained by minimizing the cross-entropy loss after the graph convolution generalizes to out-of-distribution data where the unseen data can have different intra- and inter-class edge probabilities from the training data.
    Taming Sparsely Activated Transformer with Stochastic Experts. (arXiv:2110.04260v3 [cs.CL] UPDATED)
    Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to have outrageously large amounts of parameters without significant increase in computational cost. However, SAMs are reported to be parameter inefficient such that larger models do not always lead to better performance. While most on-going research focuses on improving SAMs models by exploring methods of routing inputs to experts, our analysis reveals that such research might not lead to the solution we expect, i.e., the commonly-used routing methods based on gating mechanisms do not work better than randomly routing inputs to experts. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts). Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference. THOR models are trained using a consistency regularized loss, where experts learn not only from training data but also from other experts as teachers, such that all the experts make consistent predictions. We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings. For example, in multilingual translation, THOR outperforms the Switch Transformer by 2 BLEU scores, and obtains the same BLEU score as that of a state-of-the-art MoE model that is 18 times larger. Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts.
    Fast Gradient Non-sign Methods. (arXiv:2110.12734v3 [cs.CV] UPDATED)
    Adversarial attacks make their success in DNNs, and among them, gradient-based algorithms become one of the mainstreams. Based on the linearity hypothesis, under $\ell_\infty$ constraint, $sign$ operation applied to the gradients is a good choice for generating perturbations. However, side-effects from such operation exist since it leads to the bias of direction between real gradients and perturbations. In other words, current methods contain a gap between real gradients and actual noises, which leads to biased and inefficient attacks. Therefore in this paper, based on the Taylor expansion, the bias is analyzed theoretically, and the correction of $sign$, i.e., Fast Gradient Non-sign Method (FGNM), is further proposed. Notably, FGNM is a general routine that seamlessly replaces the conventional $sign$ operation in gradient-based attacks with negligible extra computational cost. Extensive experiments demonstrate the effectiveness of our methods. Specifically, for untargeted black-box attacks, ours outperform them by 27.5% at most and 9.5% on average. For targeted attacks against defense models, it is 15.1% and 12.7%. Our anonymous code is publicly available at https://github.com/yaya-cheng/FGNM
    Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation. (arXiv:2111.12193v2 [cs.LG] UPDATED)
    Most set prediction models in deep learning use set-equivariant operations, but they actually operate on multisets. We show that set-equivariant functions cannot represent certain functions on multisets, so we introduce the more appropriate notion of multiset-equivariance. We identify that the existing Deep Set Prediction Network (DSPN) can be multiset-equivariant without being hindered by set-equivariance and improve it with approximate implicit differentiation, allowing for better optimization while being faster and saving memory. In a range of toy experiments, we show that the perspective of multiset-equivariance is beneficial and that our changes to DSPN achieve better results in most cases. On CLEVR object property prediction, we substantially improve over the state-of-the-art Slot Attention from 8% to 77% in one of the strictest evaluation metrics because of the benefits made possible by implicit differentiation.
    Learning with Neighbor Consistency for Noisy Labels. (arXiv:2202.02200v1 [cs.CV])
    Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models. However, collecting large datasets in a time- and cost-efficient manner often results in label noise. We present a method for learning from noisy labels that leverages similarities between training examples in feature space, encouraging the prediction of each example to be similar to its nearest neighbours. Compared to training algorithms that use multiple models or distinct stages, our approach takes the form of a simple, additional regularization term. It can be interpreted as an inductive version of the classical, transductive label propagation algorithm. We thoroughly evaluate our method on datasets evaluating both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) noise, and achieve competitive or state-of-the-art accuracies across all of them.
    Generalized vec trick for fast learning of pairwise kernel models. (arXiv:2009.01054v2 [cs.LG] UPDATED)
    Pairwise learning corresponds to the supervised learning setting where the goal is to make predictions for pairs of objects. Prominent applications include predicting drug-target or protein-protein interactions, or customer-product preferences. In this work, we present a comprehensive review of pairwise kernels, that have been proposed for incorporating prior knowledge about the relationship between the objects. Specifically, we consider the standard, symmetric and anti-symmetric Kronecker product kernels, metric-learning, Cartesian, ranking, as well as linear, polynomial and Gaussian kernels. Recently, a O(nm + nq) time generalized vec trick algorithm, where n, m, and q denote the number of pairs, drugs and targets, was introduced for training kernel methods with the Kronecker product kernel. This was a significant improvement over previous O(n^2) training methods, since in most real-world applications m,q << n. In this work we show how all the reviewed kernels can be expressed as sums of Kronecker products, allowing the use of generalized vec trick for speeding up their computation. In the experiments, we demonstrate how the introduced approach allows scaling pairwise kernels to much larger data sets than previously feasible, and provide an extensive comparison of the kernels on a number of biological interaction prediction tasks.
    Knowledge Cross-Distillation for Membership Privacy. (arXiv:2111.01363v3 [cs.CR] UPDATED)
    A membership inference attack (MIA) poses privacy risks for the training data of a machine learning model. With an MIA, an attacker guesses if the target data are a member of the training dataset. The state-of-the-art defense against MIAs, distillation for membership privacy (DMP), requires not only private data for protection but a large amount of unlabeled public data. However, in certain privacy-sensitive domains, such as medicine and finance, the availability of public data is not guaranteed. Moreover, a trivial method for generating public data by using generative adversarial networks significantly decreases the model accuracy, as reported by the authors of DMP. To overcome this problem, we propose a novel defense against MIAs that uses knowledge distillation without requiring public data. Our experiments show that the privacy protection and accuracy of our defense are comparable to those of DMP for the benchmark tabular datasets used in MIA research, Purchase100 and Texas100, and our defense has a much better privacy-utility trade-off than those of the existing defenses that also do not use public data for the image dataset CIFAR10.
    Functional Mixtures-of-Experts. (arXiv:2202.02249v1 [stat.ME])
    We consider the statistical analysis of heterogeneous data for clustering and prediction purposes, in situations where the observations include functions, typically time series. We extend the modeling with Mixtures-of-Experts (ME), as a framework of choice in modeling heterogeneity in data for prediction and clustering with vectorial observations, to this functional data analysis context. We first present a new family of functional ME (FME) models, in which the predictors are potentially noisy observations, from entire functions, and the data generating process of the pair predictor and the real response, is governed by a hidden discrete variable representing an unknown partition, leading to complex situations to which the standard ME framework is not adapted. Second, we provide sparse and interpretable functional representations of the FME models, thanks to Lasso-like regularizations, notably on the derivatives of the underlying functional parameters of the model, projected onto a set of continuous basis functions. We develop dedicated expectation--maximization algorithms for Lasso-like regularized maximum-likelihood parameter estimation strategies, to encourage sparse and interpretable solutions. The proposed FME models and the developed EM-Lasso algorithms are studied in simulated scenarios and in applications to two real data sets, and the obtained results demonstrate their performance in accurately capturing complex nonlinear relationships between the response and the functional predictor, and in clustering.
    Pre-Trained Neural Language Models for Automatic Mobile App User Feedback Answer Generation. (arXiv:2202.02294v1 [cs.CL])
    Studies show that developers' answers to the mobile app users' feedbacks on app stores can increase the apps' star rating. To help app developers generate answers that are related to the users' issues, recent studies develop models to generate the answers automatically. Aims: The app response generation models use deep neural networks and require training data. Pre-Trained neural language Models (PTM) used in Natural Language Processing (NLP) take advantage of the information they learned from a large corpora in an unsupervised manner, and can reduce the amount of required training data. In this paper, we evaluate PTMs to generate replies to the mobile app user feedbacks. Method: We train a Transformer model from scratch and fine-tune two PTMs to evaluate the generated responses, which are compared to RRGEN, a current app response model. We also evaluate the models with different portions of the training data. Results: The results on a large dataset evaluated by automatic metrics show that PTMs obtain lower scores than the baselines. However, our human evaluation confirms that PTMs can generate more relevant and meaningful responses to the posted feedbacks. Moreover, the performance of PTMs has less drop compared to other models when the amount of training data is reduced to 1/3. Conclusion: PTMs are useful in generating responses to app reviews and are more robust models to the amount of training data provided. However, the prediction time is 19X than RRGEN. This study can provide new avenues for research in adapting the PTMs for analyzing mobile app user feedbacks. Index Terms-mobile app user feedback analysis, neural pre-trained language models, automatic answer generation
    Pixle: a fast and effective black-box attack based on rearranging pixels. (arXiv:2202.02236v1 [cs.LG])
    Recent research has found that neural networks are vulnerable to several types of adversarial attacks, where the input samples are modified in such a way that the model produces a wrong prediction that misclassifies the adversarial sample. In this paper we focus on black-box adversarial attacks, that can be performed without knowing the inner structure of the attacked model, nor the training procedure, and we propose a novel attack that is capable of correctly attacking a high percentage of samples by rearranging a small number of pixels within the attacked image. We demonstrate that our attack works on a large number of datasets and models, that it requires a small number of iterations, and that the distance between the original sample and the adversarial one is negligible to the human eye.
    Graph-Coupled Oscillator Networks. (arXiv:2202.02296v1 [cs.LG])
    We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear forced and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.
    Polynomial convergence of iterations of certain random operators in Hilbert space. (arXiv:2202.02266v1 [math.FA])
    We study the convergence of random iterative sequence of a family of operators on infinite dimensional Hilbert spaces, which are inspired by the Stochastic Gradient Descent (SGD) algorithm in the case of the noiseless regression, as studied in [1]. We demonstrate that its polynomial convergence rate depends on the initial state, while the randomness plays a role only in the choice of the best constant factor and we close the gap between the upper and lower bounds.
    Rediscovering orbital mechanics with machine learning. (arXiv:2202.02306v1 [astro-ph.EP])
    We present an approach for using machine learning to automatically discover the governing equations and hidden properties of real physical systems from observations. We train a "graph neural network" to simulate the dynamics of our solar system's Sun, planets, and large moons from 30 years of trajectory data. We then use symbolic regression to discover an analytical expression for the force law implicitly learned by the neural network, which our results showed is equivalent to Newton's law of gravitation. The key assumptions that were required were translational and rotational equivariance, and Newton's second and third laws of motion. Our approach correctly discovered the form of the symbolic force law. Furthermore, our approach did not require any assumptions about the masses of planets and moons or physical constants. They, too, were accurately inferred through our methods. Though, of course, the classical law of gravitation has been known since Isaac Newton, our result serves as a validation that our method can discover unknown laws and hidden properties from observed data. More broadly this work represents a key step toward realizing the potential of machine learning for accelerating scientific discovery.
    Inherent Tradeoffs in Learning Fair Representations. (arXiv:1906.08386v6 [cs.LG] UPDATED)
    Real-world applications of machine learning tools in high-stakes domains are often regulated to be fair, in the sense that the predicted target should satisfy some quantitative notion of parity with respect to a protected attribute. However, the exact tradeoff between fairness and accuracy is not entirely clear, even for the basic paradigm of classification problems. In this paper, we characterize an inherent tradeoff between statistical parity and accuracy in the classification setting by providing a lower bound on the sum of group-wise errors of any fair classifiers. Our impossibility theorem could be interpreted as a certain uncertainty principle in fairness: if the base rates differ among groups, then any fair classifier satisfying statistical parity has to incur a large error on at least one of the groups. We further extend this result to give a lower bound on the joint error of any (approximately) fair classifiers, from the perspective of learning fair representations. To show that our lower bound is tight, assuming oracle access to Bayes (potentially unfair) classifiers, we also construct an algorithm that returns a randomized classifier that is both optimal (in terms of accuracy) and fair. Interestingly, when the protected attribute can take more than two values, an extension of this lower bound does not admit an analytic solution. Nevertheless, in this case, we show that the lower bound can be efficiently computed by solving a linear program, which we term as the TV-Barycenter problem, a barycenter problem under the TV-distance. On the upside, we prove that if the group-wise Bayes optimal classifiers are close, then learning fair representations leads to an alternative notion of fairness, known as the accuracy parity, which states that the error rates are close between groups. Finally, we also conduct experiments on real-world datasets to confirm our theoretical findings.
    SignSGD: Fault-Tolerance to Blind and Byzantine Adversaries. (arXiv:2202.02085v1 [cs.LG])
    Distributed learning has become a necessity for training ever-growing models. In a distributed setting, the task is shared among several devices. Typically, the learning process is monitored by a server. Also, some of the devices can be faulty, deliberately or not, and the usual distributed SGD algorithm cannot defend itself from omniscient adversaries. Therefore, we need to devise a fault-tolerant gradient descent algorithm. We based our article on the SignSGD algorithm, which relies on the sharing of gradients signs between the devices and the server. We provide a theoretical upper bound for the convergence rate of SignSGD to extend the results of the original paper. Our theoretical results estimate the convergence rate of SignSGD against a proportion of general adversaries, such as Byzantine adversaries. We implemented the algorithm along with Byzantine strategies in order to try to crush the learning process. Therefore, we provide empirical observations from our experiments to support our theory. Our code is available on GitHub and our experiments are reproducible by using the provided parameters.
    Supervised Contrastive Learning for Product Matching. (arXiv:2202.02098v1 [cs.LG])
    Contrastive learning has seen increasing success in the fields of computer vision and information retrieval in recent years. This poster is the first work that applies contrastive learning to the task of product matching in e-commerce using product offers from different e-shops. More specifically, we employ a supervised contrastive learning technique to pre-train a Transformer encoder which is afterwards fine-tuned for the matching problem using pair-wise training data. We further propose a source-aware sampling strategy which enables contrastive learning to be applied for use cases in which the training data does not contain product idenifiers. We show that applying supervised contrastive pre-training in combination with source-aware sampling significantly improves the state-of-the art performance on several widely used benchmark datasets: For Abt-Buy, we reach an F1 of 94.29 (+3.24 compared to the previous state-of-the-art), for Amazon-Google 79.28 (+ 3.7). For WDC Computers datasets, we reach improvements between +0.8 and +8.84 F1 depending on the training set size. Further experiments with data augmentation and self-supervised contrastive pre-training show, that the former can be helpful for smaller training sets while the latter leads to a significant decline in performance due to inherent label-noise. We thus conclude that contrastive pre-training has a high potential for product matching use cases in which explicit supervision is available.
    Deep End-to-end Causal Inference. (arXiv:2202.02195v1 [stat.ML])
    Causal inference is essential for data-driven decision making across domains such as business engagement, medical treatment or policy making. However, research on causal discovery and inference has evolved separately, and the combination of the two domains is not trivial. In this work, we develop Deep End-to-end Causal Inference (DECI), a single flow-based method that takes in observational data and can perform both causal discovery and inference, including conditional average treatment effect (CATE) estimation. We provide a theoretical guarantee that DECI can recover the ground truth causal graph under mild assumptions. In addition, our method can handle heterogeneous, real-world, mixed-type data with missing values, allowing for both continuous and discrete treatment decisions. Moreover, the design principle of our method can generalize beyond DECI, providing a general End-to-end Causal Inference (ECI) recipe, which enables different ECI frameworks to be built using existing methods. Our results show the superior performance of DECI when compared to relevant baselines for both causal discovery and (C)ATE estimation in over a thousand experiments on both synthetic datasets and other causal machine learning benchmark datasets.
    EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators. (arXiv:2202.02310v1 [cs.LG])
    Dilated and transposed convolutions are widely used in modern convolutional neural networks (CNNs). These kernels are used extensively during CNN training and inference of applications such as image segmentation and high-resolution image generation. Although these kernels have grown in popularity, they stress current compute systems due to their high memory intensity, exascale compute demands, and large energy consumption. We find that commonly-used low-power CNN inference accelerators based on spatial architectures are not optimized for both of these convolutional kernels. Dilated and transposed convolutions introduce significant zero padding when mapped to the underlying spatial architecture, significantly degrading performance and energy efficiency. Existing approaches that address this issue require significant design changes to the otherwise simple, efficient, and well-adopted architectures used to compute direct convolutions. To address this challenge, we propose EcoFlow, a new set of dataflows and mapping algorithms for dilated and transposed convolutions. These algorithms are tailored to execute efficiently on existing low-cost, small-scale spatial architectures and requires minimal changes to the network-on-chip of existing accelerators. EcoFlow eliminates zero padding through careful dataflow orchestration and data mapping tailored to the spatial architecture. EcoFlow enables flexible and high-performance transpose and dilated convolutions on architectures that are otherwise optimized for CNN inference. We evaluate the efficiency of EcoFlow on CNN training workloads and Generative Adversarial Network (GAN) training workloads. Experiments in our new cycle-accurate simulator show that EcoFlow 1) reduces end-to-end CNN training time between 7-85%, and 2) improves end-to-end GAN training performance between 29-42%, compared to state-of-the-art CNN inference accelerators.
    Stochastic smoothing of the top-K calibrated hinge loss for deep imbalanced classification. (arXiv:2202.02193v1 [stat.ML])
    In modern classification tasks, the number of labels is getting larger and larger, as is the size of the datasets encountered in practice. As the number of classes increases, class ambiguity and class imbalance become more and more problematic to achieve high top-1 accuracy. Meanwhile, Top-K metrics (metrics allowing K guesses) have become popular, especially for performance reporting. Yet, proposing top-K losses tailored for deep learning remains a challenge, both theoretically and practically. In this paper we introduce a stochastic top-K hinge loss inspired by recent developments on top-K calibrated losses. Our proposal is based on the smoothing of the top-K operator building on the flexible "perturbed optimizer" framework. We show that our loss function performs very well in the case of balanced datasets, while benefiting from a significantly lower computational time than the state-of-the-art top-K loss function. In addition, we propose a simple variant of our loss for the imbalanced case. Experiments on a heavy-tailed dataset show that our loss function significantly outperforms other baseline loss functions.
    Backpropagation Neural Tree. (arXiv:2202.02248v1 [cs.LG])
    We propose a novel algorithm called Backpropagation Neural Tree (BNeuralT), which is a stochastic computational dendritic tree. BNeuralT takes random repeated inputs through its leaves and imposes dendritic nonlinearities through its internal connections like a biological dendritic tree would do. Considering the dendritic-tree like plausible biological properties, BNeuralT is a single neuron neural tree model with its internal sub-trees resembling dendritic nonlinearities. BNeuralT algorithm produces an ad hoc neural tree which is trained using a stochastic gradient descent optimizer like gradient descent (GD), momentum GD, Nesterov accelerated GD, Adagrad, RMSprop, or Adam. BNeuralT training has two phases, each computed in a depth-first search manner: the forward pass computes neural tree's output in a post-order traversal, while the error backpropagation during the backward pass is performed recursively in a pre-order traversal. A BNeuralT model can be considered a minimal subset of a neural network (NN), meaning it is a "thinned" NN whose complexity is lower than an ordinary NN. Our algorithm produces high-performing and parsimonious models balancing the complexity with descriptive ability on a wide variety of machine learning problems: classification, regression, and pattern recognition.
    COIL: Constrained Optimization in Learned Latent Space -- Learning Representations for Valid Solutions. (arXiv:2202.02163v1 [cs.NE])
    Constrained optimization problems can be difficult because their search spaces have properties not conducive to search, e.g., multimodality, discontinuities, or deception. To address such difficulties, considerable research has been performed on creating novel evolutionary algorithms or specialized genetic operators. However, if the representation that defined the search space could be altered such that it only permitted valid solutions that satisfied the constraints, the task of finding the optimal would be made more feasible without any need for specialized optimization algorithms. We propose the use of a Variational Autoencoder to learn such representations. We present Constrained Optimization in Latent Space (COIL), which uses a VAE to generate a learned latent representation from a dataset comprising samples from the valid region of the search space according to a constraint, thus enabling the optimizer to find the objective in the new space defined by the learned representation. We investigate the value of this approach on different constraint types and for different numbers of variables. We show that, compared to an identical GA using a standard representation, COIL with its learned latent representation can satisfy constraints and find solutions with distance to objective up to two orders of magnitude closer.
    Dikaios: Privacy Auditing of Algorithmic Fairness via Attribute Inference Attacks. (arXiv:2202.02242v1 [cs.CR])
    Machine learning (ML) models have been deployed for high-stakes applications. Due to class imbalance in the sensitive attribute observed in the datasets, ML models are unfair on minority subgroups identified by a sensitive attribute, such as race and sex. In-processing fairness algorithms ensure model predictions are independent of sensitive attribute. Furthermore, ML models are vulnerable to attribute inference attacks where an adversary can identify the values of sensitive attribute by exploiting their distinguishable model predictions. Despite privacy and fairness being important pillars of trustworthy ML, the privacy risk introduced by fairness algorithms with respect to attribute leakage has not been studied. We identify attribute inference attacks as an effective measure for auditing blackbox fairness algorithms to enable model builder to account for privacy and fairness in the model design. We proposed Dikaios, a privacy auditing tool for fairness algorithms for model builders which leveraged a new effective attribute inference attack that account for the class imbalance in sensitive attributes through an adaptive prediction threshold. We evaluated Dikaios to perform a privacy audit of two in-processing fairness algorithms over five datasets. We show that our attribute inference attacks with adaptive prediction threshold significantly outperform prior attacks. We highlighted the limitations of in-processing fairness algorithms to ensure indistinguishable predictions across different values of sensitive attributes. Indeed, the attribute privacy risk of these in-processing fairness schemes is highly variable according to the proportion of the sensitive attributes in the dataset. This unpredictable effect of fairness mechanisms on the attribute privacy risk is an important limitation on their utilization which has to be accounted by the model builder.
    Linear Speedup in Personalized Collaborative Learning. (arXiv:2111.05968v2 [cs.LG] UPDATED)
    Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In this work, we formalize the personalized collaborative learning problem as a stochastic optimization of a task $0$ while given access to $N$ related but different tasks $1,\dots, N$. We give convergence guarantees for two algorithms in this setting -- a popular collaboration method known as \emph{weighted gradient averaging}, and a novel \emph{bias correction} method -- and explore conditions under which we can achieve linear speedup w.r.t. the number of auxiliary tasks $N$. Further, we also empirically study their performance confirming our theoretical insights.
    Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models. (arXiv:2103.04922v3 [cs.LG] UPDATED)
    Deep generative models are a class of techniques that train deep neural networks to model the distribution of training samples. Research has fragmented into various interconnected approaches, each of which make trade-offs including run-time, diversity, and architectural restrictions. In particular, this compendium covers energy-based models, variational autoencoders, generative adversarial networks, autoregressive models, normalizing flows, in addition to numerous hybrid approaches. These techniques are compared and contrasted, explaining the premises behind each and how they are interrelated, while reviewing current state-of-the-art advances and implementations.
    Decoupling Local and Global Representations of Time Series. (arXiv:2202.02262v1 [cs.LG])
    Real-world time series data are often generated from several sources of variation. Learning representations that capture the factors contributing to this variability enables a better understanding of the data via its underlying generative process and improves performance on downstream machine learning tasks. This paper proposes a novel generative approach for learning representations for the global and local factors of variation in time series. The local representation of each sample models non-stationarity over time with a stochastic process prior, and the global representation of the sample encodes the time-independent characteristics. To encourage decoupling between the representations, we introduce counterfactual regularization that minimizes the mutual information between the two variables. In experiments, we demonstrate successful recovery of the true local and global variability factors on simulated data, and show that representations learned using our method yield superior performance on downstream tasks on real-world datasets. We believe that the proposed way of defining representations is beneficial for data modelling and yields better insights into the complexity of real-world data.
    Affine-Invariant Integrated Rank-Weighted Depth: Definition, Properties and Finite Sample Analysis. (arXiv:2106.11068v3 [stat.ML] UPDATED)
    Because it determines a center-outward ordering of observations in $\mathbb{R}^d$ with $d\geq 2$, the concept of statistical depth permits to define quantiles and ranks for multivariate data and use them for various statistical tasks (e.g. inference, hypothesis testing). Whereas many depth functions have been proposed \textit{ad-hoc} in the literature since the seminal contribution of \cite{Tukey75}, not all of them possess the properties desirable to emulate the notion of quantile function for univariate probability distributions. In this paper, we propose an extension of the \textit{integrated rank-weighted} statistical depth (IRW depth in abbreviated form) originally introduced in \cite{IRW}, modified in order to satisfy the property of \textit{affine-invariance}, fulfilling thus all the four key axioms listed in the nomenclature elaborated by \cite{ZuoS00a}. The variant we propose, referred to as the Affine-Invariant IRW depth (AI-IRW in short), involves the covariance/precision matrices of the (supposedly square integrable) $d$-dimensional random vector $X$ under study, in order to take into account the directions along which $X$ is most variable to assign a depth value to any point $x\in \mathbb{R}^d$. The accuracy of the sampling version of the AI-IRW depth is investigated from a nonasymptotic perspective. Namely, a concentration result for the statistical counterpart of the AI-IRW depth is proved. Beyond the theoretical analysis carried out, applications to anomaly detection are considered and numerical results are displayed, providing strong empirical evidence of the relevance of the depth function we propose here.
    BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. (arXiv:2202.02005v1 [cs.RO])
    In this paper, we study the problem of enabling a vision-based robotic manipulation system to generalize to novel tasks, a long-standing challenge in robot learning. We approach the challenge from an imitation learning perspective, aiming to study how scaling and broadening the data collected can facilitate such generalization. To that end, we develop an interactive and flexible imitation learning system that can learn from both demonstrations and interventions and can be conditioned on different forms of information that convey the task, including pre-trained embeddings of natural language or videos of humans performing the task. When scaling data collection on a real robot to more than 100 distinct tasks, we find that this system can perform 24 unseen manipulation tasks with an average success rate of 44%, without any robot demonstrations for those tasks.
    Learning What to Memorize: Using Intrinsic Motivation to Form Useful Memory in Partially Observable Reinforcement Learning. (arXiv:2110.12810v2 [cs.LG] UPDATED)
    Reinforcement Learning faces an important challenge in partial observable environments that has long-term dependencies. In order to learn in an ambiguous environment, an agent has to keep previous perceptions in a memory. Earlier memory based approaches use a fixed method to determine what to keep in the memory, which limits them to certain problems. In this study, we follow the idea of giving the control of the memory to the agent by allowing it to have memory-changing actions. This learning mechanism is supported by an intrinsic motivation to memorize rare observations that can help the agent to disambiguate its state in the environment. Our approach is experimented and analyzed on several partial observable tasks with long-term dependencies and compared with other memory based methods.
    Image-to-Image MLP-mixer for Image Reconstruction. (arXiv:2202.02018v1 [cs.CV])
    Neural networks are highly effective tools for image reconstruction problems such as denoising and compressive sensing. To date, neural networks for image reconstruction are almost exclusively convolutional. The most popular architecture is the U-Net, a convolutional network with a multi-resolution architecture. In this work, we show that a simple network based on the multi-layer perceptron (MLP)-mixer enables state-of-the art image reconstruction performance without convolutions and without a multi-resolution architecture, provided that the training set and the size of the network are moderately large. Similar to the original MLP-mixer, the image-to-image MLP-mixer is based exclusively on MLPs operating on linearly-transformed image patches. Contrary to the original MLP-mixer, we incorporate structure by retaining the relative positions of the image patches. This imposes an inductive bias towards natural images which enables the image-to-image MLP-mixer to learn to denoise images based on fewer examples than the original MLP-mixer. Moreover, the image-to-image MLP-mixer requires fewer parameters to achieve the same denoising performance than the U-Net and its parameters scale linearly in the image resolution instead of quadratically as for the original MLP-mixer. If trained on a moderate amount of examples for denoising, the image-to-image MLP-mixer outperforms the U-Net by a slight margin. It also outperforms the vision transformer tailored for image reconstruction and classical un-trained methods such as BM3D, making it a very effective tool for image reconstruction problems.  ( 2 min )
    Cross-Modality Multi-Atlas Segmentation Using Deep Neural Networks. (arXiv:2202.02000v1 [eess.IV])
    Multi-atlas segmentation (MAS) is a promising framework for medical image segmentation. Generally, MAS methods register multiple atlases, i.e., medical images with corresponding labels, to a target image; and the transformed atlas labels can be combined to generate target segmentation via label fusion schemes. Many conventional MAS methods employed the atlases from the same modality as the target image. However, the number of atlases with the same modality may be limited or even missing in many clinical applications. Besides, conventional MAS methods suffer from the computational burden of registration or label fusion procedures. In this work, we design a novel cross-modality MAS framework, which uses available atlases from a certain modality to segment a target image from another modality. To boost the computational efficiency of the framework, both the image registration and label fusion are achieved by well-designed deep neural networks. For the atlas-to-target image registration, we propose a bi-directional registration network (BiRegNet), which can efficiently align images from different modalities. For the label fusion, we design a similarity estimation network (SimNet), which estimates the fusion weight of each atlas by measuring its similarity to the target image. SimNet can learn multi-scale information for similarity estimation to improve the performance of label fusion. The proposed framework was evaluated by the left ventricle and liver segmentation tasks on the MM-WHS and CHAOS datasets, respectively. Results have shown that the framework is effective for cross-modality MAS in both registration and label fusion. The code will be released publicly on \url{https://github.com/NanYoMy/cmmas} once the manuscript is accepted.  ( 2 min )
    Complex-to-Real Random Features for Polynomial Kernels. (arXiv:2202.02031v1 [stat.ML])
    Kernel methods are ubiquitous in statistical modeling due to their theoretical guarantees as well as their competitive empirical performance. Polynomial kernels are of particular importance as their feature maps model the interactions between the dimensions of the input data. However, the construction time of explicit feature maps scales exponentially with the polynomial degree and a naive application of the kernel trick does not scale to large datasets. In this work, we propose Complex-to-Real (CtR) random features for polynomial kernels that leverage intermediate complex random projections and can yield kernel estimates with much lower variances than their real-valued analogs. The resulting features are real-valued, simple to construct and have the following advantages over the state-of-the-art: 1) shorter construction times, 2) lower kernel approximation errors for commonly used degrees, 3) they enable us to obtain a closed-form expression for their variance.
    Data Scaling Laws in NMT: The Effect of Noise and Architecture. (arXiv:2202.01994v1 [cs.LG])
    In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. In particular, we change the following (1) Architecture and task setup: We compare to a transformer-LSTM hybrid, and a decoder-only transformer with a language modeling loss (2) Noise level in the training distribution: We experiment with filtering, and adding iid synthetic noise. In all the above cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data. Lastly, we find that using back-translated data instead of parallel data, can significantly degrade the scaling exponent.
    Implementation of a Type-2 Fuzzy Logic Based Prediction System for the Nigerian Stock Exchange. (arXiv:2202.02107v1 [cs.AI])
    Stock Market can be easily seen as one of the most attractive places for investors, but it is also very complex in terms of making trading decisions. Predicting the market is a risky venture because of the uncertainties and nonlinear nature of the market. Deciding on the right time to trade is key to every successful trader as it can lead to either a huge gain of money or totally a loss in investment that will be recorded as a careless trade. The aim of this research is to develop a prediction system for stock market using Fuzzy Logic Type2 which will handle these uncertainties and complexities of human behaviour in general when it comes to buy, hold or sell decision making in stock trading. The proposed system was developed using VB.NET programming language as frontend and Microsoft SQL Server as backend. A total of four different technical indicators were selected for this research. The selected indicators are the Relative Strength Index, William Average, Moving Average Convergence and Divergence, and Stochastic Oscillator. These indicators serve as input variable to the Fuzzy System. The MACD and SO are deployed as primary indicators, while the RSI and WA are used as secondary indicators. Fibonacci retracement ratio was adopted for the secondary indicators to determine their support and resistance level in terms of making trading decisions. The input variables to the Fuzzy System is fuzzified to Low, Medium, and High using the Triangular and Gaussian Membership Function. The Mamdani Type Fuzzy Inference rules were used for combining the trading rules for each input variable to the fuzzy system. The developed system was tested using sample data collected from ten different companies listed on the Nigerian Stock Exchange for a total of fifty two periods. The dataset collected are Opening, High, Low, and Closing prices of each security.
    Interpretability methods of machine learning algorithms with applications in breast cancer diagnosis. (arXiv:2202.02131v1 [cs.LG])
    Early detection of breast cancer is a powerful tool towards decreasing its socioeconomic burden. Although, artificial intelligence (AI) methods have shown remarkable results towards this goal, their "black box" nature hinders their wide adoption in clinical practice. To address the need for AI guided breast cancer diagnosis, interpretability methods can be utilized. In this study, we used AI methods, i.e., Random Forests (RF), Neural Networks (NN) and Ensembles of Neural Networks (ENN), towards this goal and explained and optimized their performance through interpretability techniques, such as the Global Surrogate (GS) method, the Individual Conditional Expectation (ICE) plots and the Shapley values (SV). The Wisconsin Diagnostic Breast Cancer (WDBC) dataset of the open UCI repository was used for the training and evaluation of the AI algorithms. The best performance for breast cancer diagnosis was achieved by the proposed ENN (96.6% accuracy and 0.96 area under the ROC curve), and its predictions were explained by ICE plots, proving that its decisions were compliant with current medical knowledge and can be further utilized to gain new insights in the pathophysiological mechanisms of breast cancer. Feature selection based on features' importance according to the GS model improved the performance of the RF (leading the accuracy from 96.49% to 97.18% and the area under the ROC curve from 0.96 to 0.97) and feature selection based on features' importance according to SV improved the performance of the NN (leading the accuracy from 94.6% to 95.53% and the area under the ROC curve from 0.94 to 0.95). Compared to other approaches on the same dataset, our proposed models demonstrated state of the art performance while being interpretable.
    LTU Attacker for Membership Inference. (arXiv:2202.02278v1 [cs.LG])
    We address the problem of defending predictive models, such as machine learning classifiers (Defender models), against membership inference attacks, in both the black-box and white-box setting, when the trainer and the trained model are publicly released. The Defender aims at optimizing a dual objective: utility and privacy. Both utility and privacy are evaluated with an external apparatus including an Attacker and an Evaluator. On one hand, Reserved data, distributed similarly to the Defender training data, is used to evaluate Utility; on the other hand, Reserved data, mixed with Defender training data, is used to evaluate membership inference attack robustness. In both cases classification accuracy or error rate are used as the metric: Utility is evaluated with the classification accuracy of the Defender model; Privacy is evaluated with the membership prediction error of a so-called "Leave-Two-Unlabeled" LTU Attacker, having access to all of the Defender and Reserved data, except for the membership label of one sample from each. We prove that, under certain conditions, even a "na\"ive" LTU Attacker can achieve lower bounds on privacy loss with simple attack strategies, leading to concrete necessary conditions to protect privacy, including: preventing over-fitting and adding some amount of randomness. However, we also show that such a na\"ive LTU Attacker can fail to attack the privacy of models known to be vulnerable in the literature, demonstrating that knowledge must be complemented with strong attack strategies to turn the LTU Attacker into a powerful means of evaluating privacy. Our experiments on the QMNIST and CIFAR-10 datasets validate our theoretical results and confirm the roles of over-fitting prevention and randomness in the algorithms to protect against privacy attacks.
    Generative Modeling of Complex Data. (arXiv:2202.02145v1 [cs.LG])
    In recent years, several models have improved the capacity to generate synthetic tabular datasets. However, such models focus on synthesizing simple columnar tables and are not useable on real-life data with complex structures. This paper puts forward a generic framework to synthesize more complex data structures with composite and nested types. It then proposes one practical implementation, built with causal transformers, for struct (mappings of types) and lists (repeated instances of a type). The results on standard benchmark datasets show that such implementation consistently outperforms current state-of-the-art models both in terms of machine learning utility and statistical similarity. Moreover, it shows very strong results on two complex hierarchical datasets with multiple nesting and sparse data, that were previously out of reach.  ( 2 min )
    Musical Audio Similarity with Self-supervised Convolutional Neural Networks. (arXiv:2202.02112v1 [cs.SD])
    We have built a music similarity search engine that lets video producers search by listenable music excerpts, as a complement to traditional full-text search. Our system suggests similar sounding track segments in a large music catalog by training a self-supervised convolutional neural network with triplet loss terms and musical transformations. Semi-structured user interviews demonstrate that we can successfully impress professional video producers with the quality of the search experience, and perceived similarities to query tracks averaged 7.8/10 in user testing. We believe this search tool will make for a more natural search experience that is easier to find music to soundtrack videos with.  ( 2 min )
    Group invariant machine learning by fundamental domain projections. (arXiv:2202.02164v1 [cs.LG])
    We approach the well-studied problem of supervised group invariant and equivariant machine learning from the point of view of geometric topology. We propose a novel approach using a pre-processing step, which involves projecting the input data into a geometric space which parametrises the orbits of the symmetry group. This new data can then be the input for an arbitrary machine learning model (neural network, random forest, support-vector machine etc). We give an algorithm to compute the geometric projection, which is efficient to implement, and we illustrate our approach on some example machine learning problems (including the well-studied problem of predicting Hodge numbers of CICY matrices), in each case finding an improvement in accuracy versus others in the literature. The geometric topology viewpoint also allows us to give a unified description of so-called intrinsic approaches to group equivariant machine learning, which encompasses many other approaches in the literature.  ( 2 min )
    Twitter Referral Behaviours on News Consumption with Ensemble Clustering of Click-Stream Data in Turkish Media. (arXiv:2202.02056v1 [cs.SI])
    Click-stream data, which comes with a massive volume generated by the human activities on the websites, has become a prominent feature to identify readers' characteristics by the newsrooms after the digitization of the news outlets. It is essential to have elastic architectures to process the streaming data, particularly for unprecedented traffic, enabling conducting more comprehensive analyses such as recommending mostly related articles to the readers. Although the nature of click-stream data has a similar logic within the websites, it has inherent limitations to recognize human behaviors when looking from a broad perspective, which brings the need of limiting the problem in niche areas. This study investigates the anonymized readers' click activities in the organizations' websites to identify news consumption patterns following referrals from Twitter, who incidentally reach but propensity is mainly the routed news content. The investigation is widened to a broad perspective by linking the log data with news content to enrich the insights rather than sticking into the web journey. The methodologies on ensemble cluster analysis with mixed-type embedding strategies are applied and compared to find similar reader groups and interests independent from time. Our results demonstrate that the quality of clustering mixed-type data set approaches to optimal internal validation scores when embedded by Uniform Manifold Approximation and Projection (UMAP) and using consensus function as a key to access the most applicable hyper parameter configurations in the given ensemble rather than using consensus function results directly. Evaluation of the resulting clusters highlights specific clusters repeatedly present in the samples, which provide insights to the news organizations and overcome the degradation of the modeling behaviors due to the change in the interest over time.  ( 2 min )
    Correcting Confounding via Random Selection of Background Variables. (arXiv:2202.02150v1 [stat.ML])
    We propose a method to distinguish causal influence from hidden confounding in the following scenario: given a target variable Y, potential causal drivers X, and a large number of background features, we propose a novel criterion for identifying causal relationship based on the stability of regression coefficients of X on Y with respect to selecting different background features. To this end, we propose a statistic V measuring the coefficient's variability. We prove, subject to a symmetry assumption for the background influence, that V converges to zero if and only if X contains no causal drivers. In experiments with simulated data, the method outperforms state of the art algorithms. Further, we report encouraging results for real-world data. Our approach aligns with the general belief that causal insights admit better generalization of statistical associations across environments, and justifies similar existing heuristic approaches from the literature.  ( 2 min )
    Proceedings 10th International Workshop on Theorem Proving Components for Educational Software. (arXiv:2202.02144v1 [cs.LO])
    This EPTCS volume contains the proceedings of the ThEdu'21 workshop, promoted on 11 July 2021, as a satellite event of CADE-28. Due to the COVID-19 pandemic, CADE-28 and all its co-located events happened as virtual events. ThEdu'21 was a vibrant workshop, with an invited talk by Gilles Dowek (ENS Paris-Saclay), eleven contributions, and one demonstration. After the workshop an open call for papers was issued and attracted 10 submissions, 7 of which have been accepted by the reviewers, and collected in the present post-proceedings volume. The ThEdu series pursues the smooth transition from an intuitive way of doing mathematics at secondary school to a more formal approach to the subject in STEM education, while favouring software support for this transition by exploiting the power of theorem-proving technologies. The volume editors hope that this collection of papers will further promote the development of theorem-proving based software, and that it will collaborate on improving mutual understanding between computer scientists, mathematicians and stakeholders in education.  ( 2 min )
    Towards a consistent interpretation of AIOps models. (arXiv:2202.02298v1 [cs.LG])
    Artificial Intelligence for IT Operations (AIOps) has been adopted in organizations in various tasks, including interpreting models to identify indicators of service failures. To avoid misleading practitioners, AIOps model interpretations should be consistent (i.e., different AIOps models on the same task agree with one another on feature importance). However, many AIOps studies violate established practices in the machine learning community when deriving interpretations, such as interpreting models with suboptimal performance, though the impact of such violations on the interpretation consistency has not been studied. In this paper, we investigate the consistency of AIOps model interpretation along three dimensions: internal consistency, external consistency, and time consistency. We conduct a case study on two AIOps tasks: predicting Google cluster job failures, and Backblaze hard drive failures. We find that the randomness from learners, hyperparameter tuning, and data sampling should be controlled to generate consistent interpretations. AIOps models with AUCs greater than 0.75 yield more consistent interpretation compared to low-performing models. Finally, AIOps models that are constructed with the Sliding Window or Full History approaches have the most consistent interpretation with the trends presented in the entire datasets. Our study provides valuable guidelines for practitioners to derive consistent AIOps model interpretation.
    Contention Window Optimization in IEEE 802.11ax Networks with Deep Reinforcement Learning. (arXiv:2003.01492v5 [cs.NI] UPDATED)
    The proper setting of contention window (CW) values has a significant impact on the efficiency of Wi-Fi networks. Unfortunately, the standard method used by 802.11 networks is not scalable enough to maintain stable throughput for an increasing number of stations, yet it remains the default method of channel access for 802.11ax single-user transmissions. Therefore, we propose a new method of CW control, which leverages deep reinforcement learning (DRL) principles to learn the correct settings under different network conditions. Our method, called centralized contention window optimization with DRL (CCOD), supports two trainable control algorithms: deep Q-network (DQN) and deep deterministic policy gradient (DDPG). We demonstrate through simulations that it offers efficiency close to optimal (even in dynamic topologies) while keeping computational cost low.
    Video Violence Recognition and Localization using a Semi-Supervised Hard-Attention Model. (arXiv:2202.02212v1 [cs.CV])
    Empowering automated violence monitoring and surveillance systems amid the growing social violence and extremist activities worldwide could keep communities safe and save lives. The questionable reliability of human monitoring personnel and the increasing number of surveillance cameras makes automated artificial intelligence-based solutions compelling. Improving the current state-of-the-art deep learning approaches to video violence recognition to higher levels of accuracy and performance could enable surveillance systems to be more reliable and scalable. The main contribution of the proposed deep reinforcement learning method is to achieve state-of-the-art accuracy on RWF, Hockey, and Movies datasets while removing some of the computationally expensive processes and input features used in the previous solutions. The implementation of hard attention using a semi-supervised learning method made the proposed method capable of rough violence localization and added increased agent interpretability to the violence detection system.
    Deep invariant networks with differentiable augmentation layers. (arXiv:2202.02142v1 [cs.LG])
    Designing learning systems which are invariant to certain data transformations is critical in machine learning. Practitioners can typically enforce a desired invariance on the trained model through the choice of a network architecture, e.g. using convolutions for translations, or using data augmentation. Yet, enforcing true invariance in the network can be difficult, and data invariances are not always known a piori. State-of-the-art methods for learning data augmentation policies require held-out data and are based on bilevel optimization problems, which are complex to solve and often computationally demanding. In this work we investigate new ways of learning invariances only from the training data. Using learnable augmentation layers built directly in the network, we demonstrate that our method is very versatile. It can incorporate any type of differentiable augmentation and be applied to a broad class of learning problems beyond computer vision. We provide empirical evidence showing that our approach is easier and faster to train than modern automatic data augmentation techniques based on bilevel optimization, while achieving comparable results. Experiments show that while the invariances transferred to a model through automatic data augmentation are limited by the model expressivity, the invariance yielded by our approach is insensitive to it by design.  ( 2 min )
    Source data selection for out-of-domain generalization. (arXiv:2202.02155v1 [cs.LG])
    Models that perform out-of-domain generalization borrow knowledge from heterogeneous source data and apply it to a related but distinct target task. Transfer learning has proven effective for accomplishing this generalization in many applications. However, poor selection of a source dataset can lead to poor performance on the target, a phenomenon called negative transfer. In order to take full advantage of available source data, this work studies source data selection with respect to a target task. We propose two source selection methods that are based on the multi-bandit theory and random search, respectively. We conduct a thorough empirical evaluation on both simulated and real data. Our proposals can be also viewed as diagnostics for the existence of a reweighted source subsamples that perform better than the random selection of available samples.  ( 2 min )
    Robust Linear Regression for General Feature Distribution. (arXiv:2202.02080v1 [cs.LG])
    We investigate robust linear regression where data may be contaminated by an oblivious adversary, i.e., an adversary than may know the data distribution but is otherwise oblivious to the realizations of the data samples. This model has been previously analyzed under strong assumptions. Concretely, $\textbf{(i)}$ all previous works assume that the covariance matrix of the features is positive definite; and $\textbf{(ii)}$ most of them assume that the features are centered (i.e. zero mean). Additionally, all previous works make additional restrictive assumption, e.g., assuming that the features are Gaussian or that the corruptions are symmetrically distributed. In this work we go beyond these assumptions and investigate robust regression under a more general set of assumptions: $\textbf{(i)}$ we allow the covariance matrix to be either positive definite or positive semi definite, $\textbf{(ii)}$ we do not necessarily assume that the features are centered, $\textbf{(iii)}$ we make no further assumption beyond boundedness (sub-Gaussianity) of features and measurement noise. Under these assumption we analyze a natural SGD variant for this problem and show that it enjoys a fast convergence rate when the covariance matrix is positive definite. In the positive semi definite case we show that there are two regimes: if the features are centered we can obtain a standard convergence rate; otherwise the adversary can cause any learner to fail arbitrarily.  ( 2 min )
    Polyphonic pitch detection with convolutional recurrent neural networks. (arXiv:2202.02115v1 [cs.SD])
    Recent directions in automatic speech recognition (ASR) research have shown that applying deep learning models from image recognition challenges in computer vision is beneficial. As automatic music transcription (AMT) is superficially similar to ASR, in the sense that methods often rely on transforming spectrograms to symbolic sequences of events (e.g. words or notes), deep learning should benefit AMT as well. In this work, we outline an online polyphonic pitch detection system that streams audio to MIDI by ConvLSTMs. Our system achieves state-of-the-art results on the 2007 MIREX multi-F0 development set, with an F-measure of 83\% on the bassoon, clarinet, flute, horn and oboe ensemble recording without requiring any musical language modelling or assumptions of instrument timbre.  ( 2 min )
    Capturing and incorporating expert knowledge into machine learning models for quality prediction in manufacturing. (arXiv:2202.02003v1 [cs.LG])
    Increasing digitalization enables the use of machine learning methods for analyzing and optimizing manufacturing processes. A main application of machine learning is the construction of quality prediction models, which can be used, among other things, for documentation purposes, as assistance systems for process operators, or for adaptive process control. The quality of such machine learning models typically strongly depends on the amount and the quality of data used for training. In manufacturing, the size of available datasets before start of production is often limited. In contrast to data, expert knowledge commonly is available in manufacturing. Therefore, this study introduces a general methodology for building quality prediction models with machine learning methods on small datasets by integrating shape expert knowledge, that is, prior knowledge about the shape of the input-output relationship to be learned. The proposed methodology is applied to a brushing process with $125$ data points for predicting the surface roughness as a function of five process variables. As opposed to conventional machine learning methods for small datasets, the proposed methodology produces prediction models that strictly comply with all the expert knowledge specified by the involved process specialists. In particular, the direct involvement of process experts in the training of the models leads to a very clear interpretation and, by extension, to a high acceptance of the models. Another merit of the proposed methodology is that, in contrast to most conventional machine learning methods, it involves no time-consuming and often heuristic hyperparameter tuning or model selection step.  ( 2 min )
    TIML: Task-Informed Meta-Learning for Agriculture. (arXiv:2202.02124v1 [cs.LG])
    Labeled datasets for agriculture are extremely spatially imbalanced. When developing algorithms for data-sparse regions, a natural approach is to use transfer learning from data-rich regions. While standard transfer learning approaches typically leverage only direct inputs and outputs, geospatial imagery and agricultural data are rich in metadata that can inform transfer learning algorithms, such as the spatial coordinates of data-points or the class of task being learned. We build on previous work exploring the use of meta-learning for agricultural contexts in data-sparse regions and introduce task-informed meta-learning (TIML), an augmentation to model-agnostic meta-learning which takes advantage of task-specific metadata. We apply TIML to crop type classification and yield estimation, and find that TIML significantly improves performance compared to a range of benchmarks in both contexts, across a diversity of model architectures. While we focus on tasks from agriculture, TIML could offer benefits to any meta-learning setup with task-specific metadata, such as classification of geo-tagged images and species distribution modelling.  ( 2 min )
    Choosing an Appropriate Platform and Workflow for Processing Camera Trap Data using Artificial Intelligence. (arXiv:2202.02283v1 [cs.LG])
    Camera traps have transformed how ecologists study wildlife species distributions, activity patterns, and interspecific interactions. Although camera traps provide a cost-effective method for monitoring species, the time required for data processing can limit survey efficiency. Thus, the potential of Artificial Intelligence (AI), specifically Deep Learning (DL), to process camera-trap data has gained considerable attention. Using DL for these applications involves training algorithms, such as Convolutional Neural Networks (CNNs), to automatically detect objects and classify species. To overcome technical challenges associated with training CNNs, several research communities have recently developed platforms that incorporate DL in easy-to-use interfaces. We review key characteristics of four AI-powered platforms -- Wildlife Insights (WI), MegaDetector (MD), Machine Learning for Wildlife Image Classification (MLWIC2), and Conservation AI -- including data management tools and AI features. We also provide R code in an open-source GitBook, to demonstrate how users can evaluate model performance, and incorporate AI output in semi-automated workflows. We found that species classifications from WI and MLWIC2 generally had low recall values (animals that were present in the images often were not classified to the correct species). Yet, the precision of WI and MLWIC2 classifications for some species was high (i.e., when classifications were made, they were generally accurate). MD, which classifies images using broader categories (e.g., "blank" or "animal"), also performed well. Thus, we conclude that, although species classifiers were not accurate enough to automate image processing, DL could be used to improve efficiencies by accepting classifications with high confidence values for certain species or by filtering images containing blanks.
    Numerical Demonstration of Multiple Actuator Constraint Enforcement Algorithm for a Molten Salt Loop. (arXiv:2202.02094v1 [eess.SY])
    To advance the paradigm of autonomous operation for nuclear power plants, a data-driven machine learning approach to control is sought. Autonomous operation for next-generation reactor designs is anticipated to bolster safety and improve economics. However, any algorithms that are utilized need to be interpretable, adaptable, and robust. In this work, we focus on the specific problem of optimal control during autonomous operation. We will demonstrate an interpretable and adaptable data-driven machine learning approach to autonomous control of a molten salt loop. To address interpretability, we utilize a data-driven algorithm to identify system dynamics in state-space representation. To address adaptability, a control algorithm will be utilized to modify actuator setpoints while enforcing constant, and time-dependent constraints. Robustness is not addressed in this work, and is part of future work. To demonstrate the approach, we designed a numerical experiment requiring intervention to enforce constraints during a load-follow type transient.  ( 2 min )
    Urban Region Profiling via A Multi-Graph Representation Learning Framework. (arXiv:2202.02074v1 [cs.AI])
    Urban region profiling can benefit urban analytics. Although existing studies have made great efforts to learn urban region representation from multi-source urban data, there are still three limitations: (1) Most related methods focused merely on global-level inter-region relations while overlooking local-level geographical contextual signals and intra-region information; (2) Most previous works failed to develop an effective yet integrated fusion module which can deeply fuse multi-graph correlations; (3) State-of-the-art methods do not perform well in regions with high variance socioeconomic attributes. To address these challenges, we propose a multi-graph representative learning framework, called Region2Vec, for urban region profiling. Specifically, except that human mobility is encoded for inter-region relations, geographic neighborhood is introduced for capturing geographical contextual information while POI side information is adopted for representing intra-region information by knowledge graph. Then, graphs are used to capture accessibility, vicinity, and functionality correlations among regions. To consider the discriminative properties of multiple graphs, an encoder-decoder multi-graph fusion module is further proposed to jointly learn comprehensive representations. Experiments on real-world datasets show that Region2Vec can be employed in three applications and outperforms all state-of-the-art baselines. Particularly, Region2Vec has better performance than previous studies in regions with high variance socioeconomic attributes.  ( 2 min )
    Identifiability of Label Noise Transition Matrix. (arXiv:2202.02016v1 [cs.LG])
    The noise transition matrix plays a central role in the problem of learning from noisy labels. Among many other reasons, a significant number of existing solutions rely on access to it. Estimating the transition matrix without using ground truth labels is a critical and challenging task. When label noise transition depends on each instance, the problem of identifying the instance-dependent noise transition matrix becomes substantially more challenging. Despite recent works proposing solutions for learning from instance-dependent noisy labels, we lack a unified understanding of when such a problem remains identifiable, and therefore learnable. This paper seeks to provide answers to a sequence of related questions: What are the primary factors that contribute to the identifiability of a noise transition matrix? Can we explain the observed empirical successes? When a problem is not identifiable, what can we do to make it so? We will relate our theoretical findings to the literature and hope to provide guidelines for developing effective solutions for battling instance-dependent label noise.  ( 2 min )
    To Impute or not to Impute? -- Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v1 [stat.ML])
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.  ( 2 min )
    Introducing Block-Toeplitz Covariance Matrices to Remaster Linear Discriminant Analysis for Event-related Potential Brain-computer Interfaces. (arXiv:2202.02001v1 [cs.LG])
    Covariance matrices of noisy multichannel electroencephalogram time series data are hard to estimate due to high dimensionality. In brain-computer interfaces (BCI) based on event-related potentials and a linear discriminant analysis (LDA) for classification, the state of the art to address this problem is by shrinkage regularization. We propose a novel idea to tackle this problem by enforcing a block-Toeplitz structure for the covariance matrix of the LDA, which implements an assumption of signal stationarity in short time windows for each channel. On data of 213 subjects collected under 13 event-related potential BCI protocols, the resulting 'ToeplitzLDA' significantly increases the binary classification performance compared to shrinkage regularized LDA (up to 6 AUC points) and Riemannian classification approaches (up to 2 AUC points). This translates to greatly improved application level performances, as exemplified on data recorded during an unsupervised visual speller application, where spelling errors could be reduced by 81% on average for 25 subjects. Aside from lower memory and time complexity for LDA training, ToeplitzLDA proved to be almost invariant even to a twenty-fold time dimensionality enlargement, which reduces the need of expert knowledge regarding feature extraction.  ( 2 min )
    Multi-Output Gaussian Process-Based Data Augmentation for Multi-Building and Multi-Floor Indoor Localization. (arXiv:2202.01980v1 [cs.NI])
    Location fingerprinting based on RSSI becomes a mainstream indoor localization technique due to its advantage of not requiring the installation of new infrastructure and the modification of existing devices, especially given the prevalence of Wi-Fi-enabled devices and the ubiquitous Wi-Fi access in modern buildings. The use of AI/ML technologies like DNNs makes location fingerprinting more accurate and reliable, especially for large-scale multi-building and multi-floor indoor localization. The application of DNNs for indoor localization, however, depends on a large amount of preprocessed and deliberately-labeled data for their training. Considering the difficulty of the data collection in an indoor environment, especially under the current epidemic situation of COVID-19, we investigate three different methods of RSSI data augmentation based on Multi-Output Gaussian Process (MOGP), i.e., by a single floor, by neighboring floors, and by a single building; unlike Single-Output Gaussian Process (SOGP), MOGP can take into account the correlation among RSSI observations from multiple Access Points (APs) deployed closely to each other (e.g., APs on the same floor of a building) by collectively handling them. The feasibility of the MOGP-based RSSI data augmentation is demonstrated through experiments based on the state-of-the-art RNN indoor localization model and the UJIIndoorLoc, i.e., the most popular publicly-available multi-building and multi-floor indoor localization database, where the RNN model trained with the UJIIndoorLoc database augmented by using the whole RSSI data of a building in fitting an MOGP model (i.e., by a single building) outperforms the other two augmentation methods as well as the RNN model trained with the original UJIIndoorLoc database, resulting in the mean three-dimensional positioning error of 8.42 m.  ( 3 min )
    Robust Vector Quantized-Variational Autoencoder. (arXiv:2202.01987v1 [cs.LG])
    Image generative models can learn the distributions of the training data and consequently generate examples by sampling from these distributions. However, when the training dataset is corrupted with outliers, generative models will likely produce examples that are also similar to the outliers. In fact, a small portion of outliers may induce state-of-the-art generative models, such as Vector Quantized-Variational AutoEncoder (VQ-VAE), to learn a significant mode from the outliers. To mitigate this problem, we propose a robust generative model based on VQ-VAE, which we name Robust VQ-VAE (RVQ-VAE). In order to achieve robustness, RVQ-VAE uses two separate codebooks for the inliers and outliers. To ensure the codebooks embed the correct components, we iteratively update the sets of inliers and outliers during each training epoch. To ensure that the encoded data points are matched to the correct codebooks, we quantize using a weighted Euclidean distance, whose weights are determined by directional variances of the codebooks. Both codebooks, together with the encoder and decoder, are trained jointly according to the reconstruction loss and the quantization loss. We experimentally demonstrate that RVQ-VAE is able to generate examples from inliers even if a large portion of the training data points are corrupted.  ( 2 min )
    Neural Dual Contouring. (arXiv:2202.01999v1 [cs.CV])
    We introduce neural dual contouring (NDC), a new data-driven approach to mesh reconstruction based on dual contouring (DC). Like traditional DC, it produces exactly one vertex per grid cell and one quad for each grid edge intersection, a natural and efficient structure for reproducing sharp features. However, rather than computing vertex locations and edge crossings with hand-crafted functions that depend directly on difficult-to-obtain surface gradients, NDC uses a neural network to predict them. As a result, NDC can be trained to produce meshes from signed or unsigned distance fields, binary voxel grids, or point clouds (with or without normals); and it can produce open surfaces in cases where the input represents a sheet or partial surface. During experiments with five prominent datasets, we find that NDC, when trained on one of the datasets, generalizes well to the others. Furthermore, NDC provides better surface reconstruction accuracy, feature preservation, output complexity, triangle quality, and inference time in comparison to previous learned (e.g., neural marching cubes, convolutional occupancy networks) and traditional (e.g., Poisson) methods.  ( 2 min )
    Performance of multilabel machine learning models and risk stratification schemas for predicting stroke and bleeding risk in patients with non-valvular atrial fibrillation. (arXiv:2202.01975v1 [q-bio.QM])
    Appropriate antithrombotic therapy for patients with atrial fibrillation (AF) requires assessment of ischemic stroke and bleeding risks. However, risk stratification schemas such as CHA2DS2-VASc and HAS-BLED have modest predictive capacity for patients with AF. Machine learning (ML) techniques may improve predictive performance and support decision-making for appropriate antithrombotic therapy. We compared the performance of multilabel ML models with the currently used risk scores for predicting outcomes in AF patients. Materials and Methods This was a retrospective cohort study of 9670 patients, mean age 76.9 years, 46% women, who were hospitalized with non-valvular AF, and had 1-year follow-up. The primary outcome was ischemic stroke and major bleeding admission. The secondary outcomes were all-cause death and event-free survival. The discriminant power of ML models was compared with clinical risk scores by the area under the curve (AUC). Risk stratification was assessed using the net reclassification index. Results Multilabel gradient boosting machine provided the best discriminant power for stroke, major bleeding, and death (AUC = 0.685, 0.709, and 0.765 respectively) compared to other ML models. It provided modest performance improvement for stroke compared to CHA2DS2-VASc (AUC = 0.652), but significantly improved major bleeding prediction compared to HAS-BLED (AUC = 0.522). It also had a much greater discriminant power for death compared with CHA2DS2-VASc (AUC = 0.606). Also, models identified additional risk features (such as hemoglobin level, renal function, etc.) for each outcome. Conclusions Multilabel ML models can outperform clinical risk stratification scores for predicting the risk of major bleeding and death in non-valvular AF patients.  ( 2 min )
    Hybrid Neural Coded Modulation: Design and Training Methods. (arXiv:2202.01972v1 [cs.IT])
    We propose a hybrid coded modulation scheme which composes of inner and outer codes. The outer-code can be any standard binary linear code with efficient soft decoding capability (e.g. low-density parity-check (LDPC) codes). The inner code is designed using a deep neural network (DNN) which takes the channel coded bits and outputs modulated symbols. For training the DNN, we propose to use a loss function that is inspired by the generalized mutual information. The resulting constellations are shown to outperform the conventional quadrature amplitude modulation (QAM) based coding scheme for modulation order 16 and 64 with 5G standard LDPC codes.  ( 2 min )
    Demystify Optimization and Generalization of Over-parameterized PAC-Bayesian Learning. (arXiv:2202.01958v1 [cs.LG])
    PAC-Bayesian is an analysis framework where the training error can be expressed as the weighted average of the hypotheses in the posterior distribution whilst incorporating the prior knowledge. In addition to being a pure generalization bound analysis tool, PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, making them a powerful and relevant framework that can numerically provide a tight generalization bound for supervised learning. For simplicity, we call probabilistic neural network learned using training objectives derived from PAC-Bayesian bounds as {\it PAC-Bayesian learning}. Despite their empirical success, the theoretical analysis of PAC-Bayesian learning for neural networks is rarely explored. This paper proposes a new class of convergence and generalization analysis for PAC-Bayes learning when it is used to train the over-parameterized neural networks by the gradient descent method. For a wide probabilistic neural network, we show that when PAC-Bayes learning is applied, the convergence result corresponds to solving a kernel ridge regression when the probabilistic neural tangent kernel (PNTK) is used as its kernel. Based on this finding, we further characterize the uniform PAC-Bayesian generalization bound which improves over the Rademacher complexity-based bound for non-probabilistic neural network. Finally, drawing the insight from our theoretical results, we propose a proxy measure for efficient hyperparameters selection, which is proven to be time-saving.  ( 2 min )
    A Reinforcement Learning Framework for PQoS in a Teleoperated Driving Scenario. (arXiv:2202.01949v1 [cs.NI])
    In recent years, autonomous networks have been designed with Predictive Quality of Service (PQoS) in mind, as a means for applications operating in the industrial and/or automotive sectors to predict unanticipated Quality of Service (QoS) changes and react accordingly. In this context, Reinforcement Learning (RL) has come out as a promising approach to perform accurate predictions, and optimize the efficiency and adaptability of wireless networks. Along these lines, in this paper we propose the design of a new entity, implemented at the RAN-level that, with the support of an RL framework, implements PQoS functionalities. Specifically, we focus on the design of the reward function of the learning agent, able to convert QoS estimates into appropriate countermeasures if QoS requirements are not satisfied. We demonstrate via ns-3 simulations that our approach achieves the best trade-off in terms of QoS and Quality of Experience (QoE) performance of end users in a teleoperated-driving-like scenario, compared to other baseline solutions.  ( 2 min )
    Generalizing to New Physical Systems via Context-Informed Dynamics Model. (arXiv:2202.01889v1 [cs.LG])
    Data-driven approaches to modeling physical systems fail to generalize to unseen systems that share the same general dynamics with the learning domain, but correspond to different physical contexts. We propose a new framework for this key problem, context-informed dynamics adaptation (CoDA), which takes into account the distributional shift across systems for fast and efficient adaptation to new dynamics. CoDA leverages multiple environments, each associated to a different dynamic, and learns to condition the dynamics model on contextual parameters, specific to each environment. The conditioning is performed via a hypernetwork, learned jointly with a context vector from observed data. The proposed formulation constrains the search hypothesis space to foster fast adaptation and better generalization across environments. It extends the expressivity of existing methods. We theoretically motivate our approach and show state-ofthe-art generalization results on a set of nonlinear dynamics, representative of a variety of application domains. We also show, on these systems, that new system parameters can be inferred from context vectors with minimal supervision.  ( 2 min )
    Net benefit, calibration, threshold selection, and training objectives for algorithmic fairness in healthcare. (arXiv:2202.01906v1 [stat.ML])
    A growing body of work uses the paradigm of algorithmic fairness to frame the development of techniques to anticipate and proactively mitigate the introduction or exacerbation of health inequities that may follow from the use of model-guided decision-making. We evaluate the interplay between measures of model performance, fairness, and the expected utility of decision-making to offer practical recommendations for the operationalization of algorithmic fairness principles for the development and evaluation of predictive models in healthcare. We conduct an empirical case-study via development of models to estimate the ten-year risk of atherosclerotic cardiovascular disease to inform statin initiation in accordance with clinical practice guidelines. We demonstrate that approaches that incorporate fairness considerations into the model training objective typically do not improve model performance or confer greater net benefit for any of the studied patient populations compared to the use of standard learning paradigms followed by threshold selection concordant with patient preferences, evidence of intervention effectiveness, and model calibration. These results hold when the measured outcomes are not subject to differential measurement error across patient populations and threshold selection is unconstrained, regardless of whether differences in model performance metrics, such as in true and false positive error rates, are present. In closing, we argue for focusing model development efforts on developing calibrated models that predict outcomes well for all patient populations while emphasizing that such efforts are complementary to transparent reporting, participatory design, and reasoning about the impact of model-informed interventions in context.  ( 2 min )
    Ad-datasets: a meta-collection of data sets for autonomous driving. (arXiv:2202.01909v1 [cs.LG])
    Autonomous driving is among the largest domains in which deep learning has been fundamental for progress within the last years. The rise of datasets went hand in hand with this development. All the more striking is the fact that researchers do not have a tool available that provides a quick, comprehensive and up-to-date overview of data sets and their features in the domain of autonomous driving. In this paper, we present ad-datasets, an online tool that provides such an overview for more than 150 data sets. The tool enables users to sort and filter the data sets according to currently 16 different categories. ad-datasets is an open-source project with community contributions. It is in constant development, ensuring that the content stays up-to-date.  ( 2 min )
    Sampling with Riemannian Hamiltonian Monte Carlo in a Constrained Space. (arXiv:2202.01908v1 [cs.LG])
    We demonstrate for the first time that ill-conditioned, non-smooth, constrained distributions in very high dimension, upwards of 100,000, can be sampled efficiently $\textit{in practice}$. Our algorithm incorporates constraints into the Riemannian version of Hamiltonian Monte Carlo and maintains sparsity. This allows us to achieve a mixing rate independent of smoothness and condition numbers. On benchmark data sets in systems biology and linear programming, our algorithm outperforms existing packages by orders of magnitude. In particular, we achieve a 1,000-fold speed-up for sampling from the largest published human metabolic network (RECON3D). Our package has been incorporated into the COBRA toolbox.  ( 2 min )
    Smartphone-based Hard-braking Event Detection at Scale for Road Safety Services. (arXiv:2202.01934v1 [cs.LG])
    Road crashes are the sixth leading cause of lost disability-adjusted life-years (DALYs) worldwide. One major challenge in traffic safety research is the sparsity of crashes, which makes it difficult to achieve a fine-grain understanding of crash causations and predict future crash risk in a timely manner. Hard-braking events have been widely used as a safety surrogate due to their relatively high prevalence and ease of detection with embedded vehicle sensors. As an alternative to using sensors fixed in vehicles, this paper presents a scalable approach for detecting hard-braking events using the kinematics data collected from smartphone sensors. We train a Transformer-based machine learning model for hard-braking event detection using concurrent sensor readings from smartphones and vehicle sensors from drivers who connect their phone to the vehicle while navigating in Google Maps. The detection model shows superior performance with a $0.83$ Area under the Precision-Recall Curve (PR-AUC), which is $3.8\times$better than a GPS speed-based heuristic model, and $166.6\times$better than an accelerometer-based heuristic model. The detected hard-braking events are strongly correlated with crashes from publicly available datasets, supporting their use as a safety surrogate. In addition, we conduct model fairness and selection bias evaluation to ensure that the safety benefits are equally shared. The developed methodology can benefit many safety applications such as identifying safety hot spots at road network level, evaluating the safety of new user interfaces, as well as using routing to improve traffic safety.  ( 2 min )
    Learning Representation from Neural Fisher Kernel with Low-rank Approximation. (arXiv:2202.01944v1 [cs.LG])
    In this paper, we study the representation of neural networks from the view of kernels. We first define the Neural Fisher Kernel (NFK), which is the Fisher Kernel applied to neural networks. We show that NFK can be computed for both supervised and unsupervised learning models, which can serve as a unified tool for representation extraction. Furthermore, we show that practical NFKs exhibit low-rank structures. We then propose an efficient algorithm that computes a low rank approximation of NFK, which scales to large datasets and networks. We show that the low-rank approximation of NFKs derived from unsupervised generative models and supervised learning models gives rise to high-quality compact representations of data, achieving competitive results on a variety of machine learning tasks.  ( 2 min )
    Knowledge Graph Based Waveform Recommendation: A New Communication Waveform Design Paradigm. (arXiv:2202.01926v1 [eess.SP])
    Traditionally, a communication waveform is designed by experts based on communication theory and their experiences on a case-by-case basis, which is usually laborious and time-consuming. In this paper, we investigate the waveform design from a novel perspective and propose a new waveform design paradigm with the knowledge graph (KG)-based intelligent recommendation system. The proposed paradigm aims to improve the design efficiency by structural characterization and representations of existing waveforms and intelligently utilizing the knowledge learned from them. To achieve this goal, we first build a communication waveform knowledge graph (CWKG) with a first-order neighbor node, for which both structured semantic knowledge and numerical parameters of a waveform are integrated by representation learning. Based on the developed CWKG, we further propose an intelligent communication waveform recommendation system (CWRS) to generate waveform candidates. In the CWRS, an improved involution1D operator, which is channel-agnostic and space-specific, is introduced according to the characteristics of KG-based waveform representation for feature extraction, and the multi-head self-attention is adopted to weigh the influence of various components for feature fusion. Meanwhile, multilayer perceptron-based collaborative filtering is used to evaluate the matching degree between the requirement and the waveform candidate. Simulation results show that the proposed CWKG-based CWRS can automatically recommend waveform candidates with high reliability.  ( 2 min )
    DeepQMLP: A Scalable Quantum-Classical Hybrid DeepNeural Network Architecture for Classification. (arXiv:2202.01899v1 [quant-ph])
    Quantum machine learning (QML) is promising for potential speedups and improvements in conventional machine learning (ML) tasks (e.g., classification/regression). The search for ideal QML models is an active research field. This includes identification of efficient classical-to-quantum data encoding scheme, construction of parametric quantum circuits (PQC) with optimal expressivity and entanglement capability, and efficient output decoding scheme to minimize the required number of measurements, to name a few. However, most of the empirical/numerical studies lack a clear path towards scalability. Any potential benefit observed in a simulated environment may diminish in practical applications due to the limitations of noisy quantum hardware (e.g., under decoherence, gate-errors, and crosstalk). We present a scalable quantum-classical hybrid deep neural network (DeepQMLP) architecture inspired by classical deep neural network architectures. In DeepQMLP, stacked shallow Quantum Neural Network (QNN) models mimic the hidden layers of a classical feed-forward multi-layer perceptron network. Each QNN layer produces a new and potentially rich representation of the input data for the next layer. This new representation can be tuned by the parameters of the circuit. Shallow QNN models experience less decoherence, gate errors, etc. which make them (and the network) more resilient to quantum noise. We present numerical studies on a variety of classification problems to show the trainability of DeepQMLP. We also show that DeepQMLP performs reasonably well on unseen data and exhibits greater resilience to noise over QNN models that use a deep quantum circuit. DeepQMLP provided up to 25.3% lower loss and 7.92% higher accuracy during inference under noise than QMLP.  ( 2 min )
    Multi-task graph neural networks for simultaneous prediction of global and atomic properties in ferromagnetic systems. (arXiv:2202.01954v1 [cond-mat.mtrl-sci])
    We introduce a multi-tasking graph convolutional neural network, HydraGNN, to simultaneously predict both global and atomic physical properties and demonstrate with ferromagnetic materials. We train HydraGNN on an open-source ab initio density functional theory (DFT) dataset for iron-platinum (FePt) with a fixed body centered tetragonal (BCT) lattice structure and fixed volume to simultaneously predict the mixing enthalpy (a global feature of the system), the atomic charge transfer, and the atomic magnetic moment across configurations that span the entire compositional range. By taking advantage of underlying physical correlations between material properties, multi-task learning (MTL) with HydraGNN provides effective training even with modest amounts of data. Moreover, this is achieved with just one architecture instead of three, as required by single-task learning (STL). The first convolutional layers of the HydraGNN architecture are shared by all learning tasks and extract features common to all material properties. The following layers discriminate the features of the different properties, the results of which are fed to the separate heads of the final layer to produce predictions. Numerical results show that HydraGNN effectively captures the relation between the configurational entropy and the material properties over the entire compositional range. Overall, the accuracy of simultaneous MTL predictions is comparable to the accuracy of the STL predictions. In addition, the computational cost of training HydraGNN for MTL is much lower than the original DFT calculations and also lower than training separate STL models for each property.  ( 2 min )
    Active metric learning and classification using similarity queries. (arXiv:2202.01953v1 [cs.LG])
    Active learning is commonly used to train label-efficient models by adaptively selecting the most informative queries. However, most active learning strategies are designed to either learn a representation of the data (e.g., embedding or metric learning) or perform well on a task (e.g., classification) on the data. However, many machine learning tasks involve a combination of both representation learning and a task-specific goal. Motivated by this, we propose a novel unified query framework that can be applied to any problem in which a key component is learning a representation of the data that reflects similarity. Our approach builds on similarity or nearest neighbor (NN) queries which seek to select samples that result in improved embeddings. The queries consist of a reference and a set of objects, with an oracle selecting the object most similar (i.e., nearest) to the reference. In order to reduce the number of solicited queries, they are chosen adaptively according to an information theoretic criterion. We demonstrate the effectiveness of the proposed strategy on two tasks -- active metric learning and active classification -- using a variety of synthetic and real world datasets. In particular, we demonstrate that actively selected NN queries outperform recently developed active triplet selection methods in a deep metric learning setting. Further, we show that in classification, actively selecting class labels can be reformulated as a process of selecting the most informative NN query, allowing direct application of our method.  ( 2 min )
    $\mathcal{F}$-EBM: Energy Based Learning of Functional Data. (arXiv:2202.01929v1 [cs.LG])
    Energy-Based Models (EBMs) have proven to be a highly effective approach for modelling densities on finite-dimensional spaces. Their ability to incorporate domain-specific choices and constraints into the structure of the model through composition make EBMs an appealing candidate for applications in physics, biology and computer vision and various other fields. In this work, we present a novel class of EBM which is able to learn distributions of functions (such as curves or surfaces) from functional samples evaluated at finitely many points. Two unique challenges arise in the functional context. Firstly, training data is often not evaluated along a fixed set of points. Secondly, steps must be taken to control the behaviour of the model between evaluation points, to mitigate overfitting. The proposed infinite-dimensional EBM employs a latent Gaussian process, which is weighted spectrally by an energy function parameterised with a neural network. The resulting EBM has the ability to utilize irregularly sampled training data and can output predictions at any resolution, providing an effective approach to up-scaling functional data. We demonstrate the efficacy of our proposed approach for modelling a range of datasets, including data collected from Standard and Poor's 500 (S\&P) and UK National grid.  ( 2 min )
    Predictive Closed-Loop Service Automation in O-RAN based Network Slicing. (arXiv:2202.01966v1 [cs.NI])
    Network slicing provides introduces customized and agile network deployment for managing different service types for various verticals under the same infrastructure. To cater to the dynamic service requirements of these verticals and meet the required quality-of-service (QoS) mentioned in the service-level agreement (SLA), network slices need to be isolated through dedicated elements and resources. Additionally, allocated resources to these slices need to be continuously monitored and intelligently managed. This enables immediate detection and correction of any SLA violation to support automated service assurance in a closed-loop fashion. By reducing human intervention, intelligent and closed-loop resource management reduces the cost of offering flexible services. Resource management in a network shared among verticals (potentially administered by different providers), would be further facilitated through open and standardized interfaces. Open radio access network (O-RAN) is perhaps the most promising RAN architecture that inherits all the aforementioned features, namely intelligence, open and standard interfaces, and closed control loop. Inspired by this, in this article we provide a closed-loop and intelligent resource provisioning scheme for O-RAN slicing to prevent SLA violations. In order to maintain realism, a real-world dataset of a large operator is used to train a learning solution for optimizing resource utilization in the proposed closed-loop service automation process. Moreover, the deployment architecture and the corresponding flow that are cognizant of the O-RAN requirements are also discussed.  ( 2 min )
    PSO-PINN: Physics-Informed Neural Networks Trained with Particle Swarm Optimization. (arXiv:2202.01943v1 [cs.LG])
    Physics-informed neural networks (PINNs) have recently emerged as a promising application of deep learning in a wide range of engineering and scientific problems based on partial differential equation models. However, evidence shows that PINN training by gradient descent displays pathologies and stiffness in gradient flow dynamics. In this paper, we propose the use of a hybrid particle swarm optimization and gradient descent approach to train PINNs. The resulting PSO-PINN algorithm not only mitigates the undesired behaviors of PINNs trained with standard gradient descent, but also presents an ensemble approach to PINN that affords the possibility of robust predictions with quantified uncertainty. Experimental results using the Poisson, advection, and Burgers equations show that PSO-PINN consistently outperforms a baseline PINN trained with Adam gradient descent.  ( 2 min )
    Learning from a Learning User for Optimal Recommendations. (arXiv:2202.01879v1 [cs.LG])
    In real-world recommendation problems, especially those with a formidably large item space, users have to gradually learn to estimate the utility of any fresh recommendations from their experience about previously consumed items. This in turn affects their interaction dynamics with the system and can invalidate previous algorithms built on the omniscient user assumption. In this paper, we formalize a model to capture such "learning users" and design an efficient system-side learning solution, coined Noise-Robust Active Ellipsoid Search (RAES), to confront the challenges brought by the non-stationary feedback from such a learning user. Interestingly, we prove that the regret of RAES deteriorates gracefully as the convergence rate of user learning becomes worse, until reaching linear regret when the user's learning fails to converge. Experiments on synthetic datasets demonstrate the strength of RAES for such a contemporaneous system-user learning problem. Our study provides a novel perspective on modeling the feedback loop in recommendation problems.  ( 2 min )
    Unsupervised Learning Based Hybrid Beamforming with Low-Resolution Phase Shifters for MU-MIMO Systems. (arXiv:2202.01946v1 [eess.SP])
    Millimeter wave (mmWave) is a key technology for fifth-generation (5G) and beyond communications. Hybrid beamforming has been proposed for large-scale antenna systems in mmWave communications. Existing hybrid beamforming designs based on infinite-resolution phase shifters (PSs) are impractical due to hardware cost and power consumption. In this paper, we propose an unsupervised-learning-based scheme to jointly design the analog precoder and combiner with low-resolution PSs for multiuser multiple-input multiple-output (MU-MIMO) systems. We transform the analog precoder and combiner design problem into a phase classification problem and propose a generic neural network architecture, termed the phase classification network (PCNet), capable of producing solutions of various PS resolutions. Simulation results demonstrate the superior sum-rate and complexity performance of the proposed scheme, as compared to state-of-the-art hybrid beamforming designs for the most commonly used low-resolution PS configurations.  ( 2 min )
    Tsetlin Machine for Solving Contextual Bandit Problems. (arXiv:2202.01914v1 [cs.LG])
    This paper introduces an interpretable contextual bandit algorithm using Tsetlin Machines, which solves complex pattern recognition tasks using propositional logic. The proposed bandit learning algorithm relies on straightforward bit manipulation, thus simplifying computation and interpretation. We then present a mechanism for performing Thompson sampling with Tsetlin Machine, given its non-parametric nature. Our empirical analysis shows that Tsetlin Machine as a base contextual bandit learner outperforms other popular base learners on eight out of nine datasets. We further analyze the interpretability of our learner, investigating how arms are selected based on propositional expressions that model the context.  ( 2 min )
    Distribution Embedding Networks for Meta-Learning with Heterogeneous Covariate Spaces. (arXiv:2202.01940v1 [stat.ML])
    We propose Distribution Embedding Networks (DEN) for classification with small data using meta-learning techniques. Unlike existing meta-learning approaches that focus on image recognition tasks and require the training and target tasks to be similar, DEN is specifically designed to be trained on a diverse set of training tasks and applied on tasks whose number and distribution of covariates differ vastly from its training tasks. Such property of DEN is enabled by its three-block architecture: a covariate transformation block followed by a distribution embedding block and then a classification block. We provide theoretical insights to show that this architecture allows the embedding and classification blocks to be fixed after pre-training on a diverse set of tasks; only the covariate transformation block with relatively few parameters needs to be updated for each new task. To facilitate the training of DEN, we also propose an approach to synthesize binary classification training tasks, and demonstrate that DEN outperforms existing methods in a number of synthetic and real tasks in numerical studies.  ( 2 min )
    A Hybrid Physics Machine Learning Approach for Macroscopic Traffic State Estimation. (arXiv:2202.01888v1 [cs.LG])
    Full-field traffic state information (i.e., flow, speed, and density) is critical for the successful operation of Intelligent Transportation Systems (ITS) on freeways. However, incomplete traffic information tends to be directly collected from traffic detectors that are insufficiently installed in most areas, which is a major obstacle to the popularization of ITS. To tackle this issue, this paper introduces an innovative traffic state estimation (TSE) framework that hybrid regression machine learning techniques (e.g., artificial neural network (ANN), random forest (RF), and support vector machine (SVM)) with a traffic physics model (e.g., second-order macroscopic traffic flow model) using limited information from traffic sensors as inputs to construct accurate and full-field estimated traffic state for freeway systems. To examine the effectiveness of the proposed TSE framework, this paper conducted empirical studies on a real-world data set collected from a stretch of I-15 freeway in Salt Lake City, Utah. Experimental results show that the proposed method has been proved to estimate full-field traffic information accurately. Hence, the proposed method could provide accurate and full-field traffic information, thus providing the basis for the popularization of ITS.  ( 2 min )
    Unified Fake News Detection using Transfer Learning of Bidirectional Encoder Representation from Transformers model. (arXiv:2202.01907v1 [cs.LG])
    Automatic detection of fake news is needed for the public as the accessibility of social media platforms has been increasing rapidly. Most of the prior models were designed and validated on individual datasets separately. But the lack of generalization in models might lead to poor performance when deployed in real-world applications since the individual datasets only cover limited subjects and sequence length over the samples. This paper attempts to develop a unified model by combining publicly available datasets to detect fake news samples effectively.  ( 2 min )
    Theoretical Exploration of Solutions of Feedforward ReLU networks. (arXiv:2202.01919v1 [cs.LG])
    This paper aims to interpret the mechanism of feedforward ReLU networks by exploring their solutions for piecewise linear functions through basic rules. The constructed solutions should be universal enough to explain the network architectures of engineering. In order for that, we borrow the methodology of theoretical physics to develop the theories. Some of the consequences of our theories include: Under geometric backgrounds, the solutions of both three-layer networks and deep-layer networks are presented, and the solution universality is ensured by several ways; We give clear and intuitive interpretations of each component of network architectures, such as the parameter-sharing mechanism for multi-output, the function of each layer, the advantage of deep layers, the redundancy of parameters, and so on. We explain three typical network architectures: the subnetwork of last three layers of convolutional networks, multi-layer feedforward networks, and the decoder of autoencoders. This paper is expected to provide a basic foundation of theories of feedforward ReLU networks for further investigations.  ( 2 min )
    Artificial Intelligence Powered Material Search Engine. (arXiv:2202.01916v1 [cond-mat.mtrl-sci])
    Many data-driven applications in material science have been made possible because of recent breakthroughs in artificial intelligence(AI). The use of AI in material engineering is becoming more viable as the number of material data such as X-Ray diffraction, various spectroscopy, and microscope data grows. In this work, we have reported a material search engine that uses the interatomic space (d value) from X-ray diffraction to provide material information. We have investigated various techniques for predicting prospective material using X-ray diffraction data. We used the Random Forest, Naive Bayes (Gaussian), and Neural Network algorithms to achieve this. These algorithms have an average accuracy of 88.50\%, 100.0\%, and 88.89\%, respectively. Finally, we combined all these techniques into an ensemble approach to make the prediction more generic. This ensemble method has a ~100\% accuracy rate. Furthermore, we are designing a graph neural network (GNN)-based architecture to improve interpretability and accuracy. Thus, we want to solve the computational and time complexity of traditional dictionary-based and metadata-based material search engines and to provide a more generic prediction.  ( 2 min )
    Time-Constrained Learning. (arXiv:2202.01913v1 [cs.LG])
    Consider a scenario in which we have a huge labeled dataset ${\cal D}$ and a limited time to train some given learner using ${\cal D}$. Since we may not be able to use the whole dataset, how should we proceed? Questions of this nature motivate the definition of the Time-Constrained Learning Task (TCL): Given a dataset ${\cal D}$ sampled from an unknown distribution $\mu$, a learner ${\cal L}$ and a time limit $T$, the goal is to obtain in at most $T$ units of time the classification model with highest possible accuracy w.r.t. to $\mu$, among those that can be built by ${\cal L}$ using the dataset ${\cal D}$. We propose TCT, an algorithm for the TCL task designed based that on principles from Machine Teaching. We present an experimental study involving 5 different Learners and 20 datasets where we show that TCT consistently outperforms two other algorithms: the first is a Teacher for black-box learners proposed in [Dasgupta et al., ICML 19] and the second is a natural adaptation of random sampling for the TCL setting. We also compare TCT with Stochastic Gradient Descent training -- our method is again consistently better. While our work is primarily practical, we also show that a stripped-down version of TCT has provable guarantees. Under reasonable assumptions, the time our algorithm takes to achieve a certain accuracy is never much bigger than the time it takes the batch teacher (which sends a single batch of examples) to achieve similar accuracy, and in some case it is almost exponentially better.  ( 2 min )
    Identifying stimulus-driven neural activity patterns in multi-patient intracranial recordings. (arXiv:2202.01933v1 [q-bio.NC])
    Identifying stimulus-driven neural activity patterns is critical for studying the neural basis of cognition. This can be particularly challenging in intracranial datasets, where electrode locations typically vary across patients. This chapter first presents an overview of the major challenges to identifying stimulus-driven neural activity patterns in the general case. Next, we will review several modality-specific considerations and approaches, along with a discussion of several issues that are particular to intracranial recordings. Against this backdrop, we will consider a variety of within-subject and across-subject approaches to identifying and modeling stimulus-driven neural activity patterns in multi-patient intracranial recordings. These approaches include generalized linear models, multivariate pattern analysis, representational similarity analysis, joint stimulus-activity models, hierarchical matrix factorization models, Gaussian process models, geometric alignment models, inter-subject correlations, and inter-subject functional correlations. Examples from the recent literature serve to illustrate the major concepts and provide the conceptual intuitions for each approach.  ( 2 min )
    Yordle: An Efficient Imitation Learning for Branch and Bound. (arXiv:2202.01896v1 [math.OC])
    Combinatorial optimization problems have aroused extensive research interests due to its huge application potential. In practice, there are highly redundant patterns and characteristics during solving the combinatorial optimization problem, which can be captured by machine learning models. Thus, the 2021 NeurIPS Machine Learning for Combinatorial Optimization (ML4CO) competition is proposed with the goal of improving state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning techniques. This work presents our solution and insights gained by team qqy in the dual task of the competition. Our solution is a highly efficient imitation learning framework for performance improvement of Branch and Bound (B&B), named Yordle. It employs a hybrid sampling method and an efficient data selection method, which not only accelerates the model training but also improves the decision quality during branching variable selection. In our experiments, Yordle greatly outperforms the baseline algorithm adopted by the competition while requiring significantly less time and amounts of data to train the decision model. Specifically, we use only 1/4 of the amount of data compared to that required for the baseline algorithm, to achieve around 50% higher score than baseline algorithm. The proposed framework Yordle won the championship of the student leaderboard.  ( 2 min )
    A Machine Learning Smartphone-based Sensing for Driver Behavior Classification. (arXiv:2202.01893v1 [cs.LG])
    Driver behavior profiling is one of the main issues in the insurance industries and fleet management, thus being able to classify the driver behavior with low-cost mobile applications remains in the spotlight of autonomous driving. However, using mobile sensors may face the challenge of security, privacy, and trust issues. To overcome those challenges, we propose to collect data sensors using Carla Simulator available in smartphones (Accelerometer, Gyroscope, GPS) in order to classify the driver behavior using speed, acceleration, direction, the 3-axis rotation angles (Yaw, Pitch, Roll) taking into account the speed limit of the current road and weather conditions to better identify the risky behavior. Secondly, after fusing inter-axial data from multiple sensors into a single file, we explore different machine learning algorithms for time series classification to evaluate which algorithm results in the highest performance.  ( 2 min )
    A Note on "Assessing Generalization of SGD via Disagreement". (arXiv:2202.01851v1 [cs.LG])
    Jiang et al. (2021) give empirical evidence that the average test error of deep neural networks can be estimated via the prediction disagreement of two separately trained networks. They also provide a theoretical explanation that this 'Generalization Disagreement Equality' follows from the well-calibrated nature of deep ensembles under the notion of a proposed 'class-aggregated calibration'. In this paper we show that the approach suggested might be impractical because a deep ensemble's calibration deteriorates under distribution shift, which is exactly when the coupling of test error and disagreement would be of practical value. We present both theoretical and experimental evidence, re-deriving the theoretical statements using a simple Bayesian perspective and show them to be straightforward and more generic: they apply to any discriminative model -- not only ensembles whose members output one-hot class predictions. The proposed calibration metrics are also equivalent to two metrics introduced by Nixon et al. (2019): 'ACE' and 'SCE'.  ( 2 min )
    Best Practices and Scoring System on Reviewing A.I. based Medical Imaging Papers: Part 1 Classification. (arXiv:2202.01863v1 [eess.IV])
    With the recent advances in A.I. methodologies and their application to medical imaging, there has been an explosion of related research programs utilizing these techniques to produce state-of-the-art classification performance. Ultimately, these research programs culminate in submission of their work for consideration in peer reviewed journals. To date, the criteria for acceptance vs. rejection is often subjective; however, reproducible science requires reproducible review. The Machine Learning Education Sub-Committee of SIIM has identified a knowledge gap and a serious need to establish guidelines for reviewing these studies. Although there have been several recent papers with this goal, this present work is written from the machine learning practitioners standpoint. In this series, the committee will address the best practices to be followed in an A.I.-based study and present the required sections in terms of examples and discussion of what should be included to make the studies cohesive, reproducible, accurate, and self-contained. This first entry in the series focuses on the task of image classification. Elements such as dataset curation, data pre-processing steps, defining an appropriate reference standard, data partitioning, model architecture and training are discussed. The sections are presented as they would be detailed in a typical manuscript, with content describing the necessary information that should be included to make sure the study is of sufficient quality to be considered for publication. The goal of this series is to provide resources to not only help improve the review process for A.I.-based medical imaging papers, but to facilitate a standard for the information that is presented within all components of the research study. We hope to provide quantitative metrics in what otherwise may be a qualitative review process.  ( 2 min )
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v1 [stat.ML])
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.  ( 2 min )
    Transport Score Climbing: Variational Inference using Forward KL and Adaptive Neural Transport. (arXiv:2202.01841v1 [stat.ML])
    Variational inference often minimizes the "reverse" Kullbeck-Leibler (KL) KL(q||p) from the approximate distribution q to the posterior p. Recent work studies the "forward" KL KL(p||q), which unlike reverse KL does not lead to variational approximations that underestimate uncertainty. This paper introduces Transport Score Climbing (TSC), a method that optimizes KL(p||q) by using Hamiltonian Monte Carlo (HMC) and a novel adaptive transport map. The transport map improves the trajectory of HMC by acting as a change of variable between the latent variable space and a warped space. TSC uses HMC samples to dynamically train the transport map while optimizing KL(p||q). TSC leverages synergies, where better transport maps lead to better HMC sampling, which then leads to better transport maps. We demonstrate TSC on synthetic and real data. We find that TSC achieves competitive performance when training variational autoencoders on large-scale data.  ( 2 min )
    AtmoDist: Self-supervised Representation Learning for Atmospheric Dynamics. (arXiv:2202.01897v1 [physics.ao-ph])
    Representation learning has proven to be a powerful methodology in a wide variety of machine learning applications. For atmospheric dynamics, however, it has so far not been considered, arguably due to the lack of large-scale, labeled datasets that could be used for training. In this work, we show that the difficulty is benign and introduce a self-supervised learning task that defines a categorical loss for a wide variety of unlabeled atmospheric datasets. Specifically, we train a neural network on the simple yet intricate task of predicting the temporal distance between atmospheric fields, e.g. the components of the wind field, from distinct but nearby times. Despite this simplicity, a neural network will provide good predictions only when it develops an internal representation that captures intrinsic aspects of atmospheric dynamics. We demonstrate this by introducing a data-driven distance metric for atmospheric states based on representations learned from ERA5 reanalysis. When employ as a loss function for downscaling, this Atmodist distance leads to downscaled fields that match the true statistics more closely than the previous state-of-the-art based on an l2-loss and whose local behavior is more realistic. Since it is derived from observational data, AtmoDist also provides a novel perspective on atmospheric predictability.  ( 2 min )
    Practical Imitation Learning in the Real World via Task Consistency Loss. (arXiv:2202.01862v1 [cs.RO])
    Recent work in visual end-to-end learning for robotics has shown the promise of imitation learning across a variety of tasks. Such approaches are expensive both because they require large amounts of real world training demonstrations and because identifying the best model to deploy in the real world requires time-consuming real-world evaluations. These challenges can be mitigated by simulation: by supplementing real world data with simulated demonstrations and using simulated evaluations to identify high performing policies. However, this introduces the well-known "reality gap" problem, where simulator inaccuracies decorrelate performance in simulation from that of reality. In this paper, we build on top of prior work in GAN-based domain adaptation and introduce the notion of a Task Consistency Loss (TCL), a self-supervised loss that encourages sim and real alignment both at the feature and action-prediction levels. We demonstrate the effectiveness of our approach by teaching a mobile manipulator to autonomously approach a door, turn the handle to open the door, and enter the room. The policy performs control from RGB and depth images and generalizes to doors not encountered in training data. We achieve 80% success across ten seen and unseen scenes using only ~16.2 hours of teleoperated demonstrations in sim and real. To the best of our knowledge, this is the first work to tackle latched door opening from a purely end-to-end learning approach, where the task of navigation and manipulation are jointly modeled by a single neural network.  ( 2 min )
    Even Simpler Deterministic Matrix Sketching. (arXiv:2202.01780v1 [cs.DS])
    This paper provides a one-line proof of Frequent Directions (FD) for sketching streams of matrices. The simpler proof arises from sketching the covariance of the stream of matrices rather than the stream itself.  ( 1 min )
    Rethinking Explainability as a Dialogue: A Practitioner's Perspective. (arXiv:2202.01875v1 [cs.LG])
    As practitioners increasingly deploy machine learning models in critical domains such as health care, finance, and policy, it becomes vital to ensure that domain experts function effectively alongside these models. Explainability is one way to bridge the gap between human decision-makers and machine learning models. However, most of the existing work on explainability focuses on one-off, static explanations like feature importances or rule lists. These sorts of explanations may not be sufficient for many use cases that require dynamic, continuous discovery from stakeholders. In the literature, few works ask decision-makers about the utility of existing explanations and other desiderata they would like to see in an explanation going forward. In this work, we address this gap and carry out a study where we interview doctors, healthcare professionals, and policymakers about their needs and desires for explanations. Our study indicates that decision-makers would strongly prefer interactive explanations in the form of natural language dialogues. Domain experts wish to treat machine learning models as "another colleague", i.e., one who can be held accountable by asking why they made a particular decision through expressive and accessible natural language interactions. Considering these needs, we outline a set of five principles researchers should follow when designing interactive explanations as a starting place for future work. Further, we show why natural language dialogues satisfy these principles and are a desirable way to build interactive explanations. Next, we provide a design of a dialogue system for explainability and discuss the risks, trade-offs, and research opportunities of building these systems. Overall, we hope our work serves as a starting place for researchers and engineers to design interactive explainability systems.  ( 2 min )
    A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian Process Bandits. (arXiv:2202.01850v1 [stat.ML])
    We consider the sequential optimization of an unknown, continuous, and expensive to evaluate reward function, from noisy and adversarially corrupted observed rewards. When the corruption attacks are subject to a suitable budget $C$ and the function lives in a Reproducing Kernel Hilbert Space (RKHS), the problem can be posed as corrupted Gaussian process (GP) bandit optimization. We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants. Our algorithm, Robust GP Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation such that its performance degrades minimally in the presence (or absence) of adversarial corruptions. When $T$ is the number of samples and $\gamma_T$ is the maximal information gain, the corruption-dependent term in our regret bound is $O(C \gamma_T^{3/2})$, which is significantly tighter than the existing $O(C \sqrt{T \gamma_T})$ for several commonly-considered kernels. We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.  ( 2 min )
    Exploiting Independent Instruments: Identification and Distribution Generalization. (arXiv:2202.01864v1 [stat.ML])
    Instrumental variable models allow us to identify a causal function between covariates X and a response Y, even in the presence of unobserved confounding. Most of the existing estimators assume that the error term in the response Y and the hidden confounders are uncorrelated with the instruments Z. This is often motivated by a graphical separation, an argument that also justifies independence. Posing an independence condition, however, leads to strictly stronger identifiability results. We connect to existing literature in econometrics and provide a practical method for exploiting independence that can be combined with any gradient-based learning procedure. We see that even in identifiable settings, taking into account higher moments may yield better finite sample results. Furthermore, we exploit the independence for distribution generalization. We prove that the proposed estimator is invariant to distributional shifts on the instruments and worst-case optimal whenever these shifts are sufficiently strong. These results hold even in the under-identified case where the instruments are not sufficiently rich to identify the causal function.  ( 2 min )
    Adversarially Robust Models may not Transfer Better: Sufficient Conditions for Domain Transferability from the View of Regularization. (arXiv:2202.01832v1 [cs.LG])
    Machine learning (ML) robustness and domain generalization are fundamentally correlated: they essentially concern data distribution shifts under adversarial and natural settings, respectively. On one hand, recent studies show that more robust (adversarially trained) models are more generalizable. On the other hand, there is a lack of theoretical understanding of their fundamental connections. In this paper, we explore the relationship between regularization and domain transferability considering different factors such as norm regularization and data augmentations (DA). We propose a general theoretical framework proving that factors involving the model function class regularization are sufficient conditions for relative domain transferability. Our analysis implies that "robustness" is neither necessary nor sufficient for transferability; rather, robustness induced by adversarial training is a by-product of such function class regularization. We then discuss popular DA protocols and show when they can be viewed as the function class regularization under certain conditions and therefore improve generalization. We conduct extensive experiments to verify our theoretical findings and show several counterexamples where robustness and generalization are negatively correlated on different datasets.  ( 2 min )
    Flexible Triggering Kernels for Hawkes Process Modeling. (arXiv:2202.01869v1 [cs.LG])
    Recently proposed encoder-decoder structures for modeling Hawkes processes use transformer-inspired architectures, which encode the history of events via embeddings and self-attention mechanisms. These models deliver better prediction and goodness-of-fit than their RNN-based counterparts. However, they often require high computational and memory complexity requirements and sometimes fail to adequately capture the triggering function of the underlying process. So motivated, we introduce an efficient and general encoding of the historical event sequence by replacing the complex (multilayered) attention structures with triggering kernels of the observed data. Noting the similarity between the triggering kernels of a point process and the attention scores, we use a triggering kernel to replace the weights used to build history representations. Our estimate for the triggering function is equipped with a sigmoid gating mechanism that captures local-in-time triggering effects that are otherwise challenging with standard decaying-over-time kernels. Further, taking both event type representations and temporal embeddings as inputs, the model learns the underlying triggering type-time kernel parameters given pairs of event types. We present experiments on synthetic and real data sets widely used by competing models, while further including a COVID-19 dataset to illustrate a scenario where longitudinal covariates are available. Results show the proposed model outperforms existing approaches while being more efficient in terms of computational complexity and yielding interpretable results via direct application of the newly introduced kernel.  ( 2 min )
    Oral cancer detection and interpretation: Deep multiple instance learning versus conventional deep single instance learning. (arXiv:2202.01783v1 [eess.IV])
    The current medical standard for setting an oral cancer (OC) diagnosis is histological examination of a tissue sample from the oral cavity. This process is time consuming and more invasive than an alternative approach of acquiring a brush sample followed by cytological analysis. Skilled cytotechnologists are able to detect changes due to malignancy, however, to introduce this approach into clinical routine is associated with challenges such as a lack of experts and labour-intensive work. To design a trustworthy OC detection system that would assist cytotechnologists, we are interested in AI-based methods that reliably can detect cancer given only per-patient labels (minimizing annotation bias), and also provide information on which cells are most relevant for the diagnosis (enabling supervision and understanding). We, therefore, perform a comparison of a conventional single instance learning (SIL) approach and a modern multiple instance learning (MIL) method suitable for OC detection and interpretation, utilizing three different neural network architectures. To facilitate systematic evaluation of the considered approaches, we introduce a synthetic PAP-QMNIST dataset, that serves as a model of OC data, while offering access to per-instance ground truth. Our study indicates that on PAP-QMNIST, the SIL performs better, on average, than the MIL approach. Performance at the bag level on real-world cytological data is similar for both methods, yet the single instance approach performs better on average. Visual examination by cytotechnologist indicates that the methods manage to identify cells which deviate from normality, including malignant cells as well as those suspicious for dysplasia. We share the code as open source at https://github.com/MIDA-group/OralCancerMILvsSIL  ( 2 min )
    Cross-Platform Difference in Facebook and Text Messages Language Use: Illustrated by Depression Diagnosis. (arXiv:2202.01802v1 [cs.CL])
    How does language differ across one's Facebook status updates vs. one's text messages (SMS)? In this study, we show how Facebook and SMS use differs in psycho-linguistic characteristics and how these differences drive downstream analyses with an illustration of depression diagnosis. We use a sample of consenting participants who shared Facebook status updates, SMS data, and answered a standard psychological depression screener. We quantify domain differences using psychologically driven lexical methods and find that language on Facebook involves more personal concerns, experiences, and content features while the language in SMS contains more informal and style features. Next, we estimate depression from both text domains, using a depression model trained on Facebook data, and find a drop in accuracy when predicting self-reported depression assessments from the SMS-based depression estimates. Finally, we evaluate a simple domain adaption correction based on words driving the cross-platform differences and applied it to the SMS-derived depression estimates, resulting in significant improvement in prediction. Our work shows the Facebook vs. SMS difference in language use and suggests the necessity of cross-domain adaption for text-based predictions.  ( 2 min )
    Weighted Random Cut Forest Algorithm for Anomaly Detections. (arXiv:2202.01891v1 [cs.LG])
    Random cut forest (RCF) algorithms have been developed for anomaly detection, particularly for the anomaly detection in time-series data. The RCF algorithm is the improved version of the isolation forest algorithm. Unlike the isolation forest algorithm, the RCF algorithm has the power of determining whether the real-time input has anomaly by inserting the input in the constructed tree network. There have been developed various RCF algorithms including Robust RCF (RRCF) with which the cutting procedure is adaptively chosen probabilistically. RRCF shows better performance compared to the isolation forest as the cutting dimension is decided based on the geometric range of the data. The overall data structure is, however, not considered in the adaptive cutting algorithm with the RRCF. In this paper, we propose a new RCF, so-called the weighted RCF (WRCF). In order to introduce the WRCF, we first introduce a new geometric measure, i.e., a \textit{density measure} which is crucial for the construction of the WRCF. We provide various mathematical properties of the density measure. The proposed WRCF also cuts the tree network adaptively, but with consideration of the denseness of the data. The proposed method is more efficient when the data is structured and achieves the desired anomaly score more rapidly than the RRCF. We provide theorems that prove our claims with numerical examples.  ( 2 min )
    Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods. (arXiv:2202.01838v1 [cs.LG])
    While SGD, which samples from the data with replacement is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is chosen deterministically, a variant called incremental gradient descent (IG), the existing convergence bounds show improvement over SGD but are worse than RR. However, these bounds do not differentiate between a good and a bad ordering and hold for the worst choice of order. Meanwhile, in some cases, choosing the right order when using IG can lead to convergence faster than RR. In this work, we quantify the effect of order on convergence speed, obtaining convergence bounds based on the chosen sequence of permutations while also recovering previous results for RR. In addition, we show benefits of using structured shuffling when various levels of abstractions (e.g. tasks, classes, augmentations, etc.) exists in the dataset in theory and in practice. Finally, relying on our measure, we develop a greedy algorithm for choosing good orders during training, achieving superior performance (by more than 14 percent in accuracy) over RR.  ( 2 min )
    Modeling unknown dynamical systems with hidden parameters. (arXiv:2202.01858v1 [stat.ML])
    We present a data-driven numerical approach for modeling unknown dynamical systems with missing/hidden parameters. The method is based on training a deep neural network (DNN) model for the unknown system using its trajectory data. A key feature is that the unknown dynamical system contains system parameters that are completely hidden, in the sense that no information about the parameters is available through either the measurement trajectory data or our prior knowledge of the system. We demonstrate that by training a DNN using the trajectory data with sufficient time history, the resulting DNN model can accurately model the unknown dynamical system. For new initial conditions associated with new, and unknown, system parameters, the DNN model can produce accurate system predictions over longer time.  ( 2 min )
    Hidden Heterogeneity: When to Choose Similarity-Based Calibration. (arXiv:2202.01840v1 [cs.LG])
    Trustworthy classifiers are essential to the adoption of machine learning predictions in many real-world settings. The predicted probability of possible outcomes can inform high-stakes decision making, particularly when assessing the expected value of alternative decisions or the risk of bad outcomes. These decisions require well calibrated probabilities, not just the correct prediction of the most likely class. Black-box classifier calibration methods can improve the reliability of a classifier's output without requiring retraining. However, these methods are unable to detect subpopulations where calibration could improve prediction accuracy. Such subpopulations are said to exhibit "hidden heterogeneity" (HH), because the original classifier did not detect them. The paper proposes a quantitative measure for HH. It also introduces two similarity-weighted calibration methods that can address HH by adapting locally to each test item: SWC weights the calibration set by similarity to the test item, and SWC-HH explicitly incorporates hidden heterogeneity to filter the calibration set. Experiments show that the improvements in calibration achieved by similarity-based calibration methods correlate with the amount of HH present and, given sufficient calibration data, generally exceed calibration achieved by global methods. HH can therefore serve as a useful diagnostic tool for identifying when local calibration methods are needed.  ( 2 min )
    Robust Audio Anomaly Detection. (arXiv:2202.01784v1 [cs.SD])
    We propose an outlier robust multivariate time series model which can be used for detecting previously unseen anomalous sounds based on noisy training data. The presented approach doesn't assume the presence of labeled anomalies in the training dataset and uses a novel deep neural network architecture to learn the temporal dynamics of the multivariate time series at multiple resolutions while being robust to contaminations in the training dataset. The temporal dynamics are modeled using recurrent layers augmented with attention mechanism. These recurrent layers are built on top of convolutional layers allowing the network to extract features at multiple resolutions. The output of the network is an outlier robust probability density function modeling the conditional probability of future samples given the time series history. State-of-the-art approaches using other multiresolution architectures are contrasted with our proposed approach. We validate our solution using publicly available machine sound datasets. We demonstrate the effectiveness of our approach in anomaly detection by comparing against several state-of-the-art models.  ( 2 min )
  • Open

    Learning from Small Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v3 [cs.LG] UPDATED)
    Motivated by the problem of learning with small sample size, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have neither previously been experimentally tested for their performance nor studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.
    SignSGD: Fault-Tolerance to Blind and Byzantine Adversaries. (arXiv:2202.02085v1 [cs.LG])
    Distributed learning has become a necessity for training ever-growing models. In a distributed setting, the task is shared among several devices. Typically, the learning process is monitored by a server. Also, some of the devices can be faulty, deliberately or not, and the usual distributed SGD algorithm cannot defend itself from omniscient adversaries. Therefore, we need to devise a fault-tolerant gradient descent algorithm. We based our article on the SignSGD algorithm, which relies on the sharing of gradients signs between the devices and the server. We provide a theoretical upper bound for the convergence rate of SignSGD to extend the results of the original paper. Our theoretical results estimate the convergence rate of SignSGD against a proportion of general adversaries, such as Byzantine adversaries. We implemented the algorithm along with Byzantine strategies in order to try to crush the learning process. Therefore, we provide empirical observations from our experiments to support our theory. Our code is available on GitHub and our experiments are reproducible by using the provided parameters.
    Graph-Coupled Oscillator Networks. (arXiv:2202.02296v1 [cs.LG])
    We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear forced and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.
    Generalized Decision Transformer for Offline Hindsight Information Matching. (arXiv:2111.10364v3 [cs.LG] UPDATED)
    How to extract as much learning signal from each trajectory data has been a key problem in reinforcement learning (RL), where sample inefficiency has posed serious challenges for practical applications. Recent works have shown that using expressive policy function approximators and conditioning on future trajectory information -- such as future states in hindsight experience replay or returns-to-go in Decision Transformer (DT) -- enables efficient learning of multi-task policies, where at times online RL is fully replaced by offline behavioral cloning, e.g. sequence modeling. We demonstrate that all these approaches are doing hindsight information matching (HIM) -- training policies that can output the rest of trajectory that matches some statistics of future state information. We present Generalized Decision Transformer (GDT) for solving any HIM problem, and show how different choices for the feature function and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of the future. For evaluating CDT and BDT, we define offline multi-task state-marginal matching (SMM) and imitation learning (IL) as two generic HIM problems, propose a Wasserstein distance loss as a metric for both, and empirically study them on MuJoCo continuous control benchmarks. CDT, which simply replaces anti-causal summation with anti-causal binning in DT, enables the first effective offline multi-task SMM algorithm that generalizes well to unseen and even synthetic multi-modal state-feature distributions. BDT, which uses an anti-causal second transformer as the aggregator, can learn to model any statistics of the future and outperforms DT variants in offline multi-task IL. Our generalized formulations from HIM and GDT greatly expand the role of powerful sequence modeling architectures in modern RL.
    Quantifying the Burden of Exploration and the Unfairness of Free Riding. (arXiv:1810.08743v5 [cs.LG] UPDATED)
    We consider the multi-armed bandit setting with a twist. Rather than having just one decision maker deciding which arm to pull in each round, we have $n$ different decision makers (agents). In the simple stochastic setting, we show that a "free-riding" agent observing another "self-reliant" agent can achieve just $O(1)$ regret, as opposed to the regret lower bound of $\Omega (\log t)$ when one decision maker is playing in isolation. This result holds whenever the self-reliant agent's strategy satisfies either one of two assumptions: (1) each arm is pulled at least $\gamma \ln t$ times in expectation for a constant $\gamma$ that we compute, or (2) the self-reliant agent achieves $o(t)$ realized regret with high probability. Both of these assumptions are satisfied by standard zero-regret algorithms. Under the second assumption, we further show that the free rider only needs to observe the number of times each arm is pulled by the self-reliant agent, and not the rewards realized. In the linear contextual setting, each arm has a distribution over parameter vectors, each agent has a context vector, and the reward realized when an agent pulls an arm is the inner product of that agent's context vector with a parameter vector sampled from the pulled arm's distribution. We show that the free rider can achieve $O(1)$ regret in this setting whenever the free rider's context is a small (in $L_2$-norm) linear combination of other agents' contexts and all other agents pull each arm $\Omega (\log t)$ times with high probability. Again, this condition on the self-reliant players is satisfied by standard zero-regret algorithms like UCB. We also prove a number of lower bounds.
    Stochastic smoothing of the top-K calibrated hinge loss for deep imbalanced classification. (arXiv:2202.02193v1 [stat.ML])
    In modern classification tasks, the number of labels is getting larger and larger, as is the size of the datasets encountered in practice. As the number of classes increases, class ambiguity and class imbalance become more and more problematic to achieve high top-1 accuracy. Meanwhile, Top-K metrics (metrics allowing K guesses) have become popular, especially for performance reporting. Yet, proposing top-K losses tailored for deep learning remains a challenge, both theoretically and practically. In this paper we introduce a stochastic top-K hinge loss inspired by recent developments on top-K calibrated losses. Our proposal is based on the smoothing of the top-K operator building on the flexible "perturbed optimizer" framework. We show that our loss function performs very well in the case of balanced datasets, while benefiting from a significantly lower computational time than the state-of-the-art top-K loss function. In addition, we propose a simple variant of our loss for the imbalanced case. Experiments on a heavy-tailed dataset show that our loss function significantly outperforms other baseline loss functions.
    Deep End-to-end Causal Inference. (arXiv:2202.02195v1 [stat.ML])
    Causal inference is essential for data-driven decision making across domains such as business engagement, medical treatment or policy making. However, research on causal discovery and inference has evolved separately, and the combination of the two domains is not trivial. In this work, we develop Deep End-to-end Causal Inference (DECI), a single flow-based method that takes in observational data and can perform both causal discovery and inference, including conditional average treatment effect (CATE) estimation. We provide a theoretical guarantee that DECI can recover the ground truth causal graph under mild assumptions. In addition, our method can handle heterogeneous, real-world, mixed-type data with missing values, allowing for both continuous and discrete treatment decisions. Moreover, the design principle of our method can generalize beyond DECI, providing a general End-to-end Causal Inference (ECI) recipe, which enables different ECI frameworks to be built using existing methods. Our results show the superior performance of DECI when compared to relevant baselines for both causal discovery and (C)ATE estimation in over a thousand experiments on both synthetic datasets and other causal machine learning benchmark datasets.
    Transport Score Climbing: Variational Inference using Forward KL and Adaptive Neural Transport. (arXiv:2202.01841v1 [stat.ML])
    Variational inference often minimizes the "reverse" Kullbeck-Leibler (KL) KL(q||p) from the approximate distribution q to the posterior p. Recent work studies the "forward" KL KL(p||q), which unlike reverse KL does not lead to variational approximations that underestimate uncertainty. This paper introduces Transport Score Climbing (TSC), a method that optimizes KL(p||q) by using Hamiltonian Monte Carlo (HMC) and a novel adaptive transport map. The transport map improves the trajectory of HMC by acting as a change of variable between the latent variable space and a warped space. TSC uses HMC samples to dynamically train the transport map while optimizing KL(p||q). TSC leverages synergies, where better transport maps lead to better HMC sampling, which then leads to better transport maps. We demonstrate TSC on synthetic and real data. We find that TSC achieves competitive performance when training variational autoencoders on large-scale data.
    Exploiting Independent Instruments: Identification and Distribution Generalization. (arXiv:2202.01864v1 [stat.ML])
    Instrumental variable models allow us to identify a causal function between covariates X and a response Y, even in the presence of unobserved confounding. Most of the existing estimators assume that the error term in the response Y and the hidden confounders are uncorrelated with the instruments Z. This is often motivated by a graphical separation, an argument that also justifies independence. Posing an independence condition, however, leads to strictly stronger identifiability results. We connect to existing literature in econometrics and provide a practical method for exploiting independence that can be combined with any gradient-based learning procedure. We see that even in identifiable settings, taking into account higher moments may yield better finite sample results. Furthermore, we exploit the independence for distribution generalization. We prove that the proposed estimator is invariant to distributional shifts on the instruments and worst-case optimal whenever these shifts are sufficiently strong. These results hold even in the under-identified case where the instruments are not sufficiently rich to identify the causal function.
    Distribution Embedding Networks for Meta-Learning with Heterogeneous Covariate Spaces. (arXiv:2202.01940v1 [stat.ML])
    We propose Distribution Embedding Networks (DEN) for classification with small data using meta-learning techniques. Unlike existing meta-learning approaches that focus on image recognition tasks and require the training and target tasks to be similar, DEN is specifically designed to be trained on a diverse set of training tasks and applied on tasks whose number and distribution of covariates differ vastly from its training tasks. Such property of DEN is enabled by its three-block architecture: a covariate transformation block followed by a distribution embedding block and then a classification block. We provide theoretical insights to show that this architecture allows the embedding and classification blocks to be fixed after pre-training on a diverse set of tasks; only the covariate transformation block with relatively few parameters needs to be updated for each new task. To facilitate the training of DEN, we also propose an approach to synthesize binary classification training tasks, and demonstrate that DEN outperforms existing methods in a number of synthetic and real tasks in numerical studies.
    Active metric learning and classification using similarity queries. (arXiv:2202.01953v1 [cs.LG])
    Active learning is commonly used to train label-efficient models by adaptively selecting the most informative queries. However, most active learning strategies are designed to either learn a representation of the data (e.g., embedding or metric learning) or perform well on a task (e.g., classification) on the data. However, many machine learning tasks involve a combination of both representation learning and a task-specific goal. Motivated by this, we propose a novel unified query framework that can be applied to any problem in which a key component is learning a representation of the data that reflects similarity. Our approach builds on similarity or nearest neighbor (NN) queries which seek to select samples that result in improved embeddings. The queries consist of a reference and a set of objects, with an oracle selecting the object most similar (i.e., nearest) to the reference. In order to reduce the number of solicited queries, they are chosen adaptively according to an information theoretic criterion. We demonstrate the effectiveness of the proposed strategy on two tasks -- active metric learning and active classification -- using a variety of synthetic and real world datasets. In particular, we demonstrate that actively selected NN queries outperform recently developed active triplet selection methods in a deep metric learning setting. Further, we show that in classification, actively selecting class labels can be reformulated as a process of selecting the most informative NN query, allowing direct application of our method.
    Functional Mixtures-of-Experts. (arXiv:2202.02249v1 [stat.ME])
    We consider the statistical analysis of heterogeneous data for clustering and prediction purposes, in situations where the observations include functions, typically time series. We extend the modeling with Mixtures-of-Experts (ME), as a framework of choice in modeling heterogeneity in data for prediction and clustering with vectorial observations, to this functional data analysis context. We first present a new family of functional ME (FME) models, in which the predictors are potentially noisy observations, from entire functions, and the data generating process of the pair predictor and the real response, is governed by a hidden discrete variable representing an unknown partition, leading to complex situations to which the standard ME framework is not adapted. Second, we provide sparse and interpretable functional representations of the FME models, thanks to Lasso-like regularizations, notably on the derivatives of the underlying functional parameters of the model, projected onto a set of continuous basis functions. We develop dedicated expectation--maximization algorithms for Lasso-like regularized maximum-likelihood parameter estimation strategies, to encourage sparse and interpretable solutions. The proposed FME models and the developed EM-Lasso algorithms are studied in simulated scenarios and in applications to two real data sets, and the obtained results demonstrate their performance in accurately capturing complex nonlinear relationships between the response and the functional predictor, and in clustering.
    Modeling unknown dynamical systems with hidden parameters. (arXiv:2202.01858v1 [stat.ML])
    We present a data-driven numerical approach for modeling unknown dynamical systems with missing/hidden parameters. The method is based on training a deep neural network (DNN) model for the unknown system using its trajectory data. A key feature is that the unknown dynamical system contains system parameters that are completely hidden, in the sense that no information about the parameters is available through either the measurement trajectory data or our prior knowledge of the system. We demonstrate that by training a DNN using the trajectory data with sufficient time history, the resulting DNN model can accurately model the unknown dynamical system. For new initial conditions associated with new, and unknown, system parameters, the DNN model can produce accurate system predictions over longer time.
    The FEDHC Bayesian network learning algorithm. (arXiv:2012.00113v5 [stat.ML] UPDATED)
    The paper proposes a new hybrid Bayesian network learning algorithm, termed Forward Early Dropping Hill Climbing (FEDHC), devised to work with either continuous or categorical variables. Specifically for the case of continuous data, a robust to outliers version of FEDHC, that can be adopted by other BN learning algorithms, is proposed. Further, the paper manifests that the only implementation of MMHC in the statistical software \textit{R}, is prohibitively expensive and a new implementation is offered. The FEDHC is tested via Monte Carlo simulations that distinctly show it is computationally efficient, and produces Bayesian networks of similar to, or of higher accuracy than MMHC and PCHC. Finally, an application of FEDHC, PCHC and MMHC algorithms to real data, from the field of economics, is demonstrated using the statistical software \textit{R}.
    Simple yet Sharp Sensitivity Analysis for Unmeasured Confounding. (arXiv:2104.13020v5 [stat.ME] UPDATED)
    We present a method for assessing the sensitivity of the true causal effect to unmeasured confounding. The method requires the analyst to set two intuitive parameters. Otherwise, the method is assumption-free. The method returns an interval that contains the true causal effect, and whose bounds are arbitrarily sharp, i.e. practically attainable. We show experimentally that our bounds can be tighter than those obtained by the method of Ding and VanderWeele (2016a) which, moreover, requires to set one more parameter than our method. Finally, we extend our method to bound the natural direct and indirect effects when there are measured mediators and unmeasured exposure-outcome confounding.
    Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation. (arXiv:2110.10461v2 [cs.LG] UPDATED)
    Machine learning training methods depend plentifully and intricately on hyperparameters, motivating automated strategies for their optimisation. Many existing algorithms restart training for each new hyperparameter choice, at considerable computational cost. Some hypergradient-based one-pass methods exist, but these either cannot be applied to arbitrary optimiser hyperparameters (such as learning rates and momenta) or take several times longer to train than their base models. We extend these existing methods to develop an approximate hypergradient-based hyperparameter optimiser which is applicable to any continuous hyperparameter appearing in a differentiable model weight update, yet requires only one training episode, with no restarts. We also provide a motivating argument for convergence to the true hypergradient, and perform tractable gradient-based optimisation of independent learning rates for each model parameter. Our method performs competitively from varied random hyperparameter initialisations on several UCI datasets and Fashion-MNIST (using a one-layer MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), in time only 2-3x greater than vanilla training.
    Generalized vec trick for fast learning of pairwise kernel models. (arXiv:2009.01054v2 [cs.LG] UPDATED)
    Pairwise learning corresponds to the supervised learning setting where the goal is to make predictions for pairs of objects. Prominent applications include predicting drug-target or protein-protein interactions, or customer-product preferences. In this work, we present a comprehensive review of pairwise kernels, that have been proposed for incorporating prior knowledge about the relationship between the objects. Specifically, we consider the standard, symmetric and anti-symmetric Kronecker product kernels, metric-learning, Cartesian, ranking, as well as linear, polynomial and Gaussian kernels. Recently, a O(nm + nq) time generalized vec trick algorithm, where n, m, and q denote the number of pairs, drugs and targets, was introduced for training kernel methods with the Kronecker product kernel. This was a significant improvement over previous O(n^2) training methods, since in most real-world applications m,q << n. In this work we show how all the reviewed kernels can be expressed as sums of Kronecker products, allowing the use of generalized vec trick for speeding up their computation. In the experiments, we demonstrate how the introduced approach allows scaling pairwise kernels to much larger data sets than previously feasible, and provide an extensive comparison of the kernels on a number of biological interaction prediction tasks.
    Structure-adaptive manifold estimation. (arXiv:1906.05014v5 [math.ST] UPDATED)
    We consider a problem of manifold estimation from noisy observations. Many manifold learning procedures locally approximate a manifold by a weighted average over a small neighborhood. However, in the presence of large noise, the assigned weights become so corrupted that the averaged estimate shows very poor performance. We suggest a structure-adaptive procedure, which simultaneously reconstructs a smooth manifold and estimates projections of the point cloud onto this manifold. The proposed approach iteratively refines the weights on each step, using the structural information obtained at previous steps. After several iterations, we obtain nearly "oracle" weights, so that the final estimates are nearly efficient even in the presence of relatively large noise. In our theoretical study, we establish tight lower and upper bounds proving asymptotic optimality of the method for manifold estimation under the Hausdorff loss, provided that the noise degrades to zero fast enough.
    Quality Assessment of Low Light Restored Images: A Subjective Study and an Unsupervised Model. (arXiv:2202.02277v1 [eess.IV])
    The quality assessment (QA) of restored low light images is an important tool for benchmarking and improving low light restoration (LLR) algorithms. While several LLR algorithms exist, the subjective perception of the restored images has been much less studied. Challenges in capturing aligned low light and well-lit image pairs and collecting a large number of human opinion scores of quality for training, warrant the design of unsupervised (or opinion unaware) no-reference (NR) QA methods. This work studies the subjective perception of low light restored images and their unsupervised NR QA. Our contributions are two-fold. We first create a dataset of restored low light images using various LLR methods, conduct a subjective QA study and benchmark the performance of existing QA methods. We then present a self-supervised contrastive learning technique to extract distortion aware features from the restored low light images. We show that these features can be effectively used to build an opinion unaware image quality analyzer. Detailed experiments reveal that our unsupervised NR QA model achieves state-of-the-art performance among all such quality measures for low light restored images.
    Manifold Fitting in Ambient Space. (arXiv:1909.13492v2 [stat.ML] UPDATED)
    Modern sample points in many applications no longer comprise real vectors in a real vector space but sample points of much more complex structures, which may be represented as points in a space with a certain underlying geometric structure, namely a manifold. Manifold learning is an emerging field for learning the underlying structure. The study of manifold learning can be split into two main branches: dimension reduction and manifold fitting. With the aim of combining statistics and geometry, we address the problem of manifold fitting in the ambient space. Inspired by the relation between the eigenvalues of the Laplace-Beltrami operator and the geometry of a manifold, we aim to find a small set of points that preserve the geometry of the underlying manifold. From this relationship, we extend the idea of subsampling to sample points in high-dimensional space and employ the Moving Least Squares (MLS) approach to approximate the underlying manifold. We analyze the two core steps in our proposed method theoretically and also provide the bounds for the MLS approach. Our simulation results and theoretical analysis demonstrate the superiority of our method in estimating the underlying manifold.
    MedGCN: Medication recommendation and lab test imputation via graph convolutional networks. (arXiv:1904.00326v3 [cs.LG] UPDATED)
    Laboratory testing and medication prescription are two of the most important routines in daily clinical practice. Developing an artificial intelligence system that can automatically make lab test imputations and medication recommendations can save costs on potentially redundant lab tests and inform physicians of a more effective prescription. We present an intelligent medical system (named MedGCN) that can automatically recommend the patients' medications based on their incomplete lab tests, and can even accurately estimate the lab values that have not been taken. In our system, we integrate the complex relations between multiple types of medical entities with their inherent features in a heterogeneous graph. Then we model the graph to learn a distributed representation for each entity in the graph based on graph convolutional networks (GCN). By the propagation of graph convolutional networks, the entity representations can incorporate multiple types of medical information that can benefit multiple medical tasks. Moreover, we introduce a cross regularization strategy to reduce overfitting for multi-task training by the interaction between the multiple tasks. In this study, we construct a graph to associate 4 types of medical entities, i.e., patients, encounters, lab tests, and medications, and applied a graph neural network to learn node embeddings for medication recommendation and lab test imputation. we validate our MedGCN model on two real-world datasets: NMEDW and MIMIC-III. The experimental results on both datasets demonstrate that our model can outperform the state-of-the-art in both tasks. We believe that our innovative system can provide a promising and reliable way to assist physicians to make medication prescriptions and to save costs on potentially redundant lab tests.
    Inherent Tradeoffs in Learning Fair Representations. (arXiv:1906.08386v6 [cs.LG] UPDATED)
    Real-world applications of machine learning tools in high-stakes domains are often regulated to be fair, in the sense that the predicted target should satisfy some quantitative notion of parity with respect to a protected attribute. However, the exact tradeoff between fairness and accuracy is not entirely clear, even for the basic paradigm of classification problems. In this paper, we characterize an inherent tradeoff between statistical parity and accuracy in the classification setting by providing a lower bound on the sum of group-wise errors of any fair classifiers. Our impossibility theorem could be interpreted as a certain uncertainty principle in fairness: if the base rates differ among groups, then any fair classifier satisfying statistical parity has to incur a large error on at least one of the groups. We further extend this result to give a lower bound on the joint error of any (approximately) fair classifiers, from the perspective of learning fair representations. To show that our lower bound is tight, assuming oracle access to Bayes (potentially unfair) classifiers, we also construct an algorithm that returns a randomized classifier that is both optimal (in terms of accuracy) and fair. Interestingly, when the protected attribute can take more than two values, an extension of this lower bound does not admit an analytic solution. Nevertheless, in this case, we show that the lower bound can be efficiently computed by solving a linear program, which we term as the TV-Barycenter problem, a barycenter problem under the TV-distance. On the upside, we prove that if the group-wise Bayes optimal classifiers are close, then learning fair representations leads to an alternative notion of fairness, known as the accuracy parity, which states that the error rates are close between groups. Finally, we also conduct experiments on real-world datasets to confirm our theoretical findings.  ( 3 min )
    Pixle: a fast and effective black-box attack based on rearranging pixels. (arXiv:2202.02236v1 [cs.LG])
    Recent research has found that neural networks are vulnerable to several types of adversarial attacks, where the input samples are modified in such a way that the model produces a wrong prediction that misclassifies the adversarial sample. In this paper we focus on black-box adversarial attacks, that can be performed without knowing the inner structure of the attacked model, nor the training procedure, and we propose a novel attack that is capable of correctly attacking a high percentage of samples by rearranging a small number of pixels within the attacked image. We demonstrate that our attack works on a large number of datasets and models, that it requires a small number of iterations, and that the distance between the original sample and the adversarial one is negligible to the human eye.  ( 2 min )
    Graph Convolution for Semi-Supervised Classification: Improved Linear Separability and Out-of-Distribution Generalization. (arXiv:2102.06966v4 [cs.LG] UPDATED)
    Recently there has been increased interest in semi-supervised classification in the presence of graphical information. A new class of learning models has emerged that relies, at its most basic level, on classifying the data after first applying a graph convolution. To understand the merits of this approach, we study the classification of a mixture of Gaussians, where the data corresponds to the node attributes of a stochastic block model. We show that graph convolution extends the regime in which the data is linearly separable by a factor of roughly $1/\sqrt{D}$, where $D$ is the expected degree of a node, as compared to the mixture model data on its own. Furthermore, we find that the linear classifier obtained by minimizing the cross-entropy loss after the graph convolution generalizes to out-of-distribution data where the unseen data can have different intra- and inter-class edge probabilities from the training data.  ( 2 min )
    A Meta-Learning Control Algorithm with Provable Finite-Time Guarantees. (arXiv:2008.13265v6 [cs.LG] UPDATED)
    In this work we provide provable regret guarantees for an online meta-learning control algorithm in an iterative control setting, where in each iteration the system to be controlled is a linear deterministic system that is different and unknown, the cost for the controller in an iteration is a general additive cost function and the control input is required to be constrained, which if violated incurs an additional cost. We prove (i) that the algorithm achieves a regret for the controller cost and constraint violation that are $O(T^{3/4})$ for an episode of duration $T$ with respect to the best policy that satisfies the control input control constraints and (ii) that the average of the regret for the controller cost and constraint violation with respect to the same policy vary as $O((1+\log(N)/N)T^{3/4})$ with the number of iterations $N$, showing that the worst regret for the learning within an iteration continuously improves with experience of more iterations.  ( 2 min )
    Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models. (arXiv:2103.04922v3 [cs.LG] UPDATED)
    Deep generative models are a class of techniques that train deep neural networks to model the distribution of training samples. Research has fragmented into various interconnected approaches, each of which make trade-offs including run-time, diversity, and architectural restrictions. In particular, this compendium covers energy-based models, variational autoencoders, generative adversarial networks, autoregressive models, normalizing flows, in addition to numerous hybrid approaches. These techniques are compared and contrasted, explaining the premises behind each and how they are interrelated, while reviewing current state-of-the-art advances and implementations.  ( 2 min )
    Identifiability of Label Noise Transition Matrix. (arXiv:2202.02016v1 [cs.LG])
    The noise transition matrix plays a central role in the problem of learning from noisy labels. Among many other reasons, a significant number of existing solutions rely on access to it. Estimating the transition matrix without using ground truth labels is a critical and challenging task. When label noise transition depends on each instance, the problem of identifying the instance-dependent noise transition matrix becomes substantially more challenging. Despite recent works proposing solutions for learning from instance-dependent noisy labels, we lack a unified understanding of when such a problem remains identifiable, and therefore learnable. This paper seeks to provide answers to a sequence of related questions: What are the primary factors that contribute to the identifiability of a noise transition matrix? Can we explain the observed empirical successes? When a problem is not identifiable, what can we do to make it so? We will relate our theoretical findings to the literature and hope to provide guidelines for developing effective solutions for battling instance-dependent label noise.  ( 2 min )
    Affine-Invariant Integrated Rank-Weighted Depth: Definition, Properties and Finite Sample Analysis. (arXiv:2106.11068v3 [stat.ML] UPDATED)
    Because it determines a center-outward ordering of observations in $\mathbb{R}^d$ with $d\geq 2$, the concept of statistical depth permits to define quantiles and ranks for multivariate data and use them for various statistical tasks (e.g. inference, hypothesis testing). Whereas many depth functions have been proposed \textit{ad-hoc} in the literature since the seminal contribution of \cite{Tukey75}, not all of them possess the properties desirable to emulate the notion of quantile function for univariate probability distributions. In this paper, we propose an extension of the \textit{integrated rank-weighted} statistical depth (IRW depth in abbreviated form) originally introduced in \cite{IRW}, modified in order to satisfy the property of \textit{affine-invariance}, fulfilling thus all the four key axioms listed in the nomenclature elaborated by \cite{ZuoS00a}. The variant we propose, referred to as the Affine-Invariant IRW depth (AI-IRW in short), involves the covariance/precision matrices of the (supposedly square integrable) $d$-dimensional random vector $X$ under study, in order to take into account the directions along which $X$ is most variable to assign a depth value to any point $x\in \mathbb{R}^d$. The accuracy of the sampling version of the AI-IRW depth is investigated from a nonasymptotic perspective. Namely, a concentration result for the statistical counterpart of the AI-IRW depth is proved. Beyond the theoretical analysis carried out, applications to anomaly detection are considered and numerical results are displayed, providing strong empirical evidence of the relevance of the depth function we propose here.  ( 2 min )
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v1 [stat.ML])
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.  ( 2 min )
    De-Sequentialized Monte Carlo: a parallel-in-time particle smoother. (arXiv:2202.02264v1 [stat.CO])
    Particle smoothers are SMC (Sequential Monte Carlo) algorithms designed to approximate the joint distribution of the states given observations from a state-space model. We propose dSMC (de-Sequentialized Monte Carlo), a new particle smoother that is able to process $T$ observations in $\mathcal{O}(\log T)$ time on parallel architecture. This compares favourably with standard particle smoothers, the complexity of which is linear in $T$. We derive $\mathcal{L}_p$ convergence results for dSMC, with an explicit upper bound, polynomial in $T$. We then discuss how to reduce the variance of the smoothing estimates computed by dSMC by (i) designing good proposal distributions for sampling the particles at the initialization of the algorithm, as well as by (ii) using lazy resampling to increase the number of particles used in dSMC. Finally, we design a particle Gibbs sampler based on dSMC, which is able to perform parameter inference in a state-space model at a $\mathcal{O}(\log(T))$ cost on parallel hardware.  ( 2 min )
    $\mathcal{F}$-EBM: Energy Based Learning of Functional Data. (arXiv:2202.01929v1 [cs.LG])
    Energy-Based Models (EBMs) have proven to be a highly effective approach for modelling densities on finite-dimensional spaces. Their ability to incorporate domain-specific choices and constraints into the structure of the model through composition make EBMs an appealing candidate for applications in physics, biology and computer vision and various other fields. In this work, we present a novel class of EBM which is able to learn distributions of functions (such as curves or surfaces) from functional samples evaluated at finitely many points. Two unique challenges arise in the functional context. Firstly, training data is often not evaluated along a fixed set of points. Secondly, steps must be taken to control the behaviour of the model between evaluation points, to mitigate overfitting. The proposed infinite-dimensional EBM employs a latent Gaussian process, which is weighted spectrally by an energy function parameterised with a neural network. The resulting EBM has the ability to utilize irregularly sampled training data and can output predictions at any resolution, providing an effective approach to up-scaling functional data. We demonstrate the efficacy of our proposed approach for modelling a range of datasets, including data collected from Standard and Poor's 500 (S\&P) and UK National grid.  ( 2 min )
    To Impute or not to Impute? -- Missing Data in Treatment Effect Estimation. (arXiv:2202.02096v1 [stat.ML])
    Missing data is a systemic problem in practical scenarios that causes noise and bias when estimating treatment effects. This makes treatment effect estimation from data with missingness a particularly tricky endeavour. A key reason for this is that standard assumptions on missingness are rendered insufficient due to the presence of an additional variable, treatment, besides the individual and the outcome. Having a treatment variable introduces additional complexity with respect to why some variables are missing that is not fully explored by previous work. In our work we identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. Given MCM, we show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. However, no imputation at all also leads to biased estimates, as missingness determined by treatment divides the population in distinct subpopulations, where estimates across these populations will be biased. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not. We empirically demonstrate how various learners benefit from selective imputation compared to other solutions for missing data.  ( 2 min )
    On Model Selection Consistency of Lasso for High-Dimensional Ising Models on Tree-like Graphs. (arXiv:2110.08500v2 [stat.ML] UPDATED)
    We consider the problem of high-dimensional Ising model selection using neighborhood-based least absolute shrinkage and selection operator (Lasso). It is rigorously proved that under some mild coherence conditions on the population covariance matrix of the Ising model, consistent model selection can be achieved with sample sizes $n=\Omega{(d^3\log{p})}$ for any tree-like graph in the paramagnetic phase, where $p$ is the number of variables and $d$ is the maximum node degree. The obtained sufficient conditions for consistent model selection with Lasso are the same in the scaling of the sample complexity as that of $\ell_1$-regularized logistic regression.  ( 2 min )
    Detecting Distributional Differences in Labeled Sequence Data with Application to Tropical Cyclone Satellite Imagery. (arXiv:2202.02253v1 [stat.AP])
    Our goal is to quantify whether, and if so how, spatio-temporal patterns in tropical cyclone (TC) satellite imagery signal an upcoming rapid intensity change event. To address this question, we propose a new nonparametric test of association between a time series of images and a series of binary event labels. We ask whether there is a difference in distribution between (dependent but identically distributed) 24-h sequences of images preceding an event versus a non-event. By rewriting the statistical test as a regression problem, we leverage neural networks to infer modes of structural evolution of TC convection that are representative of the lead-up to rapid intensity change events. Dependencies between nearby sequences are handled by a bootstrap procedure that estimates the marginal distribution of the label series. We prove that type I error control is guaranteed as long as the distribution of the label series is well-estimated, which is made easier by the extensive historical data for binary TC event labels. We show empirical evidence that our proposed method identifies archetypes of infrared imagery associated with elevated rapid intensification risk, typically marked by deep or deepening core convection over time. Such results provide a foundation for improved forecasts of rapid intensification.  ( 2 min )
    Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation. (arXiv:2111.12193v2 [cs.LG] UPDATED)
    Most set prediction models in deep learning use set-equivariant operations, but they actually operate on multisets. We show that set-equivariant functions cannot represent certain functions on multisets, so we introduce the more appropriate notion of multiset-equivariance. We identify that the existing Deep Set Prediction Network (DSPN) can be multiset-equivariant without being hindered by set-equivariance and improve it with approximate implicit differentiation, allowing for better optimization while being faster and saving memory. In a range of toy experiments, we show that the perspective of multiset-equivariance is beneficial and that our changes to DSPN achieve better results in most cases. On CLEVR object property prediction, we substantially improve over the state-of-the-art Slot Attention from 8% to 77% in one of the strictest evaluation metrics because of the benefits made possible by implicit differentiation.  ( 2 min )
    A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian Process Bandits. (arXiv:2202.01850v1 [stat.ML])
    We consider the sequential optimization of an unknown, continuous, and expensive to evaluate reward function, from noisy and adversarially corrupted observed rewards. When the corruption attacks are subject to a suitable budget $C$ and the function lives in a Reproducing Kernel Hilbert Space (RKHS), the problem can be posed as corrupted Gaussian process (GP) bandit optimization. We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants. Our algorithm, Robust GP Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation such that its performance degrades minimally in the presence (or absence) of adversarial corruptions. When $T$ is the number of samples and $\gamma_T$ is the maximal information gain, the corruption-dependent term in our regret bound is $O(C \gamma_T^{3/2})$, which is significantly tighter than the existing $O(C \sqrt{T \gamma_T})$ for several commonly-considered kernels. We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.  ( 2 min )
    Generalizing to New Physical Systems via Context-Informed Dynamics Model. (arXiv:2202.01889v1 [cs.LG])
    Data-driven approaches to modeling physical systems fail to generalize to unseen systems that share the same general dynamics with the learning domain, but correspond to different physical contexts. We propose a new framework for this key problem, context-informed dynamics adaptation (CoDA), which takes into account the distributional shift across systems for fast and efficient adaptation to new dynamics. CoDA leverages multiple environments, each associated to a different dynamic, and learns to condition the dynamics model on contextual parameters, specific to each environment. The conditioning is performed via a hypernetwork, learned jointly with a context vector from observed data. The proposed formulation constrains the search hypothesis space to foster fast adaptation and better generalization across environments. It extends the expressivity of existing methods. We theoretically motivate our approach and show state-ofthe-art generalization results on a set of nonlinear dynamics, representative of a variety of application domains. We also show, on these systems, that new system parameters can be inferred from context vectors with minimal supervision.  ( 2 min )
    Complex-to-Real Random Features for Polynomial Kernels. (arXiv:2202.02031v1 [stat.ML])
    Kernel methods are ubiquitous in statistical modeling due to their theoretical guarantees as well as their competitive empirical performance. Polynomial kernels are of particular importance as their feature maps model the interactions between the dimensions of the input data. However, the construction time of explicit feature maps scales exponentially with the polynomial degree and a naive application of the kernel trick does not scale to large datasets. In this work, we propose Complex-to-Real (CtR) random features for polynomial kernels that leverage intermediate complex random projections and can yield kernel estimates with much lower variances than their real-valued analogs. The resulting features are real-valued, simple to construct and have the following advantages over the state-of-the-art: 1) shorter construction times, 2) lower kernel approximation errors for commonly used degrees, 3) they enable us to obtain a closed-form expression for their variance.  ( 2 min )
    Generative Modeling of Complex Data. (arXiv:2202.02145v1 [cs.LG])
    In recent years, several models have improved the capacity to generate synthetic tabular datasets. However, such models focus on synthesizing simple columnar tables and are not useable on real-life data with complex structures. This paper puts forward a generic framework to synthesize more complex data structures with composite and nested types. It then proposes one practical implementation, built with causal transformers, for struct (mappings of types) and lists (repeated instances of a type). The results on standard benchmark datasets show that such implementation consistently outperforms current state-of-the-art models both in terms of machine learning utility and statistical similarity. Moreover, it shows very strong results on two complex hierarchical datasets with multiple nesting and sparse data, that were previously out of reach.  ( 2 min )
    Correcting Confounding via Random Selection of Background Variables. (arXiv:2202.02150v1 [stat.ML])
    We propose a method to distinguish causal influence from hidden confounding in the following scenario: given a target variable Y, potential causal drivers X, and a large number of background features, we propose a novel criterion for identifying causal relationship based on the stability of regression coefficients of X on Y with respect to selecting different background features. To this end, we propose a statistic V measuring the coefficient's variability. We prove, subject to a symmetry assumption for the background influence, that V converges to zero if and only if X contains no causal drivers. In experiments with simulated data, the method outperforms state of the art algorithms. Further, we report encouraging results for real-world data. Our approach aligns with the general belief that causal insights admit better generalization of statistical associations across environments, and justifies similar existing heuristic approaches from the literature.  ( 2 min )
    Net benefit, calibration, threshold selection, and training objectives for algorithmic fairness in healthcare. (arXiv:2202.01906v1 [stat.ML])
    A growing body of work uses the paradigm of algorithmic fairness to frame the development of techniques to anticipate and proactively mitigate the introduction or exacerbation of health inequities that may follow from the use of model-guided decision-making. We evaluate the interplay between measures of model performance, fairness, and the expected utility of decision-making to offer practical recommendations for the operationalization of algorithmic fairness principles for the development and evaluation of predictive models in healthcare. We conduct an empirical case-study via development of models to estimate the ten-year risk of atherosclerotic cardiovascular disease to inform statin initiation in accordance with clinical practice guidelines. We demonstrate that approaches that incorporate fairness considerations into the model training objective typically do not improve model performance or confer greater net benefit for any of the studied patient populations compared to the use of standard learning paradigms followed by threshold selection concordant with patient preferences, evidence of intervention effectiveness, and model calibration. These results hold when the measured outcomes are not subject to differential measurement error across patient populations and threshold selection is unconstrained, regardless of whether differences in model performance metrics, such as in true and false positive error rates, are present. In closing, we argue for focusing model development efforts on developing calibrated models that predict outcomes well for all patient populations while emphasizing that such efforts are complementary to transparent reporting, participatory design, and reasoning about the impact of model-informed interventions in context.  ( 2 min )

  • Open

    AI painting vegetables 🥕
    submitted by /u/notrealAI [link] [comments]
    Artist uses AI to perfectly fake 70s science fiction pulp covers – artwork and titles
    submitted by /u/magenta_placenta [link] [comments]
    Last Week in AI - Light-based AI chips, DeepMind AI for competitive coding, EU and US AI regulations, and more!
    submitted by /u/regalalgorithm [link] [comments]
    Deep Learning for Computer Vision is not just Transformers: Facebook AI and UC Berkeley Propose a Convolutional Network for the 2020s
    Looking back to the 2010s, those years were characterized by the resurgence of Neural Networks and, in particular, Convolutional Neural Networks (ConvNet). Since the introduction of AlexNet, the field has evolved at a very fast pace. ConvNets have been successful due to several built-in inductive biases, the most important of which is translation equivariance. In parallel, the Natural Language Processing (NLP) field took a very different path, with Transformers becoming the dominant architecture. Continue Reading The Paper Summary Paper: https://arxiv.org/pdf/2201.03545.pdf Github: https://github.com/facebookresearch/ConvNeXt https://preview.redd.it/5lm10r7pxfg81.png?width=1536&format=png&auto=webp&s=984e1a6cba181ed80196374222a580bea178931d submitted by /u/ai-lover [link] [comments]  ( 1 min )
    What DeepMind’s AlphaCode is and isn’t
    submitted by /u/bendee983 [link] [comments]
    Applied Machine Learning
    Hello! We’re launching the beta run of our new Applied Machine Learning class. The course is 4 weeks and designed to be the first step in one’s machine learning journey. We designed the class to bring to life real world machine learning problems, and talk about the iterative workflow of building models to SOTA. It is taught by Andrew Maas, who is currently a senior researcher at Apple and did his Machine Learning PhD under Andrew Ng. Class starts on March 21st. The course is being taught on co:rise, an innovative new edtech platform. Since this is a beta run of the class we’d be happy to share a discount code If you are willing to give feedback on the course and experience! Please let me know if you’re interested. submitted by /u/sb2nov [link] [comments]  ( 1 min )
    [Project]: AI APIs comparator with public data
    Hi everyone, We are working on a new project: a comparison of AI APIs available on the market based on public datasets. We present some basic metrics to facilitate the comparison. This can be useful for developers who want to integrate these kind of APIs or students and researchers who develop their own models and would like to compare themselves to the market providers (Google, Microsoft, etc.). We have started with OCR (Optical Character Recognition) but will soon do it for other technologies: https://compare.edenai.run/vision/ocr What do you think? Is there anything we haven't thought of? Thanks for your feedback, PS: The site is not responsive and is not yet optimized for mobile at this time submitted by /u/tah_zem [link] [comments]  ( 1 min )
    How to find collaborator.
    I’m a physician looking for a coder /data scientist to design programs to improve health care delivery. Where do I go to find such people? submitted by /u/aaronsmith101 [link] [comments]  ( 1 min )
    AI Acceleration in Pharmaceutical Industry
    The global Pharmaceutical Industry is facing one of its biggest challenges in the world today, the global pandemic. The need for more effective and economical drugs is on a rapid rise. The pharma companies using artificial intelligence are vital as, during the COVID19 pandemic, the race to discover effective vaccines has led to thriving on innovation. Pharma tech also faces challenges in determining patient needs via data mining, conducting comprehensive medical research, and enhancing global product distribution. The deployment of artificial intelligence in pharmacy has redefined how the biotech space can achieve its phenomenal breakthroughs with the elevated use of automated algorithms. AI tech-based process enhancement in Pharmaceutical Company’s comes together with crucial demand for medicine in their captive markets. Since AI-driven big data analysis enables better strategic decision-making at the executive level, we can now potentially keep every Pharmacy and hospital consistently stocked with life-saving medicines. It delivers value and reliability to patients and can also immensely help humanitarian missions and international agencies like the WHO in economically deficit regions around the globe. The recent surge in the Pharmaceutical Industry adapting to deep learning algorithms will help Pharmaceutical Company grow its IPs and production facility assets steadily. The medical drug development companies will invest in AI in discovering drugs for chronic kidney, diabetes, cancer, idiopathic pulmonary fibrosis, and oncology diseases. By anticipating demand, adjusting pricing, facilitating collaborations, and enhancing business negotiations, AI can transform the Pharma landscape worldwide in just a few years. submitted by /u/futureanalytica [link] [comments]  ( 1 min )
    Artificial Nightmares : James Webb Telescope Images 01 || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    Generative Adversarial Imitation Learning (GAIL) vs Generative Adversarial Network (GAN)
    Hello I am new on deep learning and i am working on a prediction problem. after I have seen many algorithms I don´t see the difference between the two algorithms: Generative Adversarial Imitation Learning (GAIL) and Generative Adversarial Network (GAN). Generative Adversarial Imitation Learning (GAIL) is an Inverse Reinforcement Learning algorithm, the goal is to make the agent learn the expert policy from the state-action trajectory (=the training dataset we have (features + label) in the prediction problem) Could someone clarify a litte bit this confusion plz? because in both algos : Real data distribution – Occupancy measure of the expert policy Fake data distribution – Occupancy measure of the agent policy Real data – State-action pair generated by the expert policy Fake data – State-action pair generated by the agent policy submitted by /u/Adventurous-Car-2927 [link] [comments]  ( 1 min )
    Network update in CLDE PPO
    Hello, I am implementing a Centralized Learning Decentralized Execution (CLDE) version of PPO. I am just wondering, how do I use the individual actor experiences to update the actor and critic networks? Say I update the networks every 10k steps, do I compute individual actor losses for each agent and sum them up before using them in the PPO loss, or do I perform a full update on the network using each agent's experience individually (that would be N=NumOfAgents updates every 10k steps)? submitted by /u/AhmedNizam_ [link] [comments]  ( 1 min )
    Video on multi-step temporal difference learning methods
    submitted by /u/delayed_reward [link] [comments]
    Tips on getting results in any papers
    Hey guys 👋 I've recently started self-studying reinforcement learning (4 months) and having some problems with achieving desired results from papers. For example, I am confident implementing DQN and its variants by just looking at the psuedocode shown in the paper with a little guidance from other internet sources but I have zero idea on training the algorithms. (I mean I can run it but have no intuition about setting the right parameters e.g num_batches, learning rate etc) Is there any good sources that focus on training the model, like detailed steps or personal tips on getting the desired results? Any help on this would be really great. submitted by /u/History_One [link] [comments]  ( 1 min )
    Why do we use normal distributions for learning continous action spaces?
    In the "Reinforcement Learning" by Richard Sutton, the normal distribution is presented as the go to choice for doing it, but it is not really explained why. Is it because of the mathematical properties of the Gaussian distribution or are there other reasons? submitted by /u/luzariuSsuckSs [link] [comments]  ( 3 min )
    ISAAC GYM: How to teach the Franka robot to reach the goal sphere?
    Hi all, ​ I am trying to train the Franka so that it can autonomously move towards a shape that I randomly place in the environment. However, as it was not able to learn it at all, I simplified the task so that the goal shape is not even randomly placed, but fixed at one position (see the attached video): ​ https://drive.google.com/file/d/1-K-k1L-WtRyPMcL6abfFGaPKvetfnouy/view?usp=sharing ​ This can also be seen in the training log. One more weird thing is that the reward suddenly collapses after a certain amount of training steps. (The attached video however is from the state reward=450, before the collapse) ​ https://preview.redd.it/l8tymqifbdg81.png?width=1565&format=png&auto=webp&s=e87465da667b1a9719d178ccf49933048e1f77c0 ​ I hope to find some insights from you guys why this task can not be learned at all... The full code can be found: https://drive.google.com/file/d/1ZMqvbchaNYL8gbpAKxfa270bQnmT0IDz/view?usp=sharing ​ submitted by /u/Fun-Moose-3841 [link] [comments]  ( 1 min )
    Which algorithm has the shortest training time in environment breakout?
    Hi, breakout training time in normal one gpu server is too long in my points of view(about 2~3 days). So I just want to reduce the time drastically. So what is the best way? DrQ-like agent can be a good solution? or rainbow DQN can be? what do you think? Very Thanks. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 2 min )
  • Open

    [Discussion] Comparison of anchor box generation methods for object detection
    Hello! My question is regarding the generation of anchor boxes. I have come across three approaches in different papers: Handcrafting based on human intuition/expertise. K-means clustering from labelled data to generate multi scale and ratio boxes. e.g. YoLo. MetaAnchor - The generation of anchor boxes is assigned to a function with learnable parameters. Paper. Does anybody have any experience comparing the three methods? Any suggestions would be very helpful. I am currently experimenting with method 2 and 3 and will post back my experience once I have the results. Thank you for reading. Peace. submitted by /u/deaf_frog_jumps [link] [comments]  ( 1 min )
    [P] Deep Learning for time series forecasting (neuralforecast, python package)
    Hi everyone, today we released the first version of our deep learning library for time series forecasting. Please check it out and give us a star if you like it. We are actively looking for OS contributors and are also happy to help anyone put it into production for their specific use cases. https://github.com/Nixtla/neuralforecast/ A usage example gif :) : https://giphy.com/gifs/python-deep-learning-neuralforecast-mCdyxYtLLTNCQ0WhkP submitted by /u/fedegarzar [link] [comments]  ( 1 min )
    [N] Camera calibration challenge at CVSports (CVPR 22)
    You are a big soccer fan ? Interested in computer vision, in deep learning ? This challenge is for you ! EVS Broadcast Equipment is sponsoring a camera calibration challenge at the CVSports workshop of CVPR 2022. Find the best algorithm to give the most precise camera parameters and win the prize of 1.000$ ! Start from our github : https://github.com/SoccerNet/sn-calibration to see how to extract the dataset, try the baseline algorithm and then develop your own solution. Then you can publish your results on our evaluation server https://eval.ai/web/challenges/challenge-page/1537/overview to enter the public leaderboard ! If you're not into camera calibration, there are other challenges organized on the same dataset, learn more about it on the website of SoccerNet : https://www.soccer-net.org/ Cheers ! submitted by /u/fmagera [link] [comments]  ( 1 min )
    [D] Senior Ph.D. salaries in London
    Hello, I have a Ph.D. in Machine Learning with 10 years of industry research experience. I am considering relocating to London soon. Can someone shed light on the salary ranges in FAANG companies in London for senior ML research roles? In the U.S., these roles typically pay $400k+ in total compensation, but I'm not familiar with London pay standards. Cheers. submitted by /u/Independent-Rich-810 [link] [comments]  ( 1 min )
    [N] We started a talk series on ML for drug discovery
    Hi all! A few weeks ago we've started a freely accessible talk series on ML for drug discovery and molecular modeling. Molecular data is one of those domains for which ML does not quite work yet, but for which it holds a big promise nonetheless. There has been lots of interesting interaction during the first few talks, even more so because professors such as Yoshua Bengio have been consistently attending. Give our website a look for more information! The next talk is tomorrow (Tuesday Feb 8) at 10:00 EST on meta-learning. ~~ For a bit more context: I'm writing my Msc. thesis on drug discovery at a company called Valence which is closely integrated with Mila in Montreal. Drug discovery is a multi-disciplinary process that builds upon state-of-the-art knowledge from fields such as biology,…  ( 2 min )
    [D] Why doesn’t your team use an experiment tracking tool?
    Hi all, I was wondering – does everybody out there use some sort of experiment tracking/model versioning tool or not? In my team, we looked into using one a while ago, as sharing info on training runs (and the models themselves) started to be tedious. However, having reviewed a few of them, we decided to stick with our workflow of tensorboard + sharing zipped output directories + sometimes summarizing stuff in sheets etc. The reasons were mainly (not in any particular order / varying across different tools): the tools being rather expensive, not coping that well with our extensive configuration files, and often being too complex for our use case, resulting in them not adding that much value. Hence, I am curious what the situation is elsewhere -- if you don't use an experiment tracker and have considered one, why did you decide against it? If you do use one, what are the best things your team is getting out of it? What convinced you to start using it? submitted by /u/k_kristian [link] [comments]  ( 6 min )
    [D] What are the most successful models in kaggle competitions?
    I often still hear a lot about RandomForests/XGboost still being the go-to models to win a haggle competition. Most of this is from 3-4 year old lecture videos though. So I was wondering, even with all the progress the different DL domains have made, is this still the case? Do deep learning models mostly do really well in "lab-environments" but not in more real word scenarios which I guess kaggle competitions are closer too? submitted by /u/Sbsch95 [link] [comments]  ( 3 min )
    [D] Has anyone tried to speed up large language model training by initializing the embeddings with static word embeddings?
    So large language models require large compute to train from scratch. I'm curious if anyone has tried to see if initializing the embeddings in a LLM with embeddings from a static word embedding model like word2vec or glove trained with the same tokenizer would speed up training. I couldn't find anything like this by googling and was surprised. submitted by /u/WigglyHypersurface [link] [comments]  ( 1 min )
    [p]I like to share our NEW Stylegan3 DATASETS and MODELS with you.
    check this out: https://github.com/edstoica/lucid_stylegan3_datasets_models/blob/main/README.md (our growing collection of models and datasets) Examples: https://preview.redd.it/tzc0mvkhicg81.jpg?width=1064&format=pjpg&auto=webp&s=15ed6b356ebbd8f86ebd67329af7442c0cd03f00 https://preview.redd.it/7uwlg9lhicg81.jpg?width=1064&format=pjpg&auto=webp&s=cabd2529fb9a813ea4bc860d85d46db920d2fb10 https://preview.redd.it/h845d6lhicg81.png?width=1064&format=png&auto=webp&s=67920cb3d02bd3850ff847a6687b51f4d6d40480 https://preview.redd.it/lbm3w6lhicg81.jpg?width=2088&format=pjpg&auto=webp&s=75e3dccac86b1ce823c9dd953340870f7584de2e https://preview.redd.it/cl50mclhicg81.jpg?width=2088&format=pjpg&auto=webp&s=36f282fd563c7c9ab7c7d924b0af565c1fe3f8cb https://preview.redd.it/9cwms5lhicg81.jpg?width=1064&format=pjpg&auto=webp&s=0bde0fd70b182c2705d55686f454c86a8b227b9e https://preview.redd.it/2q2tt5lhicg81.jpg?width=1064&format=pjpg&auto=webp&s=3f08f92c1e84ebd008aaac66b9aeddc970c8907f submitted by /u/lucid_layers [link] [comments]
    [P]Change few lines of code, enjoy your high tea, and you've got a pre-training ViT-B-32 :P
    Hi, guys. Our team develop an integrated large-scale deep learning system, Colossal-AI. We didn't really expect that, with our efficient parallelization techniques, it takes only half an hour to pre-train a ViT-Base/32. It's now open-source on Github. https://github.com/hpcaitech/ColossalAI. We'd love your advice! It is a system that sits between classical DL frameworks(like Pytorch and Tensorflow) and the underlying hardware models are trained on. It offers the following features: Automatic infinite-dimensional parallelism Optimizer Library Adaptive task scheduling Elimination of redundant memory We also expose simple APIs to quick start accelerating your AI applications, without worrying about parallel programming details. Go and make a try of it. You may even do it on one GPU. This is our paper link: https://arxiv.org/abs/2110.14883 https://preview.redd.it/anknv6or2cg81.png?width=1281&format=png&auto=webp&s=88ae129f301f3c43aa1aeaac560cece4c8026651 submitted by /u/HPCAI-Tech [link] [comments]  ( 1 min )
    [P] Built a platform to do ML with JavaScript
    submitted by /u/northwestredditor [link] [comments]  ( 4 min )
    Where is the field going [D]
    I have some thoughts I would like to share. Machine learning is about fitting a function to data in order to predict its value for unseen input. There are many functions that can fit given data. The one we want is the one that can generalise best. We can evaluate generalisation ability to some extent on test data. However, the true function we fundamentally cannot know. It could be any one of many functions that fit the data we've got, both train and test. The vision of the machine learning community is developing learning algorithms that can generalise well from data to unseen examples. I'm questioning whether what we're trying to achieve is fundamentally possible, and I don't see how it could be. Of course you can extend Occam's razor and say the true function will be the smoothest. And I believe much success can be attributed to the fact you can make this extension in some cases. So is learning the best smooth function the holy grail and that's what we're working on perfecting? How are we ever going to tackle general problems i.e. ones where we don't assume Occam's razor? submitted by /u/lemlo100 [link] [comments]  ( 1 min )
  • Open

    Improve high-value research with Hugging Face and Amazon SageMaker asynchronous inference endpoints
    Many of our AWS customers provide research, analytics, and business intelligence as a service. This type of research and business intelligence enables their end customers to stay ahead of markets and competitors, identify growth opportunities, and address issues proactively. For example, some of our financial services sector customers do research for equities, hedge funds, and […]  ( 9 min )
  • Open

    Almost finished neural network. I hope you will like it.
    submitted by /u/vlad_ma [link] [comments]  ( 1 min )
    Asking for advice
    Hi everyone, First of all I ask you to excuse my potential stupidity for I am but an electrical engineer and a rookie in ML. For my thesis I am trying to build a NN that predicts the transformer tap changers in a network that has 20-30 transformers. Each transformer tap can operate at 5 positions, but only one(sometimes two) are the correct ones. My approach is to build a non exclusive multi-class classification NN with 5 neurons(outputs). Each of them showing the probability of its corresponding tap position to be the correct one (inspired by handwritten number recognition algorithm). However, from the engineering point of view, it is not correct to regulate the tap positions of only one of the transformers, in a feeder(certain zone of the network), you should do it for all of them. If per say I have 10 transformers (5 tap positions- possible integer labels), would I be able to simply use 50 neurons in the output, similar to the case of a single transformer? Would my algorithm perform well or it does not make any sense? The point is that the tap positions of transformers affect each other. Thank you very much submitted by /u/penlarroko [link] [comments]  ( 1 min )
    I wrote some code using Finetuner to improve the accuracy of ResNet model for image similarity detection. It displays more accurate results now. (Colab link in comments)
    ​ Before Finetuner After Finetuner submitted by /u/positiveCAPTCHAtest [link] [comments]  ( 1 min )
  • Open

    Softphone Technology in the pandemic times
    We are quickly discovering that inter-office phone calls and emails are not the best way to stay connected in an environment that’s becoming increasingly agile and remote. According to a Gartner, Inc. survey of 229 HR leaders conducted on April 2, 2020, nearly 50% of employers report 81% or more of their employees are working… Read More »Softphone Technology in the pandemic times The post Softphone Technology in the pandemic times appeared first on Data Science Central.  ( 3 min )
    How ETL Validation Scripts Automation Improves Data Validation
    Data teams struggle to clean and validate incoming data streams using ETLvalidation scripts, which can be costly, time-consuming, and difficult to scale. This will only get worse as enterprises contend with the projected 3x in data growth over the next five years. In order to have control over data operations and ensure effective cleansing and… Read More »How ETL Validation Scripts Automation Improves Data Validation The post How ETL Validation Scripts Automation Improves Data Validation appeared first on Data Science Central.  ( 5 min )
    How to write SQL queries for azure functions
    As we all know, azure functions only support C# and javascript languages for writing codes. So if we want to write SQL queries in our function app, how will we do that? Well here is the answer: Use SQL cmd Utility! Yes, this utility can be used inside any of your functions which you would… Read More »How to write SQL queries for azure functions The post How to write SQL queries for azure functions appeared first on Data Science Central.  ( 4 min )
    The Metaverse May Never Arrive, And Here’s Why
    The metaverse isn’t here yet, mostly because of scarce resources. Issues like lack of storage and bulky headsets are stymying progress. Shortages of precious metals is creating ethical issues and environmental concerns. Metaverse has become the new buzzword, yet it doesn’t exist yet. Progress towards the digital utopia is slow, and with good reason: lack… Read More »The Metaverse May Never Arrive, And Here’s Why The post The Metaverse May Never Arrive, And Here’s Why appeared first on Data Science Central.  ( 4 min )
    How to create an AI framework based on ethical values
    We all recognize the value of AI ethics but how do we implement AI ethics? The challenge is to create a solution that conforms to the ethical standards but also should be implementable and scalable Most importantly, the framework should start with values first and then point a way to implement the code The Understanding… Read More »How to create an AI framework based on ethical values The post How to create an AI framework based on ethical values appeared first on Data Science Central.  ( 3 min )
  • Open

    Self-orthogonal vectors and coding
    One of the surprising things about linear algebra over a finite field is that a non-zero vector can be orthogonal to itself. When you take the inner product of a real vector with itself, you get a sum of squares of real numbers. If any element in the sum is positive, the whole sum is […] Self-orthogonal vectors and coding first appeared on John D. Cook.  ( 3 min )
    Ternary Golay code in Python
    Marcel Golay discovered two “perfect” error-correcting codes: one binary and one ternary. These two codes stick out in the classification of perfect codes [1]. The ternary code is a linear code over GF(3), the field with three elements. You can encode a list of 5 base-three digits by multiplying the list as a row vector […] Ternary Golay code in Python first appeared on John D. Cook.  ( 2 min )
  • Open

    Top 10 AI Solutions Use Cases in the Healthcare Industry
    From 2019 to 2021, AI services in the healthcare market grew at a rate of 167.1%. The global market size of the AI healthcare market stood…  ( 4 min )

  • Open

    [P] I made a tool for finding the original sources of information on the web called Deepcite! It uses Spacy to check for sentence similarity and records user submitted labels.
    submitted by /u/fippy24 [link] [comments]  ( 1 min )
    [D] Pipeline for reproducible research in industry
    I'm working in a small company where ML is the core product. We do lots of experiments/research to try improve our current product however one thing I noticed it's hard to reproduce experiments/research done by other colleagues. Let's say I try out a new approach on a particular task and get some improvements on the currently deployed models. Then 6 months from now those results seems outdated (new papers/shift in business paradigms/...). What are the best practices on setting up a framework/pipeline to make experiments easy to track/reproduce? I was thinking about using MLFlow + DVC + Git.. Has anyone experienced this sort of issues? if so, how did you solve them? Thanks! submitted by /u/eatpasta_runfastah [link] [comments]  ( 1 min )
    [P] Minimal PyTorch implementation of NeRF. Full model implementation and training code in 320 lines.
    submitted by /u/michaelaalcorn [link] [comments]  ( 1 min )
    [Research] Querying Natural Language on Tabular Data
    https://arxiv.org/abs/2202.00454 ​ Question Answering in Tabular data have been tackled by deep learning models like TAPAS, however, it can be a pain when you run into memory issues and also lack of customizability of the results.The paper presents a novel approach to solving the issue of querying natural language on tabular data, with the help of any pre-trained model. The link to the Github repo is also given in the paper. submitted by /u/metalvendetta [link] [comments]  ( 1 min )
    [D] On the Soap Bubble mass effect of high-dimentional Gaussians
    Many times I have tried to mentally picture the counterintuitive concetration of mass of a Gaussian around a hypersphere, only to fall in the trap of mentally projecting down to 2D/3D and getting back a pancake-like shape. Until I though about "folding" the space - by rotating every point to my 2D plane of reference — or equivalently — "approaching" the Soap Bubble from all directions. The resulting distribution - basically a 2D Gaussian with a hole in the middle - was simple enough to describe analytically, but it's not a probability density I've ever seen in the ML wild. I'm not one to plug tweets (sorry if this isn't kosher), but this one got some attention, some found it inspiring, so I thought folks in /r/ML might be interested too. Here's the original post: https://twitter.com/alkalait/status/1490065507566895107 submitted by /u/alkalait [link] [comments]  ( 1 min )
    [D] What do you feel about outstanding reviewers receiving "preferential treatment" for their own paper submission?
    With all the complaints on review quality, one cause is that there isn't really a strong incentive for reviewers to do a great job, especially if they are bombarded with 6 papers to review within a 2 month period, and this repeats 3 times a year. (I think) most people believe that it is somewhat wrong to pay reviewers for their reviews, but obviously the honor of "outstanding reviewers" is too little. What do you think about the idea of giving outstanding reviewers "preferential treatment" for their own paper submissions? Specifically, every conference has many borderline cases, which are resolved in quite ad hoc and random manners. What if we allocate them more heavily toward papers whose authors have been "outstanding reviewers" (by some selection mechanism) in the recent past (e.g. fo…  ( 2 min )
    [D] Using NLP to solve Wordle
    I was watching 3Blue1Brown's video on The Mathematically Optimal Wordle Strategy wherein they come up with a statistical approach based on Shannon's Entropy to find an effective strategy to solve a wordle puzzle. This got me thinking, How could Deep Learning models like BERT or GPT be used to solve a puzzle like this. I am aware that deep learning models might not be the right tool for something like this and there are far more computationally efficient methods out there, but for the sake of discussion how could these very large models be used for something like this. ​ Thank You! submitted by /u/supreethrao [link] [comments]  ( 1 min )
    [Discussion] Implementing late fusion without cross validation
    I am trying to implement Late fusion technique where a classifier is trained for each modality in dataset and a meta classifier is trained on predictions of base classifier. Is the strategy shown in the image below optimal for this purpose? The dataset I have is huge and cross-validation would be very time consuming. If I get rid of meta-classifier and simply average prediction probability from base classifier, can I simply use original split to train base classifiers? Also, I used the performed rest of experiments on the original splits. Can the results of late fusion be compared with them given that both are tested on same validation test? ​ https://preview.redd.it/ezahpef6h8g81.png?width=2444&format=png&auto=webp&s=cc0af889632f6818efdcc17975cb48302c83973c submitted by /u/eagleandwolf [link] [comments]  ( 1 min )
    [D] Improving precision in an unbalanced dataset (Keras DNN)
    Hi all, So say I have a binary classification problem with an unbalanced dataset, where the positive class is about 1/4 as prevalent as the negative class. I'm using keras, about 500 features, training set has 2.8M neg examples and about 700k pos examples. Mainly looking at F1 and precision as my metrics. For training, I use under-sampling to balance the dataset out, so that it becomes about 1.4M examples of balanced data. The validation (and testing) sets I obviously keep unbalanced to represent the true distribution of the data. During training, I get something like 60% precision on the training data, so 10% over chance (it's a noisy/difficult real world dataset/problem). For validation/testing, I get in the range of 30-40% precision, depending on what I set my thresholds to. I should note that the disparity is not due to overfitting, but due to the data distribution, since this disparity can be seen even from the 1st epoch - e.g. https://ibb.co/2nghqTm Other than just to keep parameter searching, gathering more more training data, and possibly engineering some more features, do you have any other advice for how I could further improve the performance of my model? I've tried using class weights instead of sampling in the past and generally find that sampling to balance the training set works better for datasets similar to what I have currently. I've not actually tried over-sampling to balance...just seems like adding extra processing time during training (would be 6M examples vs 1.4M) and I can't see it bumping my precision up any significant amount. Any other ideas? submitted by /u/Zman420 [link] [comments]  ( 2 min )
    [P] AlphaCode Explained
    Short 10 minute video giving an overview of how the recent AlphaCode system from DeepMind works. https://youtu.be/YjsoN5aJChA ​ Blog: https://www.deepmind.com/blog/article/Competitive-programming-with-AlphaCode Paper: https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf Dataset: https://github.com/deepmind/code_contests CodeForces website: https://codeforces.com/ submitted by /u/Tea_Pearce [link] [comments]  ( 1 min )
    [D] Unsupervised Brain Models - How does Deep Learning inform Neuroscience? (Interview w/ Patrick Mineault)
    https://youtu.be/vfBAUYpMCTU Originally, Deep Learning sprang into existence inspired by how the brain processes information, but the two fields have diverged ever since. However, given that deep models can solve many perception tasks with remarkable accuracy, is it possible that we might be able to learn something about how the brain works by inspecting our models? I speak to Patrick Mineault about his blog post "2021 in review: unsupervised brain models" and we explore why neuroscientists are taking interest in unsupervised and self-supervised deep neural networks in order to explain how the brain works. We discuss a series of influential papers that have appeared last year, and we go into the more general questions of connecting neuroscience and machine learning. ​ OUTLINE: 0:00 - Intro & Overview 6:35 - Start of Interview 10:30 - Visual processing in the brain 12:50 - How does deep learning inform neuroscience? 21:15 - Unsupervised training explains the ventral stream 30:50 - Predicting own motion parameters explains the dorsal stream 42:20 - Why are there two different visual streams? 49:45 - Concept cells and representation learning 56:20 - Challenging the manifold theory 1:08:30 - What are current questions in the field? 1:13:40 - Should the brain inform deep learning? 1:18:50 - Neuromatch Academy and other endeavours ​ Blog Post: https://xcorr.net/2021/12/31/2021-in-review-unsupervised-brain-models/ Patrick's Blog: https://xcorr.net/ Twitter: https://twitter.com/patrickmineault Neuromatch Academy: https://academy.neuromatch.io/ submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [D] Are the KKT Conditions still "Relevant" in Modern Machine Learning?
    It seems like when the KKT Conditions were first developed (https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions), these were very useful for determining whether the solution to a optimization problem was optimal or not. However, it seems like nowadays, we hear about the KKT Conditions less - for instance, when reading about all the cool things that Deep Neural Networks are doing (e.g. AlphaGO, Protein Folding, Self Driving Cars) , the KKT Conditions never seem to be mentioned that much. I was wondering if someone could comment on the following: Do the KKT Conditions have as much importance in Machine Learning now as they did when they were first developed? In general, are the KKT Conditions important in Machine Learning? Thanks! submitted by /u/jj4646 [link] [comments]  ( 2 min )
    [D][R] How to get/extend an idea from top-tier ML paper
    As new PhD student I am Struggling to extract an idea from top tier ML/vision papers even after reading roundabout 200 papers. People often suggest to focus on conclusion/limitations section but I am noticing that most of the paper's limitation section are : 1) VAEs generate blurry output due to MSE loss, one may come up with a new loss function. Everyone knows is it a worth mentioning as future direction? 2) GANs training is difficult to converge due to adversarial training nature. Exploring a different training dynamics which yields stable training would be an interesting future direction. Again everyone knows but is it feasible for a PhD student ?. Please share your suggestions especially those have have published a top tier paper in summer internships, I am just curious how they make it in a short period of time?. thanks! submitted by /u/Lunch_More [link] [comments]  ( 5 min )
    [P] AI learns to BHOP in Dust 2
    submitted by /u/Gumbo64 [link] [comments]
    [D] Can Stochastic Gradient Descent Converge on Non-Convex Functions?
    As we know, there has been a lot of work and research done to demonstrate that the Gradient Descent Algorithm can converge on (deterministic) convex, differentiable and Lipschitz Continuous functions : ​ https://preview.redd.it/m2n4fcrgm4g81.png?width=1199&format=png&auto=webp&s=bcbbd0472b88bff9c6e49ac98ba124185d1a7469 However, I am interested in learning about to what extent convergence of Gradient Descent Based Algorithms (e.g. Stochastic Gradient Descent) has been studied for (non-deterministic) Non-Convex Functions. For instance, in Machine Learning applications with Neural Networks in the real world - Loss Functions almost always tend to be Non-Convex. Seeing as Non-Convex Functions usually have Saddle Points (i.e. point where the first derivatives of the Loss Function is 0), these…  ( 7 min )
  • Open

    Merging multiple images into one - trying to find tools
    Hello guys, I'm looking for places where I can find information regarding merging content of multiple (5/50/200) images into one. I can't even really find a specific name for it. The closest thing I've found is this paper https://arxiv.org/abs/1508.06576 and it's implementation with examples: https://github.com/jcjohnson/neural-style although they are using one image to transform the other one - I would like to merge many of them, not setting one as a primary - just trying to get the best results changing the number of images, its contents and the scales. I don't have much knowledge in the field - only a few hours having fun with other people's work using Google colab and doing things like deepfakes, but I'm looking for papers, YT videos, literally anything which would help me get into the field. Thank you all in advance! submitted by /u/iSunOfTheBeach [link] [comments]  ( 1 min )
    I'm working on a machine learning algorithm in c++ using CUDA. For testing purposes I tried to teach a "(2,400,400,1) <- layer structure" network the function of an xor operator. It seems to kinda work, but only with dramatically different learning rates for weights(0.03) and biases(0.001)...
    submitted by /u/qwedp [link] [comments]  ( 1 min )
    Why do we normalize image pixels by 255? | Kolbenkraft
    submitted by /u/cjmodi306 [link] [comments]  ( 1 min )
  • Open

    Will China's A.I. Nanny Augment China's Fertility Problem?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    We live in beautiful times where you can learn Machine Learning ("AI") and become an expert for free. Here are many curated resources in a complete guide for everyone (even with no tech background)
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
  • Open

    An explorer in the sprawling universe of possible chemical combinations
    Heather Kulik embraces computer models as “the only way to make a dent” in the vast number of potential materials that could solve important problems.  ( 6 min )
  • Open

    Staying updated with the RL advances
    Hello esteemed community, This is not a technical question post but the answer to this question will help me to improve my technical knowledge. I am a grad student and was happened to get to know about the explainability in AI. Since I am interested in reinforcement learning, I looked for Explainable methods in RL which are dedicatedly referred as Explainable Reinforcement Learning (XRL). I saw a informative post in XRL on towards data science which was well written on various models with the sufficient math. It was a work of another grad student. I was curious behind the mind process of how people are ahead with a specific domain. What should I do to put myselves ahead and be in a position where I can write about such advanced RL topics. It was not in the, "Reinforcement Learning: An Introduction" by Richard S.Sutton and Andrew G.Barto. I was not aware of this in the Reinforcement learning course in coursera. I do convince myselves that the math is hard and it takes time to get good grasp on the concepts. Then in the process will get to know about lot of things. I just wanted to know whether it really takes time or am I missing on something ? I am really sorry if this question is petty...I mean people may think that this does not need lot of thinking someday we will get to know about the jargons but I just am concerned that I am always feeling like missing out on important things. Thanks a lot for your time. submitted by /u/Robust__Tensor [link] [comments]  ( 2 min )
    "Selective Eye-gaze Augmentation To Enhance Imitation Learning In Atari Games", Thammineni et al 2020 (using Atari-HEAD)
    submitted by /u/gwern [link] [comments]  ( 1 min )

  • Open

    [R] PromptBERT: Improving BERT Sentence Embeddings with Prompts. tl/dr For sentence embeddings, an input text prompt out performs average pooling and the CLS token. Anyone else confused by this?
    submitted by /u/BatmantoshReturns [link] [comments]  ( 1 min )
    [R] Local Feature Matching with Transformers for low-end devices
    Research about optimizing LoFTR method for devices with limited computational and memory resources like embedded devices or consumer-level GPUs. For example, it allows to use it on NVIDIA Jetson Nano 2GB with reasonable accuracy and performance. LoFTR is an efficient deep learning method for searching local feature matches on image pairs. Feature matches are an important resource for camera positioning tasks, for example, in SLAM. Paper: https://arxiv.org/abs/2202.00770 Code and demo: https://github.com/Kolkir/Coarse_LoFTR_TRT submitted by /u/kolkir [link] [comments]  ( 1 min )
    [R] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 1 min )
    [N]New Autonomous driving dataset
    https://feluccaai.com/mo7-rural-dataset/ submitted by /u/MikeQQ12 [link] [comments]  ( 1 min )
    [R] Can ANNs approximate multiple solutions for a PDE at once?
    Artificial Neural Networks can approximate any partial differential equation with enough neurons and hidden layers. If I was to create a NN infrastructure that takes 2 sets of inputs (2 * (x1,x2,x3,x4)) for a PDE and the model is trained to learn the PDE f(x) with 2 neurons in the output layer (one for each PDE solution approximation), would it be capable of approximating the 2 sets of inputs accurately? If anyone has any experience/ know of any papers that discuss the ability of NNs to approximate multiple input sets in the same NN model I'd appreciate any advice. Cheers! submitted by /u/Algo-G-H [link] [comments]  ( 1 min )
    [D] GAN output different scales
    Hello! I have a GAN for image2image translation and I must predict images with the generator that are on different scales i.e. without normalising one image can have values between -20 and 20 and another between -2 and 2. The GAN expects outputs between 0 and 1. If I normalize the values the GAN (in the case above by dividing by 40) and adding 0.5) learns to correctly predict the images in the range of -20 and 20 but not the others. I guess it is because the loss is much more strong for the ones that have higher value. Is there any paper or a known fix regarding this problem? submitted by /u/LazyButAmbitious [link] [comments]  ( 1 min )
    [R] DIAMBRA a New Ecosystem for Reinforcement Learning Research and Experimentation
    submitted by /u/DIAMBRA_AIArena [link] [comments]  ( 1 min )
    Powerfull visualization tool : Dimensionality Reduction + Clustering + Unsupervised Score Metrics [P]
    Hi everyone, I provide a high level module in python to perform state of the art dimensionality reduction and clustering. This is totally unsupervised and the performance is figure it out by unsupervised score metrics. This is compatible with GridSearch and BayesSearch (explain on the github's READme) Example DimReductionClustering is a sklearn estimator allowing to reduce the dimension of your data and then to apply an unsupervised clustering algorithm. The quality of the cluster can be done according to different metrics. The steps of the pipeline are the following: Perform a dimension reduction of the data using UMAP Numerically find the best epsilon parameter for DBSCAN Perform a density based clustering methods : DBSCAN Estimate cluster quality using silhouette score or DBCV Github link : https://github.com/MathieuCayssol/DimReductionClustering Nice to have feedback and happy if it is useful for you ! submitted by /u/Mathieu23AI [link] [comments]  ( 1 min )
    [P] Probe PyTorch models
    We wrote simple class to extract hidden layer outputs, gradients, and parameters of PyTorch models. Here's an example: probe = ModelProbe(model) out = model (inp) attn queries_filter = probe.forward_output['*.attention.query'] attn queries = torch.stack(attn_queries _filter.get list()) 🧑‍🏫 Demo that extracts attention maps of BERT 📚 Documentation 💻 Github submitted by /u/mlvpj [link] [comments]
    [D] Non-convex == Concave?
    Just a beginner's question. If a function is found to be non-convex. It is safe to call it a concave function? submitted by /u/das_gupta [link] [comments]
  • Open

    NeuronalNetwork for medicine
    Why dont we have giant tech companies training neuronalnetworks in order to predict and prevent diseases? submitted by /u/Particular-Cause-862 [link] [comments]
    OpenAI Team Introduces ‘InstructGPT’ Model Developed With Reinforcement Learning From Human Feedback (RLHF) To Make Models Safer, Helpful, And Aligned
    A system can theoretically learn anything from a set of data. In practice, however, it is little more than a model dependent on a few cases. Although pretrained language models such as Open AI’s GPT-3 have excelled at a wide range of natural language processing (NLP) tasks, there are times when unintended outputs, or those not following the user’s instructions, are generated. Not only that, but their outcomes have been observed to be prejudiced, untruthful, or poisonous, potentially having harmful societal consequences. OpenAI researchers have made substantial progress in better aligning big language models with users’ goals using reinforcement learning from human feedback (RLHF) methodologies. The team proposed InstructGPT models that have been demonstrated to produce more accurate and less harmful results in tests. Continue Reading Open AI Blog: https://openai.com/blog/instruction-following/ Paper: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf https://preview.redd.it/pxj1gqang2g81.png?width=1452&format=png&auto=webp&s=6674387e037a9849b3d88472c367d5497389e896 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Will Intelligent programs (like AI)ever be able to modify its own code ?
    submitted by /u/MrAzimut [link] [comments]  ( 2 min )
    An inspiring review of all the changes GPT and LLM brought to us in 2021, by Dr. Alan D. Thompson
    submitted by /u/JavaMochaNeuroCam [link] [comments]  ( 1 min )
    LinkedIn Introduces DARWIN: A Unified “One-Stop” Data Science and Artificial Intelligence Platform That Would Centralize And Serve The Various Needs Of Data Scientists And AI Engineers
    LinkedIn is the world’s largest professional network, producing enormous amounts of high-quality data. Data scientists and AI developers have been using various tools to engage with data via multiple query and storage engines for EDA (exploratory data analysis) and visualization. This created a need for a unified “one-stop” data science platform to consolidate and service the varied demands. Hence, DARWIN Workbench came into play. DARWIN targeted use-cases similar to those addressed by other prominent data science platforms in the industry. While it uses the Jupyter ecosystem, it goes beyond Jupyter notebooks to meet the full range of data scientists and AI engineers’ needs at LinkedIn. Continue reading | LinkedIn Blog submitted by /u/ai-lover [link] [comments]  ( 1 min )
  • Open

    Lava lamp simulated by neural net in infinite loop
    submitted by /u/binaryfor [link] [comments]
    Sometimes we do not need complex neural nets (Flappy Bird AI)
    submitted by /u/hotcodist [link] [comments]
    Here are a few things you can do with Disco Diffusion V4.1
    ​ https://preview.redd.it/g9eme1q3wyf81.png?width=1280&format=png&auto=webp&s=1c5b928659447162a24c97e955baf82ce84b5dd8 https://preview.redd.it/9c8wv0q3wyf81.png?width=1280&format=png&auto=webp&s=4b3c492854bd7d4eafbe80f3a1ea160e29f6dc5e https://preview.redd.it/sjw1a4q3wyf81.png?width=1280&format=png&auto=webp&s=edf78bf2d01506c2d4d9e0ab82c4d9377f5e01ff https://preview.redd.it/3n2q84q3wyf81.png?width=1280&format=png&auto=webp&s=9ad33b46492669421ed8d858c176f7f626e84c3c https://preview.redd.it/qpa0q7q3wyf81.png?width=1280&format=png&auto=webp&s=d2080ab3c1d1e668f5585207633808ab3ea1ee6f submitted by /u/koalapon [link] [comments]  ( 1 min )
  • Open

    Using a different output activation on the SAC algorithm
    Hi Reddit, I'm currently using soft actor-critic for a task that involves predicting bit vectors. Thus, after some rounding, the actor will output a fixed-length vector that looks something like [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0] The output is then compared to an existing database of bit vectors, and the most similar vector (via something like cosine similarity) within that database is selected as the final action that modifies the environment. If I am not mistaken, the actor in the original implementation applies softmax to the output, which may not be good in this situation for straightforward reasons. I was thinking about a different output activation like tanh or sigmoid, but I'm not sure if that would break the underlying algorithm that optimizes the model. Can I change the actor network's output activation without having to modify the optimization algorithm? If not, how should I go about changing it, if it is even possible? Also, assuming an activation like sigmoid works, would this count as a continuous or discrete action space? submitted by /u/uwuWasabi [link] [comments]  ( 1 min )
  • Open

    Lessons Learned from Writing Online
    It’s been one year since I published the last issue of my newsletter. At that time I decided to step back from writing in order to spend more time with my family and focus on my growing responsibilities as the… Read More The post Lessons Learned from Writing Online appeared first on ML in Production.  ( 10 min )
  • Open

    Lessons Learned from Writing Online
    It’s been one year since I published the last issue of my newsletter. At that time I decided to step back from writing in order to spend more time with my family and focus on my growing responsibilities as the… Read More The post Lessons Learned from Writing Online appeared first on ML in Production.  ( 10 min )
  • Open

    Evolution of IT Service Desk Operations
    During the late 80s and 90s, IT Helpdesk was a different beast as compared to what we have today. It acted as a single point of contact for users’ IT issues/ requests. Terms like “catch and dispatch” and “log and flog” have become prevalent and are being used. With the rapid adoption ITIL framework, the… Read More »Evolution of IT Service Desk Operations The post Evolution of IT Service Desk Operations appeared first on Data Science Central.  ( 3 min )
  • Open

    A Flexible Clustering Pipeline for Mining Text Intentions. (arXiv:2202.01211v1 [cs.CL])
    Mining the latent intentions from large volumes of natural language inputs is a key step to help data analysts design and refine Intelligent Virtual Assistants (IVAs) for customer service and sales support. We created a flexible and scalable clustering pipeline within the Verint Intent Manager (VIM) that integrates the fine-tuning of language models, a high performing k-NN library and community detection techniques to help analysts quickly surface and organize relevant user intentions from conversational texts. The fine-tuning step is necessary because pre-trained language models cannot encode texts to efficiently surface particular clustering structures when the target texts are from an unseen domain or the clustering task is not topic detection. We describe the pipeline and demonstrate its performance using BERT on three real-world text mining tasks. As deployed in the VIM application, this clustering pipeline produces high quality results, improving the performance of data analysts and reducing the time it takes to surface intentions from customer service data, thereby reducing the time it takes to build and deploy IVAs in new domains.  ( 2 min )
    Scaling Structured Inference with Randomization. (arXiv:2112.03638v2 [cs.LG] UPDATED)
    Deep discrete structured models have seen considerable progress recently, but traditional inference using dynamic programming (DP) typically works with a small number of states (less than hundreds), which severely limits model capacity. At the same time, across machine learning, there is a recent trend of using randomized truncation techniques to accelerate computations involving large sums. Here, we propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states. Our method is widely applicable to classical DP-based inference (partition, marginal, reparameterization, entropy) and different graph structures (chains, trees, and more general hypergraphs). It is also compatible with automatic differentiation: it can be integrated with neural networks seamlessly and learned with gradient-based optimizers. Our core technique approximates the sum-product by restricting and reweighting DP on a small subset of nodes, which reduces computation by orders of magnitude. We further achieve low bias and variance via Rao-Blackwellization and importance sampling. Experiments over different graphs demonstrate the accuracy and efficiency of our approach. Furthermore, when using RDP for training a structured variational autoencoder with a scaled inference network, we achieve better test likelihood than baselines and successfully prevent posterior collapse  ( 2 min )
    PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts. (arXiv:2202.01279v1 [cs.LG])
    PromptSource is a system for creating, sharing, and using natural language prompts. Prompts are functions that map an example from a dataset to a natural language input and target output. Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively. PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3) a community-driven set of guidelines for contributing new prompts to a common pool. Over 2,000 prompts for roughly 170 datasets are already available in PromptSource. PromptSource is available at https://github.com/bigscience-workshop/promptsource.  ( 2 min )
    An Empirical Review of Optimization Techniques for Quantum Variational Circuits. (arXiv:2202.01389v1 [quant-ph])
    Quantum Variational Circuits (QVCs) are often claimed as one of the most potent uses of both near term and long term quantum hardware. The standard approaches to optimizing these circuits rely on a classical system to compute the new parameters at every optimization step. However, this process can be extremely challenging both in terms of navigating the exponentially scaling complex Hilbert space, barren plateaus, and the noise present in all foreseeable quantum hardware. Although a variety of optimization algorithms are employed in practice, there is often a lack of theoretical or empirical motivations for this choice. To this end we empirically evaluate the potential of many common gradient and gradient free optimizers on a variety of optimization tasks. These tasks include both classical and quantum data based optimization routines. Our evaluations were conducted in both noise free and noisy simulations. The large number of problems and optimizers yields strong empirical guidance for choosing optimizers for QVCs that is currently lacking.  ( 2 min )
    Who will Leave a Pediatric Weight Management Program and When? -- A machine learning approach for predicting attrition patterns. (arXiv:2202.01765v1 [cs.LG])
    Childhood obesity is a major public health concern. Multidisciplinary pediatric weight management programs are considered standard treatment for children with obesity and severe obesity who are not able to be successfully managed in the primary care setting; however, high drop-out rates (referred to as attrition) are a major hurdle in delivering successful interventions. Predicting attrition patterns can help providers reduce the attrition rates. Previous work has mainly focused on finding static predictors of attrition using statistical analysis methods. In this study, we present a machine learning model to predict (a) the likelihood of attrition, and (b) the change in body-mass index (BMI) percentile of children, at different time points after joining a weight management program. We use a five-year dataset containing the information related to around 4,550 children that we have compiled using data from the Nemours Pediatric Weight Management program. Our models show strong prediction performance as determined by high AUROC scores across different tasks (average AUROC of 0.75 for predicting attrition, and 0.73 for predicting weight outcomes). Additionally, we report the top features predicting attrition and weight outcomes in a series of explanatory experiments.  ( 2 min )
    Self-consistent Gradient-like Eigen Decomposition in Solving Schr\"odinger Equations. (arXiv:2202.01388v1 [quant-ph])
    The Schr\"odinger equation is at the heart of modern quantum mechanics. Since exact solutions of the ground state are typically intractable, standard approaches approximate Schr\"odinger equation as forms of nonlinear generalized eigenvalue problems $F(V)V = SV\Lambda$ in which $F(V)$, the matrix to be decomposed, is a function of its own top-$k$ smallest eigenvectors $V$, leading to a "self-consistency problem". Traditional iterative methods heavily rely on high-quality initial guesses of $V$ generated via domain-specific heuristics methods based on quantum mechanics. In this work, we eliminate such a need for domain-specific heuristics by presenting a novel framework, Self-consistent Gradient-like Eigen Decomposition (SCGLED) that regards $F(V)$ as a special "online data generator", thus allows gradient-like eigendecomposition methods in streaming $k$-PCA to approach the self-consistency of the equation from scratch in an iterative way similar to online learning. With several critical numerical improvements, SCGLED is robust to initial guesses, free of quantum-mechanism-based heuristics designs, and neat in implementation. Our experiments show that it not only can simply replace traditional heuristics-based initial guess methods with large performance advantage (achieved averagely 25x more precise than the best baseline in similar wall time), but also is capable of finding highly precise solutions independently without any traditional iterative methods.  ( 2 min )
    Prototype Memory for Large-scale Face Representation Learning. (arXiv:2105.02103v2 [cs.CV] UPDATED)
    Face representation learning using datasets with a massive number of identities requires appropriate training methods. Softmax-based approach, currently the state-of-the-art in face recognition, in its usual "full softmax" form is not suitable for datasets with millions of persons. Several methods, based on the "sampled softmax" approach, were proposed to remove this limitation. These methods, however, have a set of disadvantages. One of them is a problem of "prototype obsolescence": classifier weights (prototypes) of the rarely sampled classes receive too scarce gradients and become outdated and detached from the current encoder state, resulting in incorrect training signals. This problem is especially serious in ultra-large-scale datasets. In this paper, we propose a novel face representation learning model called Prototype Memory, which alleviates this problem and allows training on a dataset of any size. Prototype Memory consists of the limited-size memory module for storing recent class prototypes and employs a set of algorithms to update it in appropriate way. New class prototypes are generated on the fly using exemplar embeddings in the current mini-batch. These prototypes are enqueued to the memory and used in a role of classifier weights for softmax classification-based training. To prevent obsolescence and keep the memory in close connection with the encoder, prototypes are regularly refreshed, and oldest ones are dequeued and disposed of. Prototype Memory is computationally efficient and independent of dataset size. It can be used with various loss functions, hard example mining algorithms and encoder architectures. We prove the effectiveness of the proposed model by extensive experiments on popular face recognition benchmarks.  ( 2 min )
    Adaptive Sampling Strategies to Construct Equitable Training Datasets. (arXiv:2202.01327v1 [cs.LG])
    In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection. We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.  ( 2 min )
    Ranking with Confidence for Large Scale Comparison Data. (arXiv:2202.01670v1 [cs.LG])
    In this work, we leverage a generative data model considering comparison noise to develop a fast, precise, and informative ranking algorithm from pairwise comparisons that produces a measure of confidence on each comparison. The problem of ranking a large number of items from noisy and sparse pairwise comparison data arises in diverse applications, like ranking players in online games, document retrieval or ranking human perceptions. Although different algorithms are available, we need fast, large-scale algorithms whose accuracy degrades gracefully when the number of comparisons is too small. Fitting our proposed model entails solving a non-convex optimization problem, which we tightly approximate by a sum of quasi-convex functions and a regularization term. Resorting to an iterative reweighted minimization and the Primal-Dual Hybrid Gradient method, we obtain PD-Rank, achieving a Kendall tau 0.1 higher than all comparing methods, even for 10\% of wrong comparisons in simulated data matching our data model, and leading in accuracy if data is generated according to the Bradley-Terry model, in both cases faster by one order of magnitude, in seconds. In real data, PD-Rank requires less computational time to achieve the same Kendall tau than active learning methods.  ( 2 min )
    Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers. (arXiv:2202.01306v1 [cs.DC])
    Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade, leaving only those who have access to massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have access to only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training large DNN models can often exceed the aggregate capacity of all available GPUs on commodity servers; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the boundaries of training large models efficiently on modest multi-GPU deployments. Across many large DNN models, Harmony is able to reduce swap load by up to two orders of magnitude and obtain a training throughput speedup of up to 7.6x over highly optimized baselines with virtualized memory.  ( 2 min )
    An Artificial Intelligence Dataset for Solar Energy Locations in India. (arXiv:2202.01340v1 [cs.LG])
    Rapid development of renewable energy sources, particularly solar photovoltaics, is critical to mitigate climate change. As a result, India has set ambitious goals to install 300 gigawatts of solar energy capacity by 2030. Given the large footprint projected to meet these renewable energy targets the potential for land use conflicts over environmental and social values is high. To expedite development of solar energy, land use planners will need access to up-to-date and accurate geo-spatial information of PV infrastructure. The majority of recent studies use either predictions of resource suitability or databases that are either developed thru crowdsourcing that often have significant sampling biases or have time lags between when projects are permitted and when location data becomes available. Here, we address this shortcoming by developing a spatially explicit machine learning model to map utility-scale solar projects across India. Using these outputs, we provide a cumulative measure of the solar footprint across India and quantified the degree of land modification associated with land cover types that may cause conflicts. Our analysis indicates that over 74\% of solar development In India was built on landcover types that have natural ecosystem preservation, and agricultural values. Thus, with a mean accuracy of 92\% this method permits the identification of the factors driving land suitability for solar projects and will be of widespread interest for studies seeking to assess trade-offs associated with the global decarbonization of green-energy systems. In the same way, our model increases the feasibility of remote sensing and long-term monitoring of renewable energy deployment targets.  ( 2 min )
    Speaker Normalization for Self-supervised Speech Emotion Recognition. (arXiv:2202.01252v1 [cs.LG])
    Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversary learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation. We demonstrate the efficacy of our method on both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.  ( 2 min )
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v2 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug in extension to existing methods for GP posterior sampling and Bayesian optimization.  ( 2 min )
    To Smooth or Not? When Label Smoothing Meets Noisy Labels. (arXiv:2106.04149v4 [cs.LG] UPDATED)
    Label smoothing (LS) is an arising learning paradigm that uses the positively weighted average of both the hard training labels and uniformly distributed soft labels. It was shown that LS serves as a regularizer for training data with hard labels and therefore improves the generalization of the model. Later it was reported LS even helps with improving robustness when learning with noisy labels. However, we observed that the advantage of LS vanishes when we operate in a high label noise regime. Intuitively speaking, this is due to the increased entropy of $\mathbb{P}(\text{noisy label}|X)$ when the noise rate is high, in which case, further applying LS tends to "oversmooth" the estimated posterior. We proceeded to discover that several learning-with-noisy-labels solutions in the literature instead relate more closely to negative/not label smoothing (NLS), which acts counter to LS and defines as using a negative weight to combine the hard and soft labels! We provide understandings for the properties of LS and NLS when learning with noisy labels. Among other established properties, we theoretically show NLS is considered more beneficial when the label noise rates are high. We provide extensive experimental results on multiple benchmarks to support our findings too.  ( 2 min )
    Understanding Dimensional Collapse in Contrastive Self-supervised Learning. (arXiv:2110.09348v2 [cs.CV] UPDATED)
    Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. Joint embedding approach bases on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.  ( 2 min )
    Separating Rule Discovery and Global Solution Composition in a Learning Classifier System. (arXiv:2202.01677v1 [cs.LG])
    The utilization of digital agents to support crucial decision making is increasing in many industrial scenarios. However, trust in suggestions made by these agents is hard to achieve, though essential for profiting from their application, resulting in a need for explanations for both the decision making process as well as the model itself. For many systems, such as common deep learning black-box models, achieving at least some explainability requires complex post-processing, while other systems profit from being, to a reasonable extent, inherently interpretable. In this paper we propose an easily interpretable rule-based learning system specifically designed and thus especially suited for these scenarios and compare it on a set of regression problems against XCSF, a prominent rule-based learning system with a long research history. One key advantage of our system is that the rules' conditions and which rules compose a solution to the problem are evolved separately. We utilise independent rule fitnesses which allows users to specifically tailor their model structure to fit the given requirements for explainability. We find that the results of SupRB2's evaluation are comparable to XCSF's while allowing easier control of model structure and showing a substantially smaller sensitivity to random seeds and data splits. This increased control aids in subsequently providing explanations for both the training and the final structure of the model.  ( 2 min )
    Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits. (arXiv:2202.01721v1 [cs.LG])
    Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems, since they allow efficient reuse of existing log data. However, there are fundamental limits to using existing log data alone, since the counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated. To overcome this limitation, we explore the question of how to design data-gathering policies that most effectively augment an existing dataset of bandit feedback with additional observations for both learning and evaluation. To this effect, this paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem. We explore multiple approaches to computing MVAL policies efficiently, and find that they can be substantially more effective in decreasing the variance of an estimator than na\"ive approaches.  ( 2 min )
    Learning with Asymmetric Kernels: Least Squares and Feature Interpretation. (arXiv:2202.01397v1 [cs.LG])
    Asymmetric kernels naturally exist in real life, e.g., for conditional probability and directed graphs. However, most of the existing kernel-based learning methods require kernels to be symmetric, which prevents the use of asymmetric kernels. This paper addresses the asymmetric kernel-based learning in the framework of the least squares support vector machine named AsK-LS, resulting in the first classification method that can utilize asymmetric kernels directly. We will show that AsK-LS can learn with asymmetric features, namely source and target features, while the kernel trick remains applicable, i.e., the source and target features exist but are not necessarily known. Besides, the computational burden of AsK-LS is as cheap as dealing with symmetric kernels. Experimental results on the Corel database, directed graphs, and the UCI database will show that in the case asymmetric information is crucial, the proposed AsK-LS can learn with asymmetric kernels and performs much better than the existing kernel methods that have to do symmetrization to accommodate asymmetric kernels.  ( 2 min )
    Exploring Sub-skeleton Trajectories for Interpretable Recognition of Sign Language. (arXiv:2202.01390v1 [cs.CV])
    Recent advances in tracking sensors and pose estimation software enable smart systems to use trajectories of skeleton joint locations for supervised learning. We study the problem of accurately recognizing sign language words, which is key to narrowing the communication gap between hard and non-hard of hearing people. Our method explores a geometric feature space that we call `sub-skeleton' aspects of movement. We assess similarity of feature space trajectories using natural, speed invariant distance measures, which enables clear and insightful nearest neighbor classification. The simplicity and scalability of our basic method allows for immediate application in different data domains with little to no parameter tuning. We demonstrate the effectiveness of our basic method, and a boosted variation, with experiments on data from different application domains and tracking technologies. Surprisingly, our simple methods improve sign recognition over recent, state-of-the-art approaches.  ( 2 min )
    Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens. (arXiv:2202.01338v1 [cs.LG])
    We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem. The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens. This yields a dichotomous model that can seamlessly transition between solving regression tasks and conditional generation tasks; solely governed by the mask location. We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation based on a self-consistency loss. Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with cross entropy loss. Importantly, priming the same model with continuous properties yields a highly competitive conditional generative models that outperforms specialized approaches in a constrained property optimization benchmark. In sum, the Regression Transformer opens the door for "swiss army knife" models that excel at both regression and conditional generation. This finds application particularly in property-driven, local exploration of the chemical or protein space.  ( 2 min )
    Topological Classification in a Wasserstein Distance Based Vector Space. (arXiv:2202.01275v1 [cs.LG])
    Classification of large and dense networks based on topology is very difficult due to the computational challenges of extracting meaningful topological features from real-world networks. In this paper we present a computationally tractable approach to topological classification of networks by using principled theory from persistent homology and optimal transport to define a novel vector representation for topological features. The proposed vector space is based on the Wasserstein distance between persistence barcodes. The 1-skeleton of the network graph is employed to obtain 1-dimensional persistence barcodes that represent connected components and cycles. These barcodes and the corresponding Wasserstein distance can be computed very efficiently. The effectiveness of the proposed vector space is demonstrated using support vector machines to classify simulated networks and measured functional brain networks.  ( 2 min )
    SparGE: Sparse Coding-based Patient Similarity Learning via Low-rank Constraints and Graph Embedding. (arXiv:2202.01427v1 [cs.LG])
    Patient similarity assessment (PSA) is pivotal to evidence-based and personalized medicine, enabled by analyzing the increasingly available electronic health records (EHRs). However, machine learning approaches for PSA has to deal with inherent data deficiencies of EHRs, namely missing values, noise, and small sample sizes. In this work, an end-to-end discriminative learning framework, called SparGE, is proposed to address these data challenges of EHR for PSA. SparGE measures similarity by jointly sparse coding and graph embedding. First, we use low-rank constrained sparse coding to identify and calculate weight for similar patients, while denoising against missing values. Then, graph embedding on sparse representations is adopted to measure the similarity between patient pairs via preserving local relationships defined by distances. Finally, a global cost function is constructed to optimize related parameters. Experimental results on two private and public real-world healthcare datasets, namely SingHEART and MIMIC-III, show that the proposed SparGE significantly outperforms other machine learning patient similarity methods.  ( 2 min )
    Learning strides in convolutional neural networks. (arXiv:2202.01653v1 [cs.LG])
    Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shift-invariance while reducing the computational complexity of the whole architecture. A critical hyperparameter of such layers is their stride: the integer factor of downsampling. As strides are not differentiable, finding the best configuration either requires cross-validation or discrete optimization (e.g. architecture search), which rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers. Hence, exploring this search space by gradient descent would allow finding better configurations at a lower computational cost. This work introduces DiffStride, the first downsampling layer with learnable strides. Our layer learns the size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way. Experiments on audio and image classification show the generality and effectiveness of our solution: we use DiffStride as a drop-in replacement to standard downsampling layers and outperform them. In particular, we show that introducing our layer into a ResNet-18 architecture allows keeping consistent high performance on CIFAR10, CIFAR100 and ImageNet even when training starts from poor random stride configurations. Moreover, formulating strides as learnable variables allows us to introduce a regularization term that controls the computational complexity of the architecture. We show how this regularization allows trading off accuracy for efficiency on ImageNet.  ( 2 min )
    Optimal Network Compression. (arXiv:2008.08733v4 [q-fin.RM] UPDATED)
    This paper introduces a formulation of the optimal network compression problem for financial systems. This general formulation is presented for different levels of network compression or rerouting allowed from the initial interbank network. We prove that this problem is, generically, NP-hard. We focus on objective functions generated by systemic risk measures under systematic shocks to the financial network. We conclude by studying the optimal compression problem for specific networks; this permits us to study the so-called robust fragility of certain network topologies more generally as well as the potential benefits and costs of network compression. In particular, under systematic shocks and heterogeneous financial networks the typical heuristics of robust fragility no longer hold generally.  ( 2 min )
    Membership Inference Attacks on Machine Learning: A Survey. (arXiv:2103.07853v4 [cs.LG] UPDATED)
    Machine learning (ML) models have been widely applied to various applications, including image classification, text generation, audio recognition, and graph data analysis. However, recent studies have shown that ML models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. MIAs on ML models can directly lead to a privacy breach. For example, via identifying the fact that a clinical record that has been used to train a model associated with a certain disease, an attacker can infer that the owner of the clinical record has the disease with a high chance. In recent years, MIAs have been shown to be effective on various ML models, e.g., classification models and generative models. Meanwhile, many defense methods have been proposed to mitigate MIAs. Although MIAs on ML models form a newly emerging and rapidly growing research area, there has been no systematic survey on this topic yet. In this paper, we conduct the first comprehensive survey on membership inference attacks and defenses. We provide the taxonomies for both attacks and defenses, based on their characterizations, and discuss their pros and cons. Based on the limitations and gaps identified in this survey, we point out several promising future research directions to inspire the researchers who wish to follow this area. This survey not only serves as a reference for the research community but also provides a clear description for researchers outside this research domain. To further help the researchers, we have created an online resource repository, which we will keep updated with future relevant work. Interested readers can find the repository at https://github.com/HongshengHu/membership-inference-machine-learning-literature.  ( 3 min )
    Comparing Bayesian Models for Organ Contouring in Head and Neck Radiotherapy. (arXiv:2111.01134v2 [eess.IV] UPDATED)
    Deep learning models for organ contouring in radiotherapy are poised for clinical usage, but currently, there exist few tools for automated quality assessment (QA) of the predicted contours. Using Bayesian models and their associated uncertainty, one can potentially automate the process of detecting inaccurate predictions. We investigate two Bayesian models for auto-contouring, DropOut and FlipOut, using a quantitative measure - expected calibration error (ECE) and a qualitative measure - region-based accuracy-vs-uncertainty (R-AvU) graphs. It is well understood that a model should have low ECE to be considered trustworthy. However, in a QA context, a model should also have high uncertainty in inaccurate regions and low uncertainty in accurate regions. Such behaviour could direct visual attention of expert users to potentially inaccurate regions, leading to a speed up in the QA process. Using R-AvU graphs, we qualitatively compare the behaviour of different models in accurate and inaccurate regions. Experiments are conducted on the MICCAI2015 Head and Neck Segmentation Challenge and on the DeepMindTCIA CT dataset using three models: DropOut-DICE, Dropout-CE (Cross Entropy) and FlipOut-CE. Quantitative results show that DropOut-DICE has the highest ECE, while Dropout-CE and FlipOut-CE have the lowest ECE. To better understand the difference between DropOut-CE and FlipOut-CE, we use the R-AvU graph which shows that FlipOut-CE has better uncertainty coverage in inaccurate regions than DropOut-CE. Such a combination of quantitative and qualitative metrics explores a new approach that helps to select which model can be deployed as a QA tool in clinical settings.  ( 2 min )
    Partitioning Cloud-based Microservices (via Deep Learning). (arXiv:2109.14569v2 [cs.LG] UPDATED)
    Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are "brittle"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.  ( 2 min )
    Log-Euclidean Signatures for Intrinsic Distances Between Unaligned Datasets. (arXiv:2202.01671v1 [stat.ML])
    The need for efficiently comparing and representing datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets. We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples. Existing methods typically compare such operators in a pointwise manner or assume known data alignment. Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new theoretically-motivated distance based on a lower bound of the log-Euclidean metric. Our framework facilitates comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities. Our log-Euclidean signature (LES) distance recovers meaningful structural differences, outperforming competing methods in various application domains.  ( 2 min )
    MGC: A Complex-Valued Graph Convolutional Network for Directed Graphs. (arXiv:2110.07570v2 [cs.LG] UPDATED)
    Recent advancements in Graph Neural Networks have led to state-of-the-art performance on graph representation learning. However, the majority of existing works process directed graphs by symmetrization, which causes loss of directional information. To address this issue, we introduce the magnetic Laplacian, a discrete Schr\"odinger operator with magnetic field, which preserves edge directionality by encoding it into a complex phase with an electric charge parameter. By adopting a truncated variant of PageRank named Linear- Rank, we design and build a low-pass filter for homogeneous graphs and a high-pass filter for heterogeneous graphs. In this work, we propose a complex-valued graph convolutional network named Magnetic Graph Convolutional network (MGC). With the corresponding complex-valued techniques, we ensure our model will be degenerated into real-valued when the charge parameter is in specific values. We test our model on several graph datasets including directed homogeneous and heterogeneous graphs. The experimental results demonstrate that MGC is fast, powerful, and widely applicable.  ( 2 min )
    Peering Beyond the Gradient Veil with Distributed Auto Differentiation. (arXiv:2102.09631v3 [cs.LG] UPDATED)
    Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across different machines, nodes, and sites still results in considerable communication overhead, impeding reliable training in real-world contexts. The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs) tends to be communication-heavy, often requiring additional adaptations such as sparsity constraints, compression, quantization, and more, to curtail bandwidth. We introduce an innovative, communication-friendly approach for training distributed DNNs, which capitalizes on the outer-product structure of the gradient as revealed by the mechanics of auto-differentiation. The exposed structure of the gradient evokes a new class of distributed learning algorithm, which is naturally more communication-efficient than full gradient sharing. Our approach, called distributed auto-differentiation (dAD), builds off a marriage of rank-based compression and the innate structure of the gradient as an outer-product. We demonstrate that dAD trains more efficiently than other state of the art distributed methods on modern architectures, such as transformers, when applied to large-scale text and imaging datasets. The future of distributed learning, we determine, need not be dominated by gradient-centric algorithms.  ( 2 min )
    Learning Mechanically Driven Emergent Behavior with Message Passing Neural Networks. (arXiv:2202.01380v1 [cs.LG])
    From designing architected materials to connecting mechanical behavior across scales, computational modeling is a critical tool in solid mechanics. Recently, there has been a growing interest in using machine learning to reduce the computational cost of physics-based simulations. Notably, while machine learning approaches that rely on Graph Neural Networks (GNNs) have shown success in learning mechanics, the performance of GNNs has yet to be investigated on a myriad of solid mechanics problems. In this work, we examine the ability of GNNs to predict a fundamental aspect of mechanically driven emergent behavior: the connection between a column's geometric structure and the direction that it buckles. To accomplish this, we introduce the Asymmetric Buckling Columns (ABC) dataset, a dataset comprised of three sub-datasets of asymmetric and heterogeneous column geometries where the goal is to classify the direction of symmetry breaking (left or right) under compression after the onset of instability. Because of complex local geometry, the "image-like" data representations required for implementing standard convolutional neural network based metamodels are not ideal, thus motivating the use of GNNs. In addition to investigating GNN model architecture, we study the effect of different input data representation approaches, data augmentation, and combining multiple models as an ensemble. While we were able to obtain good results, we also showed that predicting solid mechanics based emergent behavior is non-trivial. Because both our model implementation and dataset are distributed under open-source licenses, we hope that future researchers can build on our work to create enhanced mechanics-specific machine learning pipelines for capturing the behavior of complex geometric structures.  ( 2 min )
    Adaptive Clustering and Personalization in Multi-Agent Stochastic Linear Bandits. (arXiv:2106.08902v2 [stat.ML] UPDATED)
    We consider the problem of minimizing regret in an $N$ agent heterogeneous stochastic linear bandits framework, where the agents (users) are similar but not all identical. We model user heterogeneity using two popularly used ideas in practice; (i) A clustering framework where users are partitioned into groups with users in the same group being identical to each other, but different across groups, and (ii) a personalization framework where no two users are necessarily identical, but a user's parameters are close to that of the population average. In the clustered users' setup, we propose a novel algorithm, based on successive refinement of cluster identities and regret minimization. We show that, for any agent, the regret scales as $\mathcal{O}(\sqrt{T/N})$, if the agent is in a `well separated' cluster, or scales as $\mathcal{O}(T^{\frac{1}{2} + \varepsilon}/(N)^{\frac{1}{2} -\varepsilon})$ if its cluster is not well separated, where $\varepsilon$ is positive and arbitrarily close to $0$. Our algorithm is adaptive to the cluster separation, and is parameter free -- it does not need to know the number of clusters, separation and cluster size, yet the regret guarantee adapts to the inherent complexity. In the personalization framework, we introduce a natural algorithm where, the personal bandit instances are initialized with the estimates of the global average model. We show that, an agent $i$ whose parameter deviates from the population average by $\epsilon_i$, attains a regret scaling of $\widetilde{O}(\epsilon_i\sqrt{T})$. This demonstrates that if the user representations are close (small $\epsilon_i)$, the resulting regret is low, and vice-versa. The results are empirically validated and we observe superior performance of our adaptive algorithms over non-adaptive baselines.  ( 2 min )
    RipsNet: a general architecture for fast and robust estimation of the persistent homology of point clouds. (arXiv:2202.01725v1 [cs.CG])
    The use of topological descriptors in modern machine learning applications, such as Persistence Diagrams (PDs) arising from Topological Data Analysis (TDA), has shown great potential in various domains. However, their practical use in applications is often hindered by two major limitations: the computational complexity required to compute such descriptors exactly, and their sensitivity to even low-level proportions of outliers. In this work, we propose to bypass these two burdens in a data-driven setting by entrusting the estimation of (vectorization of) PDs built on top of point clouds to a neural network architecture that we call RipsNet. Once trained on a given data set, RipsNet can estimate topological descriptors on test data very efficiently with generalization capacity. Furthermore, we prove that RipsNet is robust to input perturbations in terms of the 1-Wasserstein distance, a major improvement over the standard computation of PDs that only enjoys Hausdorff stability, yielding RipsNet to substantially outperform exactly-computed PDs in noisy settings. We showcase the use of RipsNet on both synthetic and real-world data. Our open-source implementation is publicly available at https://github.com/hensel-f/ripsnet and will be included in the Gudhi library.  ( 2 min )
    Learning Contraction Policies from Offline Data. (arXiv:2112.05911v2 [cs.LG] UPDATED)
    This paper proposes a data-driven method for learning convergent control policies from offline data using Contraction theory. Contraction theory enables constructing a policy that makes the closed-loop system trajectories inherently convergent towards a unique trajectory. At the technical level, identifying the contraction metric, which is the distance metric with respect to which a robot's trajectories exhibit contraction is often non-trivial. We propose to jointly learn the control policy and its corresponding contraction metric while enforcing contraction. To achieve this, we learn an implicit dynamics model of the robotic system from an offline data set consisting of the robot's state and input trajectories. Using this learned dynamics model, we propose a data augmentation algorithm for learning contraction policies. We randomly generate samples in the state-space and propagate them forward in time through the learned dynamics model to generate auxiliary sample trajectories. We then learn both the control policy and the contraction metric such that the distance between the trajectories from the offline data set and our generated auxiliary sample trajectories decreases over time. We evaluate the performance of our proposed framework on simulated robotic goal-reaching tasks and demonstrate that enforcing contraction results in faster convergence and greater robustness of the learned policy.  ( 2 min )
    Node Selection Toward Faster Convergence for Federated Learning on Non-IID Data. (arXiv:2105.07066v3 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed learning paradigm that enables a large number of resource-limited nodes to collaboratively train a model without data sharing. The non-independent-and-identically-distributed (non-i.i.d.) data samples invoke discrepancies between the global and local objectives, making the FL model slow to converge. In this paper, we proposed Optimal Aggregation algorithm for better aggregation, which finds out the optimal subset of local updates of participating nodes in each global round, by identifying and excluding the adverse local updates via checking the relationship between the local gradient and the global gradient. Then, we proposed a Probabilistic Node Selection framework (FedPNS) to dynamically change the probability for each node to be selected based on the output of Optimal Aggregation. FedPNS can preferentially select nodes that propel faster model convergence. The unbiasedness of the proposed FedPNS design is illustrated and the convergence rate improvement of FedPNS over the commonly adopted Federated Averaging (FedAvg) algorithm is analyzed theoretically. Experimental results demonstrate the effectiveness of FedPNS in accelerating the FL convergence rate, as compared to FedAvg with random node selection.  ( 2 min )
    Robust Binary Models by Pruning Randomly-initialized Networks. (arXiv:2202.01341v1 [cs.LG])
    We propose ways to obtain robust models against adversarial attacks from randomly-initialized binary networks. Unlike adversarial training, which learns the model parameters, we in contrast learn the structure of the robust model by pruning a randomly-initialized binary network. Our method confirms the strong lottery ticket hypothesis in the presence of adversarial attacks. Compared to the results obtained in a non-adversarial setting, we in addition improve the performance and compression of the model by 1) using an adaptive pruning strategy for different layers, and 2) using a different initialization scheme such that all model parameters are initialized either to +1 or -1. Our extensive experiments demonstrate that our approach performs not only better than the state-of-the art for robust binary networks; it also achieves comparable or even better performance than full-precision network training methods.  ( 2 min )
    Inference-Time Unlabeled Personalized Federated Learning. (arXiv:2111.08356v2 [cs.LG] UPDATED)
    In Federated learning (FL), multiple clients collaborate to learn a shared model through a central server while they keep data decentralized. Personalized federated learning (PFL) further extends FL by learning personalized models per client. In both FL and PFL, all clients participate in the training process and their labeled data is used for training. However, in reality, novel clients may wish to join a prediction service after it has been deployed, obtaining predictions for their own unlabeled data. Here, we introduce a new learning setup, Inference-Time Unlabeled PFL (ITU-PFL), where a system trained on a set of clients, needs to be later applied to novel unlabeled clients at inference time. We propose a novel approach to this problem, ITUFL-HN, which uses a hypernetwork to produce a new model for the late-to-the-party client. Specifically, we train an encoder network that learns a representation for a client given its unlabeled data. That client representation is fed to a hypernetwork that generates a personalized model for that client. Evaluated on five benchmark datasets, we find that ITUFL-HN generalizes better than current FL and PFL methods, especially when the novel client has a large domain shift from training clients. We also analyzed the generalization error for novel clients, and showed analytically and experimentally how they can apply differential privacy to their data.  ( 2 min )
    Cyclical Pruning for Sparse Neural Networks. (arXiv:2202.01290v1 [cs.LG])
    Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy. In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights. To enable weight recovery, we propose a simple strategy called \textit{cyclical pruning} which requires the pruning schedule to be periodic and allows for weights pruned erroneously in one cycle to recover in subsequent ones. Experimental results on both linear models and large-scale deep neural networks show that cyclical pruning outperforms existing pruning algorithms, especially at high sparsity ratios. Our approach is easy to tune and can be readily incorporated into existing pruning pipelines to boost performance.  ( 2 min )
    Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model Classes and Cone Decompositions. (arXiv:2202.01331v1 [cs.LG])
    We develop fast algorithms and robust software for convex optimization of two-layer neural networks with ReLU activation functions. Our work leverages a convex reformulation of the standard weight-decay penalized training problem as a set of group-$\ell_1$-regularized data-local models, where locality is enforced by polyhedral cone constraints. In the special case of zero-regularization, we show that this problem is exactly equivalent to unconstrained optimization of a convex "gated ReLU" network. For problems with non-zero regularization, we show that convex gated ReLU models obtain data-dependent approximation bounds for the ReLU training problem. To optimize the convex reformulations, we develop an accelerated proximal gradient method and a practical augmented Lagrangian solver. We show that these approaches are faster than standard training heuristics for the non-convex problem, such as SGD, and outperform commercial interior-point solvers. Experimentally, we verify our theoretical results, explore the group-$\ell_1$ regularization path, and scale convex optimization for neural networks to image classification on MNIST and CIFAR-10.  ( 2 min )
    Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls. (arXiv:2202.01337v1 [cs.LG])
    Despite the great potential of machine learning, the lack of generalizability has hindered the widespread adoption of these technologies in routine clinical practice. We investigate three methodological pitfalls: (1) violation of independence assumption, (2) model evaluation with an inappropriate performance indicator, and (3) batch effect and how these pitfalls could affect the generalizability of machine learning models. We implement random forest and deep convolutional neural network models using several medical imaging datasets, including head and neck CT, lung CT, chest X-Ray, and histopathological images, to quantify and illustrate the effect of these pitfalls. We develop these models with and without the pitfall and compare the performance of the resulting models in terms of accuracy, precision, recall, and F1 score. Our results showed that violation of the independence assumption could substantially affect model generalizability. More specifically, (I) applying oversampling before splitting data into train, validation and test sets; (II) performing data augmentation before splitting data; (III) distributing data points for a subject across training, validation, and test sets; and (IV) applying feature selection before splitting data led to superficial boosts in model performance. We also observed that inappropriate performance indicators could lead to erroneous conclusions. Also, batch effect could lead to developing models that lack generalizability. The aforementioned methodological pitfalls lead to machine learning models with over-optimistic performance. These errors, if made, cannot be captured using internal model evaluation, and the inaccurate predictions made by the model may lead to wrong conclusions and interpretations. Therefore, avoiding these pitfalls is a necessary condition for developing generalizable models.  ( 2 min )
    On Monte Carlo Tree Search for Weighted Vertex Coloring. (arXiv:2202.01665v1 [cs.LG])
    This work presents the first study of using the popular Monte Carlo Tree Search (MCTS) method combined with dedicated heuristics for solving the Weighted Vertex Coloring Problem. Starting with the basic MCTS algorithm, we gradually introduce a number of algorithmic variants where MCTS is extended by various simulation strategies including greedy and local search heuristics. We conduct experiments on well-known benchmark instances to assess the value of each studied combination. We also provide empirical evidence to shed light on the advantages and limits of each strategy.  ( 2 min )
    Generative Flow Networks for Discrete Probabilistic Modeling. (arXiv:2202.01361v1 [cs.LG])
    We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet. We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks.  ( 2 min )
    Removing Distortion Effects in Music Using Deep Neural Networks. (arXiv:2202.01664v1 [eess.AS])
    Audio effects are an essential element in the context of music production, and therefore, modeling analog audio effects has been extensively researched for decades using system-identification methods, circuit simulation, and recently, deep learning. However, only few works tackled the reconstruction of signals that were processed using an audio effect unit. Given the recent advances in music source separation and automatic mixing, the removal of audio effects could facilitate an automatic remixing system. This paper focuses on removing distortion and clipping applied to guitar tracks for music production while presenting a comparative investigation of different deep neural network (DNN) architectures on this task. We achieve exceptionally good results in distortion removal using DNNs for effects that superimpose the clean signal to the distorted signal, while the task is more challenging if the clean signal is not superimposed. Nevertheless, in the latter case, the neural models under evaluation surpass one state-of-the-art declipping system in terms of source-to-distortion ratio, leading to better quality and faster inference.  ( 2 min )
    Can machines solve general queueing systems?. (arXiv:2202.01729v1 [cs.LG])
    In this paper, we analyze how well a machine can solve a general problem in queueing theory. To answer this question, we use a deep learning model to predict the stationary queue-length distribution of an $M/G/1$ queue (Poisson arrivals, general service times, one server). To the best of our knowledge, this is the first time a machine learning model is applied to a general queueing theory problem. We chose $M/G/1$ queue for this paper because it lies "on the cusp" of the analytical frontier: on the one hand exact solution for this model is available, which is both computationally and mathematically complex. On the other hand, the problem (specifically the service time distribution) is general. This allows us to compare the accuracy and efficiency of the deep learning approach to the analytical solutions. The two key challenges in applying machine learning to this problem are (1) generating a diverse set of training examples that provide a good representation of a "generic" positive-valued distribution, and (2) representations of the continuous distribution of service times as an input. We show how we overcome these challenges. Our results show that our model is indeed able to predict the stationary behavior of the $M/G/1$ queue extremely accurately: the average value of our metric over the entire test set is $0.0009$. Moreover, our machine learning model is very efficient, computing very accurate stationary distributions in a fraction of a second (an approach based on simulation modeling would take much longer to converge). We also present a case-study that mimics a real-life setting and shows that our approach is more robust and provides more accurate solutions compared to the existing methods. This shows the promise of extending our approach beyond the analytically solvable systems (e.g., $G/G/1$ or $G/G/c$).  ( 2 min )
    FORML: Learning to Reweight Data for Fairness. (arXiv:2202.01719v1 [cs.LG])
    Deployed machine learning models are evaluated by multiple metrics beyond accuracy, such as fairness and robustness. However, such models are typically trained to minimize the average loss for a single metric, which is typically a proxy for accuracy. Training to optimize a single metric leaves these models prone to fairness violations, especially when the population of sub-groups in the training data are imbalanced. This work addresses the challenge of jointly optimizing fairness and predictive performance in the multi-class classification setting by introducing Fairness Optimized Reweighting via Meta-Learning (FORML), a training algorithm that balances fairness constraints and accuracy by jointly optimizing training sample weights and a neural network's parameters. The approach increases fairness by learning to weight each training datum's contribution to the loss according to its impact on reducing fairness violations, balancing the contributions from both over- and under-represented sub-groups. We empirically validate FORML on a range of benchmark and real-world classification datasets and show that our approach improves equality of opportunity fairness criteria over existing state-of-the-art reweighting methods by approximately 1% on image classification tasks and by approximately 5% on a face attribute prediction task. This improvement is achieved without pre-processing data or post-processing model outputs, without learning an additional weighting function, and while maintaining accuracy on the original predictive metric.  ( 2 min )
    Accounting for Unobserved Confounding in Domain Generalization. (arXiv:2007.10653v6 [stat.ML] UPDATED)
    This paper investigates the problem of learning robust, generalizable prediction models from a combination of multiple datasets and qualitative assumptions about the underlying data-generating model. Part of the challenge of learning robust models lies in the influence of unobserved confounders that void many of the invariances and principles of minimum error presently used for this problem. Our approach is to define a different invariance property of causal solutions in the presence of unobserved confounders which, through a relaxation of this invariance, can be connected with an explicit distributionally robust optimization problem over a set of affine combination of data distributions. Concretely, our objective takes the form of a standard loss, plus a regularization term that encourages partial equality of error derivatives with respect to model parameters. We demonstrate the empirical performance of our approach on healthcare data from different modalities, including image, speech and tabular data.  ( 2 min )
    How to build a cognitive map: insights from models of the hippocampal formation. (arXiv:2202.01682v1 [q-bio.NC])
    Learning and interpreting the structure of the environment is an innate feature of biological systems, and is integral to guiding flexible behaviours for evolutionary viability. The concept of a cognitive map has emerged as one of the leading metaphors for these capacities, and unravelling the learning and neural representation of such a map has become a central focus of neuroscience. While experimentalists are providing a detailed picture of the neural substrate of cognitive maps in hippocampus and beyond, theorists have been busy building models to bridge the divide between neurons, computation, and behaviour. These models can account for a variety of known representations and neural phenomena, but often provide a differing understanding of not only the underlying principles of cognitive maps, but also the respective roles of hippocampus and cortex. In this Perspective, we bring many of these models into a common language, distil their underlying principles of constructing cognitive maps, provide novel (re)interpretations for neural phenomena, suggest how the principles can be extended to account for prefrontal cortex representations and, finally, speculate on the role of cognitive maps in higher cognitive capacities.  ( 2 min )
    Neural Transformation Learning for Deep Anomaly Detection Beyond Images. (arXiv:2103.16440v4 [cs.LG] UPDATED)
    Data transformations (e.g. rotations, reflections, and cropping) play an important role in self-supervised learning. Typically, images are transformed into different views, and neural networks trained on tasks involving these views produce useful feature representations for downstream tasks, including anomaly detection. However, for anomaly detection beyond image data, it is often unclear which transformations to use. Here we present a simple end-to-end procedure for anomaly detection with learnable transformations. The key idea is to embed the transformed data into a semantic space such that the transformed data still resemble their untransformed form, while different transformations are easily distinguishable. Extensive experiments on time series demonstrate that our proposed method outperforms existing approaches in the one-vs.-rest setting and is competitive in the more challenging n-vs.-rest anomaly detection task. On tabular datasets from the medical and cyber-security domains, our method learns domain-specific transformations and detects anomalies more accurately than previous work.  ( 2 min )
    PageRank algorithm for Directed Hypergraph. (arXiv:1909.01132v2 [cs.SI] UPDATED)
    During the last two decades, we easilly see that the World Wide Web's link structure is modeled as the directed graph. In this paper, we will model the World Wide Web's link structure as the directed hypergraph. Moreover, we will develop the PageRank algorithm for this directed hypergraph. Due to the lack of the World Wide Web directed hypergraph datasets, we will apply the PageRank algorithm to the metabolic network which is the directed hypergraph itself. The experiments show that our novel PageRank algorithm is successfully applied to this metabolic network.  ( 2 min )
    Formal Mathematics Statement Curriculum Learning. (arXiv:2202.01344v1 [cs.LG])
    We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.  ( 2 min )
    Approximating Full Conformal Prediction at Scale via Influence Functions. (arXiv:2202.01315v1 [cs.LG])
    Conformal prediction (CP) is a wrapper around traditional machine learning models, giving coverage guarantees under the sole assumption of exchangeability; in classification problems, for a chosen significance level $\varepsilon$, CP guarantees that the number of errors is at most $\varepsilon$, irrespective of whether the underlying model is misspecified. However, the prohibitive computational costs of full CP led researchers to design scalable alternatives, which alas do not attain the same guarantees or statistical power of full CP. In this paper, we use influence functions to efficiently approximate full CP. We prove that our method is a consistent approximation of full CP, and empirically show that the approximation error becomes smaller as the training set increases; e.g., for $10^{3}$ training points the two methods output p-values that are $<10^{-3}$ apart: a negligible error for any practical application. Our methods enable scaling full CP to large real-world datasets. We compare our full CP approximation ACP to mainstream CP alternatives, and observe that our method is computationally competitive whilst enjoying the statistical predictive power of full CP.  ( 2 min )
    Deep Layer-wise Networks Have Closed-Form Weights. (arXiv:2202.01210v1 [stat.ML])
    There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network \textit{one layer at a time} with only a "single forward pass" has been proposed as an alternative to bypass BP; we refer to these networks as "layer-wise" networks. We continue the work on layer-wise networks by answering two outstanding questions. First, $\textit{do they have a closed-form solution?}$ Second, $\textit{how do we know when to stop adding more layers?}$ This work proves that the Kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\textit{Neural Indicator Kernel}$.  ( 2 min )
    Machine Learning Solar Wind Driving Magnetospheric Convection in Tail Lobes. (arXiv:2202.01383v1 [astro-ph.SR])
    To quantitatively study the driving mechanisms of magnetospheric convection in the magnetotail lobes on a global scale, we utilize data from the ARTEMIS spacecraft in the deep tail and the Cluster spacecraft in the near tail. Previous work demonstrated that, in the lobes near the Moon, we can estimate the convection by utilizing ARTEMIS measurements of lunar ions velocity. In this paper, we analyze these datasets with machine learning models to determine what upstream factors drive the lobe convection in different magnetotail regions and thereby understand the mechanisms that control the dynamics of the tail lobes. Our results show that the correlations between the predicted and test convection velocities for the machine learning models (> 0.75) are much better than those of the multiple linear regression model (~ 0.23 - 0.43). The systematic analysis reveals that the IMF and magnetospheric activity play an important role in influencing plasma convection in the global magnetotail lobes.  ( 2 min )
    Multi-Resolution Factor Graph Based Stereo Correspondence Algorithm. (arXiv:2202.01309v1 [cs.CV])
    A dense depth-map of a scene at an arbitrary view orientation can be estimated from dense view correspondences among multiple lower-dimensional views of the scene. These low-dimensional view correspondences are dependent on the geometrical relationship among the views and the scene. Determining dense view correspondences is difficult in part due to presence of homogeneous regions in the scene and due to presence of occluded regions and illumination differences among the views. We present a new multi-resolution factor graph-based stereo matching algorithm (MR-FGS) that utilizes both intra- and inter-resolution dependencies among the views as well as among the disparity estimates. The proposed framework allows exchange of information among multiple resolutions of the correspondence problem and is useful for handling larger homogeneous regions in a scene. The MR-FGS algorithm was evaluated qualitatively and quantitatively using stereo pairs in the Middlebury stereo benchmark dataset based on commonly used performance measures. When compared to a recently developed factor graph model (FGS), the MR-FGS algorithm provided more accurate disparity estimates without requiring the commonly used post-processing procedure known as the left-right consistency check. The multi-resolution dependency constraint within the factor-graph model significantly improved contrast along depth boundaries in the MR-FGS generated disparity maps.  ( 2 min )
    Accelerated Quality-Diversity for Robotics through Massive Parallelism. (arXiv:2202.01258v1 [cs.NE])
    Quality-Diversity (QD) algorithms are a well-known approach to generate large collections of diverse and high-quality policies. However, QD algorithms are also known to be data-inefficient, requiring large amounts of computational resources and are slow when used in practice for robotics tasks. Policy evaluations are already commonly performed in parallel to speed up QD algorithms but have limited capabilities on a single machine as most physics simulators run on CPUs. With recent advances in simulators that run on accelerators, thousands of evaluations can performed in parallel on single GPU/TPU. In this paper, we present QDax, an implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible. We first demonstrate the improvements on the number of evaluations per second that parallelism using accelerated simulators can offer. More importantly, we show that QD algorithms are ideal candidates and can scale with massive parallelism to be run at interactive timescales. The increase in parallelism does not significantly affect the performance of QD algorithms, while reducing experiment runtimes by two factors of magnitudes, turning days of computation into minutes. These results show that QD can now benefit from hardware acceleration, which contributed significantly to the bloom of deep learning.  ( 2 min )
    Imitation Learning by Estimating Expertise of Demonstrators. (arXiv:2202.01288v1 [cs.LG])
    Many existing imitation learning datasets are collected from multiple demonstrators, each with different expertise at different parts of the environment. Yet, standard imitation learning algorithms typically treat all demonstrators as homogeneous, regardless of their expertise, absorbing the weaknesses of any suboptimal demonstrators. In this work, we show that unsupervised learning over demonstrator expertise can lead to a consistent boost in the performance of imitation learning algorithms. We develop and optimize a joint model over a learned policy and expertise levels of the demonstrators. This enables our model to learn from the optimal behavior and filter out the suboptimal behavior of each demonstrator. Our model learns a single policy that can outperform even the best demonstrator, and can be used to estimate the expertise of any demonstrator at any state. We illustrate our findings on real-robotic continuous control tasks from Robomimic and discrete environments such as MiniGrid and chess, out-performing competing methods in $21$ out of $23$ settings, with an average of $7\%$ and up to $60\%$ improvement in terms of the final reward.  ( 2 min )
    Inverse Bayesian Optimization: Learning Human Acquisition Functions in an Exploration vs Exploitation Search Task. (arXiv:2104.09237v2 [cs.HC] UPDATED)
    This paper introduces a probabilistic framework to estimate parameters of an acquisition function given observed human behavior that can be modeled as a collection of sample paths from a Bayesian optimization procedure. The methodology involves defining a likelihood on observed human behavior from an optimization task, where the likelihood is parameterized by a Bayesian optimization subroutine governed by an unknown acquisition function. This structure enables us to make inference on a subject's acquisition function while allowing their behavior to deviate around the solution to the Bayesian optimization subroutine. To test our methods, we designed a sequential optimization task which forced subjects to balance exploration and exploitation in search of an invisible target location. Applying our proposed methods to the resulting data, we find that many subjects tend to exhibit exploration preferences beyond that of standard acquisition functions to capture. Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task.  ( 2 min )
    ETSformer: Exponential Smoothing Transformers for Time-series Forecasting. (arXiv:2202.01381v1 [cs.LG])
    Transformers have been actively studied for time-series forecasting in recent years. While often showing promising results in various scenarios, traditional Transformers are not designed to fully exploit the characteristics of time-series data and thus suffer some fundamental limitations, e.g., they generally lack of decomposition capability and interpretability, and are neither effective nor efficient for long-term forecasting. In this paper, we propose ETSFormer, a novel time-series Transformer architecture, which exploits the principle of exponential smoothing in improving Transformers for time-series forecasting. In particular, inspired by the classical exponential smoothing methods in time-series forecasting, we propose the novel exponential smoothing attention (ESA) and frequency attention (FA) to replace the self-attention mechanism in vanilla Transformers, thus improving both accuracy and efficiency. Based on these, we redesign the Transformer architecture with modular decomposition blocks such that it can learn to decompose the time-series data into interpretable time-series components such as level, growth and seasonality. Extensive experiments on various time-series benchmarks validate the efficacy and advantages of the proposed method. The code and models of our implementations will be released.  ( 2 min )
    Toward Realistic Backdoor Injection Attacks on DNNs using Rowhammer. (arXiv:2110.07683v2 [cs.LG] UPDATED)
    State-of-the-art deep neural networks (DNNs) have been proven to be vulnerable to adversarial manipulation and backdoor attacks. Backdoored models deviate from expected behavior on inputs with predefined triggers while retaining performance on clean data. Recent works focus on software simulation of backdoor injection during the inference phase by modifying network weights, which we find often unrealistic in practice due to restrictions in hardware. In contrast, in this work for the first time we present an end-to-end backdoor injection attack realized on actual hardware on a classifier model using Rowhammer as the fault injection method. To this end, we first investigate the viability of backdoor injection attacks in real-life deployments of DNNs on hardware and address such practical issues in hardware implementation from a novel optimization perspective. We are motivated by the fact that the vulnerable memory locations are very rare, device-specific, and sparsely distributed. Consequently, we propose a novel network training algorithm based on constrained optimization to achieve a realistic backdoor injection attack in hardware. By modifying parameters uniformly across the convolutional and fully-connected layers as well as optimizing the trigger pattern together, we achieve the state-of-the-art attack performance with fewer bit flips. For instance, our method on a hardware-deployed ResNet-20 model trained on CIFAR-10 achieves over 91% test accuracy and 94% attack success rate by flipping only 10 out of 2.2 million bits.  ( 2 min )
    IGNNITION: Bridging the Gap Between Graph Neural Networks and Networking Systems. (arXiv:2109.06715v2 [cs.LG] UPDATED)
    Recent years have seen the vast potential of Graph Neural Networks (GNN) in many fields where data is structured as graphs (e.g., chemistry, recommender systems). In particular, GNNs are becoming increasingly popular in the field of networking, as graphs are intrinsically present at many levels (e.g., topology, routing). The main novelty of GNNs is their ability to generalize to other networks unseen during training, which is an essential feature for developing practical Machine Learning (ML) solutions for networking. However, implementing a functional GNN prototype is currently a cumbersome task that requires strong skills in neural network programming. This poses an important barrier to network engineers that often do not have the necessary ML expertise. In this article, we present IGNNITION, a novel open-source framework that enables fast prototyping of GNNs for networking systems. IGNNITION is based on an intuitive high-level abstraction that hides the complexity behind GNNs, while still offering great flexibility to build custom GNN architectures. To showcase the versatility and performance of this framework, we implement two state-of-the-art GNN models applied to different networking use cases. Our results show that the GNN models produced by IGNNITION are equivalent in terms of accuracy and performance to their native implementations in TensorFlow.  ( 2 min )
    GALAXY: Graph-based Active Learning at the Extreme. (arXiv:2202.01402v1 [cs.LG])
    Active learning is a label-efficient approach to train highly effective models while interactively selecting only small subsets of unlabelled data for labelling and training. In "open world" settings, the classes of interest can make up a small fraction of the overall dataset -- most of the data may be viewed as an out-of-distribution or irrelevant class. This leads to extreme class-imbalance, and our theory and methods focus on this core issue. We propose a new strategy for active learning called GALAXY (Graph-based Active Learning At the eXtrEme), which blends ideas from graph-based active learning and deep learning. GALAXY automatically and adaptively selects more class-balanced examples for labeling than most other methods for active learning. Our theory shows that GALAXY performs a refined form of uncertainty sampling that gathers a much more class-balanced dataset than vanilla uncertainty sampling. Experimentally, we demonstrate GALAXY's superiority over existing state-of-art deep active learning algorithms in unbalanced vision classification settings generated from popular datasets.  ( 2 min )
    Training Semantic Descriptors for Image-Based Localization. (arXiv:2202.01212v1 [cs.CV])
    Vision based solutions for the localization of vehicles have become popular recently. We employ an image retrieval based visual localization approach. The database images are kept with GPS coordinates and the location of the retrieved database image serves as an approximate position of the query image. We show that localization can be performed via descriptors solely extracted from semantically segmented images. It is reliable especially when the environment is subjected to severe illumination and seasonal changes. Our experiments reveal that the localization performance of a semantic descriptor can increase up to the level of state-of-the-art RGB image based methods.  ( 2 min )
    Byzantine-Robust Decentralized Learning via Self-Centered Clipping. (arXiv:2202.01545v1 [cs.LG])
    In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus. We identify a novel dissensus attack in which few malicious nodes can take advantage of information bottlenecks in the topology to poison the collaboration. To address these issues, we propose a Self-Centered Clipping (SCClip) algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to a $O(\delta_{\max}\zeta^2/\gamma^2)$ neighborhood of the stationary point for non-convex objectives under standard assumptions. Finally, we demonstrate the encouraging empirical performance of SCClip under a large number of attacks.  ( 2 min )
    Toric Geometry of Entropic Regularization. (arXiv:2202.01571v1 [math.OC])
    Entropic regularization is a method for large-scale linear programming. Geometrically, one traces intersections of the feasible polytope with scaled toric varieties, starting at the Birch point. We compare this to log-barrier methods, with reciprocal linear spaces, starting at the analytic center. We revisit entropic regularization for unbalanced optimal transport, and we develop the use of optimal conic couplings. We compute the degree of the associated toric variety, and we explore algorithms like iterative scaling.  ( 2 min )
    Fenrir: Physics-Enhanced Regression for Initial Value Problems. (arXiv:2202.01287v1 [cs.LG])
    We show how probabilistic numerics can be used to convert an initial value problem into a Gauss--Markov process parametrised by the dynamics of the initial value problem. Consequently, the often difficult problem of parameter estimation in ordinary differential equations is reduced to hyperparameter estimation in Gauss--Markov regression, which tends to be considerably easier. The method's relation and benefits in comparison to classical numerical integration and gradient matching approaches is elucidated. In particular, the method can, in contrast to gradient matching, handle partial observations, and has certain routes for escaping local optima not available to classical numerical integration. Experimental results demonstrate that the method is on par or moderately better than competing approaches.  ( 2 min )
    Learning Physics through Images: An Application to Wind-Driven Spatial Patterns. (arXiv:2202.01762v1 [cs.LG])
    For centuries, scientists have observed nature to understand the laws that govern the physical world. The traditional process of turning observations into physical understanding is slow. Imperfect models are constructed and tested to explain relationships in data. Powerful new algorithms are available that can enable computers to learn physics by observing images and videos. Inspired by this idea, instead of training machine learning models using physical quantities, we trained them using images, that is, pixel information. For this work, and as a proof of concept, the physics of interest are wind-driven spatial patterns. Examples of these phenomena include features in Aeolian dunes and the deposition of volcanic ash, wildfire smoke, and air pollution plumes. We assume that the spatial patterns were collected by an imaging device that records the magnitude of the logarithm of deposition as a red, green, blue (RGB) color image with channels containing values ranging from 0 to 255. In this paper, we explore deep convolutional neural network-based autoencoders to exploit relationships in wind-driven spatial patterns, which commonly occur in geosciences, and reduce their dimensionality. Reducing the data dimension size with an encoder allows us to train regression models linking geographic and meteorological scalar input quantities to the encoded space. Once this is achieved, full predictive spatial patterns are reconstructed using the decoder. We demonstrate this approach on images of spatial deposition from a pollution source, where the encoder compresses the dimensionality to 0.02% of the original size and the full predictive model performance on test data achieves an accuracy of 92%.  ( 2 min )
    Unpaired Image Super-Resolution with Optimal Transport Maps. (arXiv:2202.01116v1 [eess.IV] CROSS LISTED)
    Real-world image super-resolution (SR) tasks often do not have paired datasets limiting the application of supervised techniques. As a result, the tasks are usually approached by unpaired techniques based on Generative Adversarial Networks (GANs) which yield complex training losses with several regularization terms such as content and identity losses. We theoretically investigate the optimization problems which arise in such models and find two surprising observations. First, the learned SR map is always an optimal transport (OT) map. Second, we empirically show that the learned map is biased, i.e., it may not actually transform the distribution of low-resolution images to high-resolution images. Inspired by these findings, we propose an algorithm for unpaired SR which learns an unbiased OT map for the perceptual transport cost. Unlike existing GAN-based alternatives, our algorithm has a simple optimization objective reducing the neccesity to perform complex hyperparameter selection and use additional regularizations. At the same time, it provides nearly state-of-the-art performance on the large-scale unpaired AIM-19 dataset.  ( 2 min )
    Pre-Trained Language Models for Interactive Decision-Making. (arXiv:2202.01771v1 [cs.LG])
    Language model (LM) pre-training has proven useful for a wide variety of language processing tasks, but can such pre-training be leveraged for more general machine learning problems? We investigate the effectiveness of language modeling to scaffold learning and generalization in autonomous decision-making. We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings, and translated into actions using a policy network initialized with a pre-trained transformer LM. We demonstrate that this framework enables effective combinatorial generalization across different environments, such as VirtualHome and BabyAI. In particular, for test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6% in VirtualHome. We hypothesize and investigate three possible factors underlying the effectiveness of LM-based policy initialization. We find that sequential representations (vs. fixed-dimensional feature vectors) and the LM objective (not just the transformer architecture) are both important for generalization. Surprisingly, however, the format of the policy inputs encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.  ( 2 min )
    Training a Bidirectional GAN-based One-Class Classifier for Network Intrusion Detection. (arXiv:2202.01332v1 [cs.LG])
    The network intrusion detection task is challenging because of the imbalanced and unlabeled nature of the dataset it operates on. Existing generative adversarial networks (GANs), are primarily used for creating synthetic samples from reals. They also have been proved successful in anomaly detection tasks. In our proposed method, we construct the trained encoder-discriminator as a one-class classifier based on Bidirectional GAN (Bi-GAN) for detecting anomalous traffic from normal traffic other than calculating expensive and complex anomaly scores or thresholds. Our experimental result illustrates that our proposed method is highly effective to be used in network intrusion detection tasks and outperforms other similar generative methods on the NSL-KDD dataset.  ( 2 min )
    Rademacher Random Projections with Tensor Networks. (arXiv:2110.13970v3 [cs.LG] UPDATED)
    Random projection (RP) have recently emerged as popular techniques in the machine learning community for their ability in reducing the dimension of very high-dimensional tensors. Following the work in [30], we consider a tensorized random projection relying on Tensor Train (TT) decomposition where each element of the core tensors is drawn from a Rademacher distribution. Our theoretical results reveal that the Gaussian low-rank tensor represented in compressed form in TT format in [30] can be replaced by a TT tensor with core elements drawn from a Rademacher distribution with the same embedding size. Experiments on synthetic data demonstrate that tensorized Rademacher RP can outperform the tensorized Gaussian RP studied in [30]. In addition, we show both theoretically and experimentally, that the tensorized RP in the Matrix Product Operator (MPO) format is not a Johnson-Lindenstrauss transform (JLT) and therefore not a well-suited random projection map  ( 2 min )
    How to Leverage Unlabeled Data in Offline Reinforcement Learning. (arXiv:2202.01741v1 [cs.LG])
    Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive. How can we best leverage such unlabeled data in offline RL? One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all. While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results. We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels. Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings.  ( 2 min )
    Neural graphical modelling in continuous-time: consistency guarantees and algorithms. (arXiv:2105.02522v3 [stat.ML] UPDATED)
    The discovery of structure from time series data is a key problem in fields of study working with complex systems. Most identifiability results and learning algorithms assume the underlying dynamics to be discrete in time. Comparatively few, in contrast, explicitly define dependencies in infinitesimal intervals of time, independently of the scale of observation and of the regularity of sampling. In this paper, we consider score-based structure learning for the study of dynamical systems. We prove that for vector fields parameterized in a large class of neural networks, least squares optimization with adaptive regularization schemes consistently recovers directed graphs of local independencies in systems of stochastic differential equations. Using this insight, we propose a score-based learning algorithm based on penalized Neural Ordinary Differential Equations (modelling the mean process) that we show to be applicable to the general setting of irregularly-sampled multivariate time series and to outperform the state of the art across a range of dynamical systems.  ( 2 min )
    Optimality and Stability in Non-Convex Smooth Games. (arXiv:2002.11875v3 [cs.LG] UPDATED)
    Convergence to a saddle point for convex-concave functions has been studied for decades, while recent years has seen a surge of interest in non-convex (zero-sum) smooth games, motivated by their recent wide applications. It remains an intriguing research challenge how local optimal points are defined and which algorithm can converge to such points. An interesting concept is known as the local minimax point, which strongly correlates with the widely-known gradient descent ascent algorithm. This paper aims to provide a comprehensive analysis of local minimax points, such as their relation with other solution concepts and their optimality conditions. We find that local saddle points can be regarded as a special type of local minimax points, called uniformly local minimax points, under mild continuity assumptions. In (non-convex) quadratic games, we show that local minimax points are (in some sense) equivalent to global minimax points. Finally, we study the stability of gradient algorithms near local minimax points. Although gradient algorithms can converge to local/global minimax points in the non-degenerate case, they would often fail in general cases. This implies the necessity of either novel algorithms or concepts beyond saddle points and minimax points in non-convex smooth games.  ( 2 min )
    On Manifold Hypothesis: Hypersurface Submanifold Embedding Using Osculating Hyperspheres. (arXiv:2202.01619v1 [cs.LG])
    Consider a set of $n$ data points in the Euclidean space $\mathbb{R}^d$. This set is called dataset in machine learning and data science. Manifold hypothesis states that the dataset lies on a low-dimensional submanifold with high probability. All dimensionality reduction and manifold learning methods have the assumption of manifold hypothesis. In this paper, we show that the dataset lies on an embedded hypersurface submanifold which is locally $(d-1)$-dimensional. Hence, we show that the manifold hypothesis holds at least for the embedding dimensionality $d-1$. Using an induction in a pyramid structure, we also extend the embedding dimensionality to lower embedding dimensionalities to show the validity of manifold hypothesis for embedding dimensionalities $\{1, 2, \dots, d-1\}$. For embedding the hypersurface, we first construct the $d$ nearest neighbors graph for data. For every point, we fit an osculating hypersphere $S^{d-1}$ using its neighbors where this hypersphere is osculating to a hypothetical hypersurface. Then, using surgery theory, we apply surgery on the osculating hyperspheres to obtain $n$ hyper-caps. We connect the hyper-caps to one another using partial hyper-cylinders. By connecting all parts, the embedded hypersurface is obtained as the disjoint union of these elements. We discuss the geometrical characteristics of the embedded hypersurface, such as having boundary, its topology, smoothness, boundedness, orientability, compactness, and injectivity. Some discussion are also provided for the linearity and structure of data. This paper is the intersection of several fields of science including machine learning, differential geometry, and algebraic topology.  ( 2 min )
    Drawing Robust Scratch Tickets: Subnetworks with Inborn Robustness Are Found within Randomly Initialized Networks. (arXiv:2110.14068v3 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks, i.e., an imperceptible perturbation to the input can mislead DNNs trained on clean images into making erroneous predictions. To tackle this, adversarial training is currently the most effective defense method, by augmenting the training set with adversarial samples generated on the fly. Interestingly, we discover for the first time that there exist subnetworks with inborn robustness, matching or surpassing the robust accuracy of the adversarially trained networks with comparable model sizes, within randomly initialized networks without any model training, indicating that adversarial training on model weights is not indispensable towards adversarial robustness. We name such subnetworks Robust Scratch Tickets (RSTs), which are also by nature efficient. Distinct from the popular lottery ticket hypothesis, neither the original dense networks nor the identified RSTs need to be trained. To validate and understand this fascinating finding, we further conduct extensive experiments to study the existence and properties of RSTs under different models, datasets, sparsity patterns, and attacks, drawing insights regarding the relationship between DNNs' robustness and their initialization/overparameterization. Furthermore, we identify the poor adversarial transferability between RSTs of different sparsity ratios drawn from the same randomly initialized dense network, and propose a Random RST Switch (R2S) technique, which randomly switches between different RSTs, as a novel defense method built on top of RSTs. We believe our findings about RSTs have opened up a new perspective to study model robustness and extend the lottery ticket hypothesis.  ( 2 min )
    Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness. (arXiv:2106.08161v2 [stat.ML] UPDATED)
    Learning meaningful representations of data that can address challenges such as batch effect correction and counterfactual inference is a central problem in many domains including computational biology. Adopting a Conditional VAE framework, we show that marginal independence between the representation and a condition variable plays a key role in both of these challenges. We propose the Contrastive Mixture of Posteriors (CoMP) method that uses a novel misalignment penalty defined in terms of mixtures of the variational posteriors to enforce this independence in latent space. We show that CoMP has attractive theoretical properties compared to previous approaches and we prove counterfactual identifiability of CoMP under additional assumptions. We demonstrate state of the art performance on a set of challenging tasks including aligning human tumour samples with cancer cell-lines, predicting transcriptome-level perturbation responses, and batch correction on single-cell RNA sequencing data. We also find parallels to fair representation learning and demonstrate that CoMP is competitive on a common task in the field.  ( 2 min )
    A Deep Learning Based Multitask Network for Respiration Rate Estimation -- A Practical Perspective. (arXiv:2112.09071v2 [eess.SP] UPDATED)
    The exponential rise in wearable sensors has garnered significant interest in assessing the physiological parameters during day-to-day activities. Respiration rate is one of the vital parameters used in the performance assessment of lifestyle activities. However, obtrusive setup for measurement, motion artifacts, and other noises complicate the process. This paper presents a multitasking architecture based on Deep Learning (DL) for estimating instantaneous and average respiration rate from ECG and accelerometer signals, such that it performs efficiently under daily living activities like cycling, walking, etc. The multitasking network consists of a combination of Encoder-Decoder and Encoder-IncResNet, to fetch the average respiration rate and the respiration signal. The respiration signal can be leveraged to obtain the breathing peaks and instantaneous breathing cycles. Mean absolute error(MAE), Root mean square error (RMSE), inference time, and parameter count analysis has been used to compare the network with the current state of art Machine Learning (ML) model and other DL models developed in previous studies. Other DL configurations based on a variety of inputs are also developed as a part of the work. The proposed model showed better overall accuracy and gave better results than individual modalities during different activities.  ( 2 min )
    On the Accuracy of Analog Neural Network Inference Accelerators. (arXiv:2109.01262v3 [cs.AR] UPDATED)
    Specialized accelerators have recently garnered attention as a method to reduce the power consumption of neural network inference. A promising category of accelerators utilizes nonvolatile memory arrays to both store weights and perform $\textit{in situ}$ analog computation inside the array. While prior work has explored the design space of analog accelerators to optimize performance and energy efficiency, there is seldom a rigorous evaluation of the accuracy of these accelerators. This work shows how architectural design decisions, particularly in mapping neural network parameters to analog memory cells, influence inference accuracy. When evaluated using ResNet50 on ImageNet, the resilience of the system to analog non-idealities - cell programming errors, analog-to-digital converter resolution, and array parasitic resistances - all improve when analog quantities in the hardware are made proportional to the weights in the network. Moreover, contrary to the assumptions of prior work, nearly equivalent resilience to cell imprecision can be achieved by fully storing weights as analog quantities, rather than spreading weight bits across multiple devices, often referred to as bit slicing. By exploiting proportionality, analog system designers have the freedom to match the precision of the hardware to the needs of the algorithm, rather than attempting to guarantee the same level of precision in the intermediate results as an equivalent digital accelerator. This ultimately results in an analog accelerator that is more accurate, more robust to analog errors, and more energy-efficient.  ( 2 min )
    OnlineSTL: Scaling Time Series Decomposition by 100x. (arXiv:2107.09110v3 [cs.LG] UPDATED)
    Decomposing a complex time series into trend, seasonality, and remainder components is an important primitive that facilitates time series anomaly detection, change point detection, and forecasting. Although numerous batch algorithms are known for time series decomposition, none operate well in an online scalable setting where high throughput and real-time response are paramount. In this paper, we propose OnlineSTL, a novel online algorithm for time series decomposition which is highly scalable and is deployed for real-time metrics monitoring on high-resolution, high-ingest rate data. Experiments on different synthetic and real world time series datasets demonstrate that OnlineSTL achieves orders of magnitude speedups (100x) while maintaining quality of decomposition.  ( 2 min )
    Concept Representation Learning with Contrastive Self-Supervised Learning. (arXiv:2112.05677v2 [cs.LG] UPDATED)
    Concept-oriented deep learning (CODL) is a general approach to meet the future challenges for deep learning: (1) learning with little or no external supervision, (2) coping with test examples that come from a different distribution than the training examples, and (3) integrating deep learning with symbolic AI. In CODL, as in human learning, concept representations are learned based on concept exemplars. Contrastive self-supervised learning (CSSL) provides a promising approach to do so, since it: (1) uses data-driven associations, to get away from semantic labels, (2) supports incremental and continual learning, to get away from (large) fixed datasets, and (3) accommodates emergent objectives, to get away from fixed objectives (tasks). We discuss major aspects of concept representation learning using CSSL. These include dual-level concept representations, CSSL for feature representations, exemplar similarity measures and self-supervised relational reasoning, incremental and continual CSSL, and contrastive self-supervised concept (class) incremental learning. The discussion leverages recent findings from cognitive neural science and CSSL.  ( 2 min )
    Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron: The Haar Orthogonal Case. (arXiv:2103.13466v4 [stat.ML] UPDATED)
    Free Probability Theory (FPT) provides rich knowledge for handling mathematical difficulties caused by random matrices that appear in research related to deep neural networks (DNNs), such as the dynamical isometry, Fisher information matrix, and training dynamics. FPT suits these researches because the DNN's parameter-Jacobian and input-Jacobian are polynomials of layerwise Jacobians. However, the critical assumption of asymptotic freenss of the layerwise Jacobian has not been proven completely so far. The asymptotic freeness assumption plays a fundamental role when propagating spectral distributions through the layers. Haar distributed orthogonal matrices are essential for achieving dynamical isometry. In this work, we prove asymptotic freeness of layerwise Jacobians of multilayer perceptron (MLP) in this case. A key of the proof is an invariance of the MLP. Considering the orthogonal matrices that fix the hidden units in each layer, we replace each layer's parameter matrix with itself multiplied by the orthogonal matrix, and then the MLP does not change. Furthermore, if the original weights are Haar orthogonal, the Jacobian is also unchanged by this replacement. Lastly, we can replace each weight with a Haar orthogonal random matrix independent of the Jacobian of the activation function using this key fact.  ( 2 min )
    Text classification problems via BERT embedding method and graph convolutional neural network. (arXiv:2111.15379v2 [cs.CL] UPDATED)
    This paper presents the novel way combining the BERT embedding method and the graph convolutional neural network. This combination is employed to solve the text classification problem. Initially, we apply the BERT embedding method to the texts (in the BBC news dataset and the IMDB movie reviews dataset) in order to transform all the texts to numerical vector. Then, the graph convolutional neural network will be applied to these numerical vectors to classify these texts into their ap-propriate classes/labels. Experiments show that the performance of the graph convolutional neural network model is better than the perfor-mances of the combination of the BERT embedding method with clas-sical machine learning models.  ( 2 min )
    A Stochastic Alternating Balance $k$-Means Algorithm for Fair Clustering. (arXiv:2105.14172v2 [cs.LG] UPDATED)
    In the application of data clustering to human-centric decision-making systems, such as loan applications and advertisement recommendations, the clustering outcome might discriminate against people across different demographic groups, leading to unfairness. A natural conflict occurs between the cost of clustering (in terms of distance to cluster centers) and the balance representation of all demographic groups across the clusters, leading to a bi-objective optimization problem that is nonconvex and nonsmooth. To determine the complete trade-off between these two competing goals, we design a novel stochastic alternating balance fair $k$-means (SAfairKM) algorithm, which consists of alternating classical mini-batch $k$-means updates and group swap updates. The number of $k$-means updates and the number of swap updates essentially parameterize the weight put on optimizing each objective function. Our numerical experiments show that the proposed SAfairKM algorithm is robust and computationally efficient in constructing well-spread and high-quality Pareto fronts both on synthetic and real datasets.  ( 2 min )
    Measuring Disparate Outcomes of Content Recommendation Algorithms with Distributional Inequality Metrics. (arXiv:2202.01615v1 [cs.CY])
    The harmful impacts of algorithmic decision systems have recently come into focus, with many examples of systems such as machine learning (ML) models amplifying existing societal biases. Most metrics attempting to quantify disparities resulting from ML algorithms focus on differences between groups, dividing users based on demographic identities and comparing model performance or overall outcomes between these groups. However, in industry settings, such information is often not available, and inferring these characteristics carries its own risks and biases. Moreover, typical metrics that focus on a single classifier's output ignore the complex network of systems that produce outcomes in real-world settings. In this paper, we evaluate a set of metrics originating from economics, distributional inequality metrics, and their ability to measure disparities in content exposure in a production recommendation system, the Twitter algorithmic timeline. We define desirable criteria for metrics to be used in an operational setting, specifically by ML practitioners. We characterize different types of engagement with content on Twitter using these metrics, and use these results to evaluate the metrics with respect to the desired criteria. We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users. Overall, we conclude that these metrics can be useful tools for understanding disparate outcomes in online social networks.  ( 2 min )
    Sample and Communication-Efficient Decentralized Actor-Critic Algorithms with Finite-Time Analysis. (arXiv:2109.03699v2 [cs.LG] UPDATED)
    Actor-critic (AC) algorithms have been widely adopted in decentralized multi-agent systems to learn the optimal joint control policy. However, existing decentralized AC algorithms either do not preserve the privacy of agents or are not sample and communication-efficient. In this work, we develop two decentralized AC and natural AC (NAC) algorithms that are private, and sample and communication-efficient. In both algorithms, agents share noisy information to preserve privacy and adopt mini-batch updates to improve sample and communication efficiency. Particularly for decentralized NAC, we develop a decentralized Markovian SGD algorithm with an adaptive mini-batch size to efficiently compute the natural policy gradient. Under Markovian sampling and linear function approximation, we prove the proposed decentralized AC and NAC algorithms achieve the state-of-the-art sample complexities $\mathcal{O}\big(\epsilon^{-2}\ln(\epsilon^{-1})\big)$ and $\mathcal{O}\big(\epsilon^{-3}\ln(\epsilon^{-1})\big)$, respectively, and the same small communication complexity $\mathcal{O}\big(\epsilon^{-1}\ln(\epsilon^{-1})\big)$. Numerical experiments demonstrate that the proposed algorithms achieve lower sample and communication complexities than the existing decentralized AC algorithm.  ( 2 min )
    Global Optimization Networks. (arXiv:2202.01277v1 [stat.ML])
    We consider the problem of estimating a good maximizer of a black-box function given noisy examples. To solve such problems, we propose to fit a new type of function which we call a global optimization network (GON), defined as any composition of an invertible function and a unimodal function, whose unique global maximizer can be inferred in $\mathcal{O}(D)$ time. In this paper, we show how to construct invertible and unimodal functions by using linear inequality constraints on lattice models. We also extend to \emph{conditional} GONs that find a global maximizer conditioned on specified inputs of other dimensions. Experiments show the GON maximizers are statistically significantly better predictions than those produced by convex fits, GPR, or DNNs, and are more reasonable predictions for real-world problems.
    Understanding Cross-Domain Few-Shot Learning: An Experimental Study. (arXiv:2202.01339v1 [cs.LG])
    Cross-domain few-shot learning has drawn increasing attention for handling large differences between the source and target domains--an important concern in real-world scenarios. To overcome these large differences, recent works have considered exploiting small-scale unlabeled data from the target domain during the pre-training stage. This data enables self-supervised pre-training on the target domain, in addition to supervised pre-training on the source domain. In this paper, we empirically investigate scenarios under which it is advantageous to use each pre-training scheme, based on domain similarity and few-shot difficulty: performance gain of self-supervised pre-training over supervised pre-training increases when domain similarity is smaller or few-shot difficulty is lower. We further design two pre-training schemes, mixed-supervised and two-stage learning, that improve performance. In this light, we present seven findings for CD-FSL which are supported by extensive experiments and analyses on three source and eight target benchmark datasets with varying levels of domain similarity and few-shot difficulty. Our code is available at https://anonymous.4open.science/r/understandingCDFSL.  ( 2 min )
    Unified theory of atom-centered representations and graph convolutional machine-learning schemes. (arXiv:2202.01566v1 [stat.ML])
    Data-driven schemes that associate molecular and crystal structures with their microscopic properties share the need for a concise, effective description of the arrangement of their atomic constituents. Many types of models rely on descriptions of atom-centered environments, that are associated with an atomic property or with an atomic contribution to an extensive macroscopic quantity. Frameworks in this class can be understood in terms of atom-centered density correlations (ACDC), that are used as a basis for a body-ordered, symmetry-adapted expansion of the targets. Several other schemes, that gather information on the relationship between neighboring atoms using graph-convolutional (or message-passing) ideas, cannot be directly mapped to correlations centered around a single atom. We generalize the ACDC framework to include multi-centered information, generating representations that provide a complete linear basis to regress symmetric functions of atomic coordinates, and form the basis to systematize our understanding of both atom-centered and graph-convolutional machine-learning schemes.  ( 2 min )
    Multiclass learning with margin: exponential rates with no bias-variance trade-off. (arXiv:2202.01773v1 [stat.ML])
    We study the behavior of error bounds for multiclass classification under suitable margin conditions. For a wide variety of methods we prove that the classification error under a hard-margin condition decreases exponentially fast without any bias-variance trade-off. Different convergence rates can be obtained in correspondence of different margin assumptions. With a self-contained and instructive analysis we are able to generalize known results from the binary to the multiclass setting.  ( 2 min )
    ExPoSe: Combining State-Based Exploration with Gradient-Based Online Search. (arXiv:2202.01461v1 [cs.AI])
    A tree-based online search algorithm iteratively simulates trajectories and updates Q-value information on a set of states represented by a tree structure. Alternatively, policy gradient based online search algorithms update the information obtained from simulated trajectories directly onto the parameters of the policy and has been found to be effective. While tree-based methods limit the updates from simulations to the states that exist in the tree and do not interpolate the information to nearby states, policy gradient search methods do not do explicit exploration. In this paper, we show that it is possible to combine and leverage the strengths of these two methods for improved search performance. We examine the key reasons behind the improvement and propose a simple yet effective online search method, named Exploratory Policy Gradient Search (ExPoSe), that updates both the parameters of the policy as well as search information on the states in the trajectory. We conduct experiments on complex planning problems, which include Sokoban and Hamiltonian cycle search in sparse graphs and show that combining exploration with policy gradient improves online search performance.  ( 2 min )
    Multirate Training of Neural Networks. (arXiv:2106.10771v2 [cs.LG] UPDATED)
    We propose multirate training of neural networks: partitioning neural network parameters into "fast" and "slow" parts which are trained on different time scales. By choosing appropriate partitionings we can obtain substantial computational speed-up for transfer learning tasks. We show for applications in vision and NLP that we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting models. We analyze the convergence properties of our multirate scheme and draw a comparison with vanilla SGD. We also discuss splitting choices for the neural network parameters which could enhance generalization performance when neural networks are trained from scratch. A multirate approach can be used to learn different features present in the data and as a form of regularization. Our paper unlocks the potential of using multirate techniques for neural network training and provides several starting points for future work in this area.  ( 2 min )
    And/or trade-off in artificial neurons: impact on adversarial robustness. (arXiv:2102.07389v2 [cs.LG] UPDATED)
    Since its discovery in 2013, the phenomenon of adversarial examples has attracted a growing amount of attention from the machine learning community. A deeper understanding of the problem could lead to a better comprehension of how information is processed and encoded in neural networks and, more in general, could help to solve the issue of interpretability in machine learning. Our idea to increase adversarial resilience starts with the observation that artificial neurons can be divided in two broad categories: AND-like neurons and OR-like neurons. Intuitively, the former are characterised by a relatively low number of combinations of input values which trigger neuron activation, while for the latter the opposite is true. Our hypothesis is that the presence in a network of a sufficiently high number of OR-like neurons could lead to classification "brittleness" and increase the network's susceptibility to adversarial attacks. After constructing an operational definition of a neuron AND-like behaviour, we proceed to introduce several measures to increase the proportion of AND-like neurons in the network: L1 norm weight normalisation; application of an input filter; comparison between the neuron output's distribution obtained when the network is fed with the actual data set and the distribution obtained when the network is fed with a randomised version of the former called "scrambled data set". Tests performed on the MNIST data set hint that the proposed measures could represent an interesting direction to explore.  ( 2 min )
    What can linear interpolation of neural network loss landscapes tell us?. (arXiv:2106.16004v2 [cs.LG] UPDATED)
    Studying neural network loss landscapes provides insights into the nature of the underlying optimization problems. Unfortunately, loss landscapes are notoriously difficult to visualize in a human-comprehensible fashion. One common way to address this problem is to plot linear slices of the landscape, for example from the initial state of the network to the final state after optimization. On the basis of this analysis, prior work has drawn broader conclusions about the difficulty of the optimization problem. In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices. Further, we use linear interpolation to study the role played by individual layers and substructures of the network. We find that certain layers are more sensitive to the choice of initialization, but that the shape of the linear path is not indicative of the changes in test accuracy of the model. Our results cast doubt on the broader intuition that the presence or absence of barriers when interpolating necessarily relates to the success of optimization.  ( 2 min )
    Deep Hierarchy in Bandits. (arXiv:2202.01454v1 [cs.LG])
    Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits. Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions. We confirm these theoretical findings empirically, in both synthetic and real-world experiments.  ( 2 min )
    CoST: Contrastive Learning of Disentangled Seasonal-Trend Representations for Time Series Forecasting. (arXiv:2202.01575v1 [cs.LG])
    Deep learning has been actively studied for time series forecasting, and the mainstream paradigm is based on the end-to-end training of neural network architectures, ranging from classical LSTM/RNNs to more recent TCNs and Transformers. Motivated by the recent success of representation learning in computer vision and natural language processing, we argue that a more promising paradigm for time series forecasting, is to first learn disentangled feature representations, followed by a simple regression fine-tuning step -- we justify such a paradigm from a causal perspective. Following this principle, we propose a new time series representation learning framework for time series forecasting named CoST, which applies contrastive learning methods to learn disentangled seasonal-trend representations. CoST comprises both time domain and frequency domain contrastive losses to learn discriminative trend and seasonal representations, respectively. Extensive experiments on real-world datasets show that CoST consistently outperforms the state-of-the-art methods by a considerable margin, achieving a 21.3\% improvement in MSE on multivariate benchmarks. It is also robust to various choices of backbone encoders, as well as downstream regressors.  ( 2 min )
    Probing transfer learning with a model of synthetic correlated datasets. (arXiv:2106.05418v2 [cs.LG] UPDATED)
    Transfer learning can significantly improve the sample efficiency of neural networks, by exploiting the relatedness between a data-scarce target task and a data-abundant source task. Despite years of successful applications, transfer learning practice often relies on ad-hoc solutions, while theoretical understanding of these procedures is still limited. In the present work, we re-think a solvable model of synthetic data as a framework for modeling correlation between data-sets. This setup allows for an analytic characterization of the generalization performance obtained when transferring the learned feature map from the source to the target task. Focusing on the problem of training two-layer networks in a binary classification setting, we show that our model can capture a range of salient features of transfer learning with real data. Moreover, by exploiting parametric control over the correlation between the two data-sets, we systematically investigate under which conditions the transfer of features is beneficial for generalization.  ( 2 min )
    Robust Estimation for Nonparametric Families via Generative Adversarial Networks. (arXiv:2202.01269v1 [cs.LG])
    We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating unknown parameter of the true distribution given adversarially corrupted samples. Prior work focus on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyze depth or scoring rule based GAN losses for the problem. Our work extend these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions. We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation. In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work.  ( 2 min )
    Non-Vacuous Generalisation Bounds for Shallow Neural Networks. (arXiv:2202.01627v1 [cs.LG])
    We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function ("erf") activation or a Gaussian Error Linear Unit (GELU) activation. For these networks, we derive new generalisation bounds through the PAC-Bayesian theory; unlike most existing such bounds they apply to neural networks with deterministic rather than randomised parameters. Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST.  ( 2 min )
    Autobahn: Automorphism-based Graph Neural Nets. (arXiv:2103.01710v3 [cs.LG] UPDATED)
    We introduce Automorphism-based graph neural networks (Autobahn), a new family of graph neural networks. In an Autobahn, we decompose the graph into a collection of subgraphs and apply local convolutions that are equivariant to each subgraph's automorphism group. Specific choices of local neighborhoods and subgraphs recover existing architectures such as message passing neural networks. Our formalism also encompasses novel architectures: as an example, we introduce a graph neural network that decomposes the graph into paths and cycles. The resulting convolutions reflect the natural way that parts of the graph can transform, preserving the intuitive meaning of convolution without sacrificing global permutation equivariance. We validate our approach by applying Autobahn to molecular graphs, where it achieves results competitive with state-of-the-art message passing algorithms.  ( 2 min )
    Exploring Multi-physics with Extremely Weak Supervision. (arXiv:2202.01770v1 [physics.geo-ph])
    Multi-physical inversion plays a critical role in geophysics. It has been widely used to infer various physical properties (such as velocity and conductivity), simultaneously. Among those inversion problems, some are explicitly governed by partial differential equations (PDEs), while others are not. Without explicit governing equations, conventional multi-physical inversion techniques will not be feasible and data-driven inversion require expensive full labels. To overcome this issue, we develop a new data-driven multi-physics inversion technique with extremely weak supervision. Our key finding is that the pseudo labels can be constructed by learning the local relationship among geophysical properties at very sparse locations. We explore a multi-physics inversion problem from two distinct measurements (seismic and EM data) to three geophysical properties (velocity, conductivity, and CO$_2$ saturation). Our results show that we are able to invert for properties without explicit governing equations. Moreover, the label data on three geophysical properties can be significantly reduced by 50 times (from 100 down to only 2 locations).  ( 2 min )
    TS2Vec: Towards Universal Representation of Time Series. (arXiv:2106.10466v4 [cs.LG] UPDATED)
    This paper presents TS2Vec, a universal framework for learning representations of time series in an arbitrary semantic level. Unlike existing methods, TS2Vec performs contrastive learning in a hierarchical way over augmented context views, which enables a robust contextual representation for each timestamp. Furthermore, to obtain the representation of an arbitrary sub-sequence in the time series, we can apply a simple aggregation over the representations of corresponding timestamps. We conduct extensive experiments on time series classification tasks to evaluate the quality of time series representations. As a result, TS2Vec achieves significant improvement over existing SOTAs of unsupervised time series representation on 125 UCR datasets and 29 UEA datasets. The learned timestamp-level representations also achieve superior results in time series forecasting and anomaly detection tasks. A linear regression trained on top of the learned representations outperforms previous SOTAs of time series forecasting. Furthermore, we present a simple way to apply the learned representations for unsupervised anomaly detection, which establishes SOTA results in the literature. The source code is publicly available at https://github.com/yuezhihan/ts2vec.  ( 2 min )
    Deep Learning Algorithm for Threat Detection in Hackers Forum (Deep Web). (arXiv:2202.01448v1 [cs.CR])
    In our current society, the inter-connectivity of devices provides easy access for netizens to utilize cyberspace technology for illegal activities. The deep web platform is a consummative ecosystem shielded by boundaries of trust, information sharing, trade-off, and review systems. Domain knowledge is shared among experts in hacker's forums which contain indicators of compromise that can be explored for cyberthreat intelligence. Developing tools that can be deployed for threat detection is integral in securing digital communication in cyberspace. In this paper, we addressed the use of TOR relay nodes for anonymizing communications in deep web forums. We propose a novel approach for detecting cyberthreats using a deep learning algorithm Long Short-Term Memory (LSTM). The developed model outperformed the experimental results of other researchers in this problem domain with an accuracy of 94\% and precision of 90\%. Our model can be easily deployed by organizations in securing digital communications and detection of vulnerability exposure before cyberattack.  ( 2 min )
    Fair Representation Clustering with Several Protected Classes. (arXiv:2202.01391v1 [cs.DS])
    We study the problem of fair $k$-median where each cluster is required to have a fair representation of individuals from different groups. In the fair representation $k$-median problem, we are given a set of points $X$ in a metric space. Each point $x\in X$ belongs to one of $\ell$ groups. Further, we are given fair representation parameters $\alpha_j$ and $\beta_j$ for each group $j\in [\ell]$. We say that a $k$-clustering $C_1, \cdots, C_k$ fairly represents all groups if the number of points from group $j$ in cluster $C_i$ is between $\alpha_j |C_i|$ and $\beta_j |C_i|$ for every $j\in[\ell]$ and $i\in [k]$. The goal is to find a set $\mathcal{C}$ of $k$ centers and an assignment $\phi: X\rightarrow \mathcal{C}$ such that the clustering defined by $(\mathcal{C}, \phi)$ fairly represents all groups and minimizes the $\ell_1$-objective $\sum_{x\in X} d(x, \phi(x))$. We present an $O(\log k)$-approximation algorithm that runs in time $n^{O(\ell)}$. Note that the known algorithms for the problem either (i) violate the fairness constraints by an additive term or (ii) run in time that is exponential in both $k$ and $\ell$. We also consider an important special case of the problem where $\alpha_j = \beta_j = \frac{f_j}{f}$ and $f_j, f \in \mathbb{N}$ for all $j\in [\ell]$. For this special case, we present an $O(\log k)$-approximation algorithm that runs in $(kf)^{O(\ell)}\log n + poly(n)$ time.  ( 2 min )
    Reverse Engineering the Neural Tangent Kernel. (arXiv:2106.03186v2 [cs.LG] UPDATED)
    The development of methods to guide the design of neural networks is an important open challenge for deep learning theory. As a paradigm for principled neural architecture design, we propose the translation of high-performing kernels, which are better-understood and amenable to first-principles design, into equivalent network architectures, which have superior efficiency, flexibility, and feature learning. To this end, we constructively prove that, with just an appropriate choice of activation function, any positive-semidefinite dot-product kernel can be realized as either the conjugate or neural tangent kernel of a fully-connected neural network with only one hidden layer. We verify our construction numerically and demonstrate its utility as a design tool for finite fully-connected networks in several experiments.  ( 2 min )
    Improved Regret for Differentially Private Exploration in Linear MDP. (arXiv:2202.01292v1 [cs.LG])
    We study privacy-preserving exploration in sequential decision-making for environments that rely on sensitive data such as medical records. In particular, we focus on solving the problem of reinforcement learning (RL) subject to the constraint of (joint) differential privacy in the linear MDP setting, where both dynamics and rewards are given by linear functions. Prior work on this problem due to Luyo et al. (2021) achieves a regret rate that has a dependence of $O(K^{3/5})$ on the number of episodes $K$. We provide a private algorithm with an improved regret rate with an optimal dependence of $O(\sqrt{K})$ on the number of episodes. The key recipe for our stronger regret guarantee is the adaptivity in the policy update schedule, in which an update only occurs when sufficient changes in the data are detected. As a result, our algorithm benefits from low switching cost and only performs $O(\log(K))$ updates, which greatly reduces the amount of privacy noise. Finally, in the most prevalent privacy regimes where the privacy parameter $\epsilon$ is a constant, our algorithm incurs negligible privacy cost -- in comparison with the existing non-private regret bounds, the additional regret due to privacy appears in lower-order terms.  ( 2 min )
    Beyond Images: Label Noise Transition Matrix Estimation for Tasks with Lower-Quality Features. (arXiv:2202.01273v1 [cs.LG])
    The label noise transition matrix, denoting the transition probabilities from clean labels to noisy labels, is crucial knowledge for designing statistically robust solutions. Existing estimators for noise transition matrices, e.g., using either anchor points or clusterability, focus on computer vision tasks that are relatively easier to obtain high-quality representations. However, for other tasks with lower-quality features, the uninformative variables may obscure the useful counterpart and make anchor-point or clusterability conditions hard to satisfy. We empirically observe the failures of these approaches on a number of commonly used datasets. In this paper, to handle this issue, we propose a generally practical information-theoretic approach to down-weight the less informative parts of the lower-quality features. The salient technical challenge is to compute the relevant information-theoretical metrics using only noisy labels instead of clean ones. We prove that the celebrated $f$-mutual information measure can often preserve the order when calculated using noisy labels. The necessity and effectiveness of the proposed method is also demonstrated by evaluating the estimation error on a varied set of tabular data and text classification tasks with lower-quality features. Code is available at github.com/UCSC-REAL/Est-T-MI.  ( 2 min )
    How many degrees of freedom do we need to train deep networks: a loss landscape perspective. (arXiv:2107.05802v2 [cs.LG] UPDATED)
    A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We analyze this phenomenon for random subspaces by first examining the success probability of hitting a training loss sub-level set when training within a random subspace of a given training dimensionality. We find a sharp phase transition in the success probability from $0$ to $1$ as the training dimension surpasses a threshold. This threshold training dimension increases as the desired final loss decreases, but decreases as the initial loss decreases. We then theoretically explain the origin of this phase transition, and its dependence on initialization and final desired loss, in terms of properties of the high-dimensional geometry of the loss landscape. In particular, we show via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large. In several architectures and datasets, we measure the threshold training dimension as a function of initialization and demonstrate that it is a small fraction of the total parameters, implying by our theory that successful training with so few dimensions is possible precisely because the Gaussian width of low loss sub-level sets is very large. Moreover, we compare this threshold training dimension to more sophisticated ways of reducing training degrees of freedom, including lottery tickets as well as a new, analogous method: lottery subspaces. Code is available at https://github.com/ganguli-lab/degrees-of-freedom.  ( 2 min )
    Resource Management and Security Scheme of ICPSs and IoT Based on VNE Algorithm. (arXiv:2202.01375v1 [cs.CR])
    The development of Intelligent Cyber-Physical Systems (ICPSs) in virtual network environment is facing severe challenges. On the one hand, the Internet of things (IoT) based on ICPSs construction needs a large amount of reasonable network resources support. On the other hand, ICPSs are facing severe network security problems. The integration of ICPSs and network virtualization (NV) can provide more efficient network resource support and security guarantees for IoT users. Based on the above two problems faced by ICPSs, we propose a virtual network embedded (VNE) algorithm with computing, storage resources and security constraints to ensure the rationality and security of resource allocation in ICPSs. In particular, we use reinforcement learning (RL) method as a means to improve algorithm performance. We extract the important attribute characteristics of underlying network as the training environment of RL agent. Agent can derive the optimal node embedding strategy through training, so as to meet the requirements of ICPSs for resource management and security. The embedding of virtual links is based on the breadth first search (BFS) strategy. Therefore, this is a comprehensive two-stage RL-VNE algorithm considering the constraints of computing, storage and security three-dimensional resources. Finally, we design a large number of simulation experiments from the perspective of typical indicators of VNE algorithms. The experimental results effectively illustrate the effectiveness of the algorithm in the application of ICPSs.  ( 2 min )
    Near-Optimal Learning of Extensive-Form Games with Imperfect Information. (arXiv:2202.01752v1 [cs.LG])
    This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the information-theoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts. We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games.  ( 2 min )
    SubOmiEmbed: Self-supervised Representation Learning of Multi-omics Data for Cancer Type Classification. (arXiv:2202.01672v1 [cs.LG])
    For personalized medicines, very crucial intrinsic information is present in high dimensional omics data which is difficult to capture due to the large number of molecular features and small number of available samples. Different types of omics data show various aspects of samples. Integration and analysis of multi-omics data give us a broad view of tumours, which can improve clinical decision making. Omics data, mainly DNA methylation and gene expression profiles are usually high dimensional data with a lot of molecular features. In recent years, variational autoencoders (VAE) have been extensively used in embedding image and text data into lower dimensional latent spaces. In our project, we extend the idea of using a VAE model for low dimensional latent space extraction with the self-supervised learning technique of feature subsetting. With VAEs, the key idea is to make the model learn meaningful representations from different types of omics data, which could then be used for downstream tasks such as cancer type classification. The main goals are to overcome the curse of dimensionality and integrate methylation and expression data to combine information about different aspects of same tissue samples, and hopefully extract biologically relevant features. Our extension involves training encoder and decoder to reconstruct the data from just a subset of it. By doing this, we force the model to encode most important information in the latent representation. We also added an identity to the subsets so that the model knows which subset is being fed into it during training and testing. We experimented with our approach and found that SubOmiEmbed produces comparable results to the baseline OmiEmbed with a much smaller network and by using just a subset of the data. This work can be improved to integrate mutation-based genomic data as well.  ( 2 min )
    The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective. (arXiv:2202.01602v1 [cs.LG])
    As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of if and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how do practitioners resolve these disagreements. To this end, we first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and eight different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that state-of-the-art explanation methods often disagree in terms of the explanations they output. Our findings underscore the importance of developing principled evaluation metrics that enable practitioners to effectively compare explanations.  ( 2 min )
    Approximate Bisimulation Relations for Neural Networks and Application to Assured Neural Network Compression. (arXiv:2202.01214v1 [cs.LG])
    In this paper, we propose a concept of approximate bisimulation relation for feedforward neural networks. In the framework of approximate bisimulation relation, a novel neural network merging method is developed to compute the approximate bisimulation error between two neural networks based on reachability analysis of neural networks. The developed method is able to quantitatively measure the distance between the outputs of two neural networks with the same inputs. Then, we apply the approximate bisimulation relation results to perform neural networks model reduction and compute the compression precision, i.e., assured neural networks compression. At last, using the assured neural network compression, we accelerate the verification processes of ACAS Xu neural networks to illustrate the effectiveness and advantages of our proposed approximate bisimulation approach.  ( 2 min )
    Fast and explainable clustering based on sorting. (arXiv:2202.01456v1 [cs.LG])
    We introduce a fast and explainable clustering method called CLASSIX. It consists of two phases, namely a greedy aggregation phase of the sorted data into groups of nearby data points, followed by the merging of groups into clusters. The algorithm is controlled by two scalar parameters, namely a distance parameter for the aggregation and another parameter controlling the minimal cluster size. Extensive experiments are conducted to give a comprehensive evaluation of the clustering performance on synthetic and real-world datasets, with various cluster shapes and low to high feature dimensionality. Our experiments demonstrate that CLASSIX competes with state-of-the-art clustering algorithms. The algorithm has linear space complexity and achieves near linear time complexity on a wide range of problems. Its inherent simplicity allows for the generation of intuitive explanations of the computed clusters.  ( 2 min )
    Influence-Augmented Local Simulators: A Scalable Solution for Fast Deep RL in Large Networked Systems. (arXiv:2202.01534v1 [cs.LG])
    Learning effective policies for real-world problems is still an open challenge for the field of reinforcement learning (RL). The main limitation being the amount of data needed and the pace at which that data can be obtained. In this paper, we study how to build lightweight simulators of complicated systems that can run sufficiently fast for deep RL to be applicable. We focus on domains where agents interact with a reduced portion of a larger environment while still being affected by the global dynamics. Our method combines the use of local simulators with learned models that mimic the influence of the global system. The experiments reveal that incorporating this idea into the deep RL workflow can considerably accelerate the training process and presents several opportunities for the future.  ( 2 min )
    Can Transformers be Strong Treatment Effect Estimators?. (arXiv:2202.01336v1 [cs.LG])
    In this paper, we develop a general framework for based on Transformer architectures to address a variety of challenging treatment effect estimation (TEE) problems. Our methods are applicable both when covariates are tabular and when they consist of sequences (e.g., in text), and can handle discrete, continuous, structured, or dosage-associated treatments. While Transformers have already emerged as dominant methods for diverse domains, including natural language and computer vision, our experiments with Transformers as Treatment Effect Estimators (TransTEE) demonstrate that these inductive biases are also effective on the sorts of estimation problems and datasets that arise in research aimed at estimating causal effects. Moreover, we propose a propensity score network that is trained with TransTEE in an adversarial manner to promote independence between covariates and treatments to further address selection bias. Through extensive experiments, we show that TransTEE significantly outperforms competitive baselines with greater parameter efficiency over a wide range of benchmarks and settings.  ( 2 min )
    The use of Generative Adversarial Networks to characterise new physics in multi-lepton final states at the LHC. (arXiv:2105.14933v4 [hep-ph] UPDATED)
    Semi-supervision in Machine Learning can be used in searches for new physics where the signal plus background regions are not labelled. This strongly reduces model dependency in the search for signals Beyond the Standard Model. This approach displays the drawback in that over-fitting can give rise to fake signals. Tossing toy Monte Carlo (MC) events can be used to estimate the corresponding trials factor through a frequentist inference. However, MC events that are based on full detector simulations are resource intensive. Generative Adversarial Networks (GANs) can be used to mimic MC generators. GANs are powerful generative models, but often suffer from training instability. We henceforth show a review of GANs. We advocate the use of Wasserstein GAN (WGAN) with weight clipping and WGAN with gradient penalty (WGAN-GP) where the norm of gradient of the critic is penalized with respect to its input. Following the emergence of multi-lepton anomalies, we apply GANs for the generation of di-leptons final states in association with $b$-quarks at the LHC. A good agreement between the MC and the WGAN-GP generated events is found for the observables selected in the study.  ( 2 min )
    GPUTreeShap: Massively Parallel Exact Calculation of SHAP Scores for Tree Ensembles. (arXiv:2010.13972v3 [cs.LG] UPDATED)
    SHAP (SHapley Additive exPlanation) values provide a game theoretic interpretation of the predictions of machine learning models based on Shapley values. While exact calculation of SHAP values is computationally intractable in general, a recursive polynomial-time algorithm called TreeShap is available for decision tree models. However, despite its polynomial time complexity, TreeShap can become a significant bottleneck in practical machine learning pipelines when applied to large decision tree ensembles. Unfortunately, the complicated TreeShap algorithm is difficult to map to hardware accelerators such as GPUs. In this work, we present GPUTreeShap, a reformulated TreeShap algorithm suitable for massively parallel computation on graphics processing units. Our approach first preprocesses each decision tree to isolate variable sized sub-problems from the original recursive algorithm, then solves a bin packing problem, and finally maps sub-problems to single-instruction, multiple-thread (SIMT) tasks for parallel execution with specialised hardware instructions. With a single NVIDIA Tesla V100-32 GPU, we achieve speedups of up to 19x for SHAP values, and speedups of up to 340x for SHAP interaction values, over a state-of-the-art multi-core CPU implementation executed on two 20-core Xeon E5-2698 v4 2.2 GHz CPUs. We also experiment with multi-GPU computing using eight V100 GPUs, demonstrating throughput of 1.2M rows per second -- equivalent CPU-based performance is estimated to require 6850 CPU cores.  ( 2 min )
    NoisyMix: Boosting Robustness by Combining Data Augmentations, Stability Training, and Noise Injections. (arXiv:2202.01263v1 [cs.LG])
    For many real-world applications, obtaining stable and robust statistical performance is more important than simply achieving state-of-the-art predictive test accuracy, and thus robustness of neural networks is an increasingly important topic. Relatedly, data augmentation schemes have been shown to improve robustness with respect to input perturbations and domain shifts. Motivated by this, we introduce NoisyMix, a training scheme that combines data augmentations with stability training and noise injections to improve both model robustness and in-domain accuracy. This combination promotes models that are consistently more robust and that provide well-calibrated estimates of class membership probabilities. We demonstrate the benefits of NoisyMix on a range of benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P. Moreover, we provide theory to understand implicit regularization and robustness of NoisyMix.  ( 2 min )
    Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. (arXiv:2202.01562v1 [stat.ML])
    In real-world recommender systems and search engines, optimizing ranking decisions to present a ranked list of relevant items is critical. Off-policy evaluation (OPE) for ranking policies is thus gaining a growing interest because it enables performance estimation of new ranking policies using only logged data. Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting faces a critical variance issue due to the huge item space. To tackle this problem, previous studies introduce some assumptions on user behavior to make the combinatorial item space tractable. However, an unrealistic assumption may, in turn, cause serious bias. Therefore, appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is the key for success in OPE of ranking policies. To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking. We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions. Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces the variance by leveraging a control variate. Comprehensive experiments on both synthetic and real-world data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings.  ( 2 min )
    FedSpace: An Efficient Federated Learning Framework at Satellites and Ground Stations. (arXiv:2202.01267v1 [cs.LG])
    Large-scale deployments of low Earth orbit (LEO) satellites collect massive amount of Earth imageries and sensor data, which can empower machine learning (ML) to address global challenges such as real-time disaster navigation and mitigation. However, it is often infeasible to download all the high-resolution images and train these ML models on the ground because of limited downlink bandwidth, sparse connectivity, and regularization constraints on the imagery resolution. To address these challenges, we leverage Federated Learning (FL), where ground stations and satellites collaboratively train a global ML model without sharing the captured images on the satellites. We show fundamental challenges in applying existing FL algorithms among satellites and ground stations, and we formulate an optimization problem which captures a unique trade-off between staleness and idleness. We propose a novel FL framework, named FedSpace, which dynamically schedules model aggregation based on the deterministic and time-varying connectivity according to satellite orbits. Extensive numerical evaluations based on real-world satellite images and satellite networks show that FedSpace reduces the training time by 1.7 days (38.6%) over the state-of-the-art FL algorithms.  ( 2 min )
    Causal Imitation Learning under Temporally Correlated Noise. (arXiv:2202.01312v1 [cs.LG])
    We develop algorithms for imitation learning from policy data that was corrupted by temporally correlated noise in expert actions. When noise affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance. To break up these spurious correlations, we apply modern variants of the instrumental variable regression (IVR) technique of econometrics, enabling us to recover the underlying policy without requiring access to an interactive expert. In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator, and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline. We find both of our algorithms compare favorably to behavioral cloning on simulated control tasks.  ( 2 min )
    Challenging Common Assumptions in Convex Reinforcement Learning. (arXiv:2202.01511v1 [cs.LG])
    The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex functions of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration. In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound since the infinite trials and finite trials objectives can be proved to coincide and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in the convex RL setting. In particular, we show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error. Since the finite trials setting is the default in both simulated and real-world RL, we believe shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration among others.  ( 2 min )
    Noise-robust classification with hypergraph neural network. (arXiv:2102.01934v2 [stat.ML] UPDATED)
    This paper presents a novel version of the hypergraph neural network method. This method is utilized to solve the noisy label learning problem. First, we apply the PCA dimensional reduction technique to the feature matrices of the image datasets in order to reduce the "noise" and the redundant features in the feature matrices of the image datasets and to reduce the runtime constructing the hypergraph of the hypergraph neural network method. Then, the classic graph-based semi-supervised learning method, the classic hypergraph based semi-supervised learning method, the graph neural network, the hypergraph neural network, and our proposed hypergraph neural network are employed to solve the noisy label learning problem. The accuracies of these five methods are evaluated and compared. Experimental results show that the hypergraph neural network methods achieve the best performance when the noise level increases. Moreover, the hypergraph neural network methods are at least as good as the graph neural network.  ( 2 min )
    RoFL: Attestable Robustness for Secure Federated Learning. (arXiv:2107.03311v3 [cs.CR] UPDATED)
    Even though recent years have seen many attacks exposing severe vulnerabilities in federated learning (FL), a holistic understanding of what enables these attacks and how they can be mitigated effectively is still lacking. In this work we demystify the inner workings of existing targeted attacks. We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging. We show that the need for ML algorithms to memorize tail data has significant implications for FL integrity. This phenomenon has largely been studied in the context of privacy; our analysis sheds light on its implications for ML integrity. In addition, we show how constraints on client updates can effectively improve robustness. To incorporate these constraints into secure FL protocols, we design and develop RoFL, a new secure FL system that enables constraints to be expressed and enforced on high-dimensional encrypted model updates. In essence, RoFL augments existing secure FL aggregation protocols with zero-knowledge proofs. Due to the scale of FL, realizing these checks efficiently presents a paramount challenge. We introduce several optimizations at the ML layer that allow us to reduce the number of cryptographic checks needed while preserving the effectiveness of our defenses. We show that RoFL scales to the sizes of models used in real-world FL deployments.  ( 2 min )
    Equality Is Not Equity: Proportional Fairness in Federated Learning. (arXiv:2202.01666v1 [cs.LG])
    Ensuring fairness of machine learning (ML) algorithms is becoming an increasingly important mission for ML service providers. This is even more critical and challenging in the federated learning (FL) scenario, given a large number of diverse participating clients. Simply mandating equality across clients could lead to many undesirable consequences, potentially discouraging high-performing clients and resulting in sub-optimal overall performance. In order to achieve better equity rather than equality, in this work, we introduce and study proportional fairness (PF) in FL, which has a deep connection with game theory. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for effectively finding PF solutions, and we prove its convergence properties. We illustrate through experiments that PropFair consistently improves the worst-case and the overall performances simultaneously over state-of-the-art fair FL algorithms for a wide array of vision and language datasets, thus achieving better equity.  ( 2 min )
    Graph Coloring with Physics-Inspired Graph Neural Networks. (arXiv:2202.01606v1 [cs.LG])
    We show how graph neural networks can be used to solve the canonical graph coloring problem. We frame graph coloring as a multi-class node classification problem and utilize an unsupervised training strategy based on the statistical physics Potts model. Generalizations to other multi-class problems such as community detection, data clustering, and the minimum clique cover problem are straightforward. We provide numerical benchmark results and illustrate our approach with an end-to-end application for a real-world scheduling use case within a comprehensive encode-process-decode framework. Our optimization approach performs on par or outperforms existing solvers, with the ability to scale to problems with millions of variables.  ( 2 min )
    Variational Nearest Neighbor Gaussian Processes. (arXiv:2202.01694v1 [cs.LG])
    Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of O($K^3$). Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location. We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.  ( 2 min )
    Certifying Out-of-Domain Generalization for Blackbox Functions. (arXiv:2202.01679v1 [cs.LG])
    Certifying the robustness of model performance under bounded data distribution shifts has recently attracted intensive interests under the umbrella of distributional robustness. However, existing techniques either make strong assumptions on the model class and loss functions that can be certified, such as smoothness expressed via Lipschitz continuity of gradients, or require to solve complex optimization problems. As a result, the wider application of these techniques is currently limited by its scalability and flexibility -- these techniques often do not scale to large-scale datasets with modern deep neural networks or cannot handle loss functions which may be non-smooth, such as the 0-1 loss. In this paper, we focus on the problem of certifying distributional robustness for black box models and bounded losses, without other assumptions. We propose a novel certification framework given bounded distance of mean and variance of two distributions. Our certification technique scales to ImageNet-scale datasets, complex models, and a diverse range of loss functions. We then focus on one specific application enabled by such scalability and flexibility, i.e., certifying out-of-domain generalization for large neural networks and loss functions such as accuracy and AUC. We experimentally validate our certification method on a number of datasets, ranging from ImageNet, where we provide the first non-vacuous certified out-of-domain generalization, to smaller classification tasks where we are able to compare with the state-of-the-art and show that our method performs considerably better.  ( 2 min )
    Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints. (arXiv:2202.01661v1 [cs.CY])
    In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered. Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.  ( 2 min )
    MRI Reconstruction via Data Driven Markov Chain with Joint Uncertainty Estimation. (arXiv:2202.01479v1 [cs.LG])
    We introduce a framework that enables efficient sampling from learned probability distributions for MRI reconstruction. Different from conventional deep learning-based MRI reconstruction techniques, samples are drawn from the posterior distribution given the measured k-space using the Markov chain Monte Carlo (MCMC) method. In addition to the maximum a posteriori (MAP) estimate for the image, which can be obtained with conventional methods, the minimum mean square error (MMSE) estimate and uncertainty maps can also be computed. The data-driven Markov chains are constructed from the generative model learned from a given image database and are independent of the forward operator that is used to model the k-space measurement. This provides flexibility because the method can be applied to k-space acquired with different sampling schemes or receive coils using the same pre-trained models. Furthermore, we use a framework based on a reverse diffusion process to be able to utilize advanced generative models. The performance of the method is evaluated on an open dataset using 10-fold accelerated acquisition.  ( 2 min )
    Assessing External Validity Over Worst-case Subpopulations. (arXiv:2007.02411v3 [stat.ML] UPDATED)
    Study populations are typically sampled from limited points in space and time, and marginalized groups are underrepresented. To assess the external validity of randomized and observational studies, we propose and evaluate the worst-case treatment effect (WTE) across all subpopulations of a given size, which guarantees positive findings remain valid over subpopulations. We develop a semiparametrically efficient estimator for the WTE that analyzes the external validity of the augmented inverse propensity weighted estimator for the average treatment effect. Our cross-fitting procedure leverages flexible nonparametric and machine learning-based estimates of nuisance parameters and is a regular root-$n$ estimator even when nuisance estimates converge more slowly. On real examples where external validity is of core concern, our proposed framework guards against brittle findings that are invalidated by unanticipated population shifts.  ( 2 min )
    Towards Coherent and Consistent Use of Entities in Narrative Generation. (arXiv:2202.01709v1 [cs.CL])
    Large pre-trained language models (LMs) have demonstrated impressive capabilities in generating long, fluent text; however, there is little to no analysis on their ability to maintain entity coherence and consistency. In this work, we focus on the end task of narrative generation and systematically analyse the long-range entity coherence and consistency in generated stories. First, we propose a set of automatic metrics for measuring model performance in terms of entity usage. Given these metrics, we quantify the limitations of current LMs. Next, we propose augmenting a pre-trained LM with a dynamic entity memory in an end-to-end manner by using an auxiliary entity-related loss for guiding the reads and writes to the memory. We demonstrate that the dynamic entity memory increases entity coherence according to both automatic and human judgment and helps preserving entity-related information especially in settings with a limited context window. Finally, we also validate that our automatic metrics are correlated with human ratings and serve as a good indicator of the quality of generated stories.  ( 2 min )
    A Comparison of Online Hate on Reddit and 4chan: A Case Study of the 2020 US Election. (arXiv:2202.01302v1 [cs.CL])
    The rapid integration of the Internet into our daily lives has led to many benefits but also to a number of new, wide-spread threats such as online hate, trolling, bullying, and generally aggressive behaviours. While research has traditionally explored online hate, in particular, on one platform, the reality is that such hate is a phenomenon that often makes use of multiple online networks. In this article, we seek to advance the discussion into online hate by harnessing a comparative approach, where we make use of various Natural Language Processing (NLP) techniques to computationally analyse hateful content from Reddit and 4chan relating to the 2020 US Presidential Elections. Our findings show how content and posting activity can differ depending on the platform being used. Through this, we provide initial comparison into the platform-specific behaviours of online hate, and how different platforms can serve specific purposes. We further provide several avenues for future research utilising a cross-platform approach so as to gain a more comprehensive understanding of the global hate ecosystem.  ( 2 min )
    A benchmark of state-of-the-art sound event detection systems evaluated on synthetic soundscapes. (arXiv:2202.01487v1 [eess.AS])
    This paper proposes a benchmark of submissions to Detection and Classification Acoustic Scene and Events 2021 Challenge (DCASE) Task 4 representing a sampling of the state-of-the-art in Sound Event Detection task. The submissions are evaluated according to the two polyphonic sound detection score scenarios proposed for the DCASE 2021 Challenge Task 4, which allow to make an analysis on whether submissions are designed to perform fine-grained temporal segmentation, coarse-grained temporal segmentation, or have been designed to be polyvalent on the scenarios proposed. We study the solutions proposed by participants to analyze their robustness to varying level target to non-target signal-to-noise ratio and to temporal localization of target sound events. A last experiment is proposed in order to study the impact of non-target events on systems outputs. Results show that systems adapted to provide coarse segmentation outputs are more robust to different target to non-target signal-to-noise ratio and, with the help of specific data augmentation methods, they are more robust to time localization of the original event. Results of the last experiment display that systems tend to spuriously predict short events when non-target events are present. This is particularly true for systems that are tailored to have a fine segmentation.  ( 2 min )
    Comparative assessment of federated and centralized machine learning. (arXiv:2202.01529v1 [cs.LG])
    Federated Learning (FL) is a privacy preserving machine learning scheme, where training happens with data federated across devices and not leaving them to sustain user privacy. This is ensured by making the untrained or partially trained models to reach directly the individual devices and getting locally trained "on-device" using the device owned data, and the server aggregating all the partially trained model learnings to update a global model. Although almost all the model learning schemes in the federated learning setup use gradient descent, there are certain characteristic differences brought about by the non-IID nature of the data availability, that affects the training in comparison to the centralized schemes. In this paper, we discuss the various factors that affect the federated learning training, because of the non-IID distributed nature of the data, as well as the inherent differences in the federating learning approach as against the typical centralized gradient descent techniques. We empirically demonstrate the effect of number of samples per device and the distribution of output labels on federated learning. In addition to the privacy advantage we seek through federated learning, we also study if there is a cost advantage while using federated learning frameworks. We show that federated learning does have an advantage in cost when the model sizes to be trained are not reasonably large. All in all, we present the need for careful design of model for both performance and cost.  ( 2 min )
    Data Heterogeneity-Robust Federated Learning via Group Client Selection in Industrial IoT. (arXiv:2202.01512v1 [cs.LG])
    Nowadays, the industrial Internet of Things (IIoT) has played an integral role in Industry 4.0 and produced massive amounts of data for industrial intelligence. These data locate on decentralized devices in modern factories. To protect the confidentiality of industrial data, federated learning (FL) was introduced to collaboratively train shared machine learning models. However, the local data collected by different devices skew in class distribution and degrade industrial FL performance. This challenge has been widely studied at the mobile edge, but they ignored the rapidly changing streaming data and clustering nature of factory devices, and more seriously, they may threaten data security. In this paper, we propose FedGS, which is a hierarchical cloud-edge-end FL framework for 5G empowered industries, to improve industrial FL performance on non-i.i.d. data. Taking advantage of naturally clustered factory devices, FedGS uses a gradient-based binary permutation algorithm (GBP-CS) to select a subset of devices within each factory and build homogeneous super nodes participating in FL training. Then, we propose a compound-step synchronization protocol to coordinate the training process within and among these super nodes, which shows great robustness against data heterogeneity. The proposed methods are time-efficient and can adapt to dynamic environments, without exposing confidential industrial data in risky manipulation. We prove that FedGS has better convergence performance than FedAvg and give a relaxed condition under which FedGS is more communication-efficient. Extensive experiments show that FedGS improves accuracy by 3.5% and reduces training rounds by 59% on average, confirming its superior effectiveness and efficiency on non-i.i.d. data.  ( 2 min )
    Impact Analysis of Harassment Against Women Using Association Rule Mining Approaches: Bangladesh Prospective. (arXiv:2202.01308v1 [cs.LG])
    In recent years, it has been noticed that women are making progress in every sector of society. Their involvement in every field, such as education, job market, social work, etc., is increasing at a remarkable rate. For the last several years, the government has been trying its level best for the advancement of women in every sector by doing several research work and activities and funding several organizations to motivate women. Although women's involvement in several fields is increasing, the big concern is they are facing several barriers in their advancement, and it is not surprising that sexual harassment is one of them. In Bangladesh, harassment against women, especially students, is a common phenomenon, and it is increasing. In this paper, a survey-based and Apriori algorithm are used to analyze the several impacts of harassment among several age groups. Also, several factors such as frequent impacts of harassment, most vulnerable groups, women mostly facing harassment, the alleged person behind harassment, etc., are analyzed through association rule mining of Apriori algorithm and F.P. Growth algorithm. And then, a comparison of performance between both algorithms has been shown briefly. For this analysis, data have been carefully collected from all ages.  ( 2 min )
    PRUNIX: Non-Ideality Aware Convolutional Neural Network Pruning for Memristive Accelerators. (arXiv:2202.01758v1 [cs.AR])
    In this work, PRUNIX, a framework for training and pruning convolutional neural networks is proposed for deployment on memristor crossbar based accelerators. PRUNIX takes into account the numerous non-ideal effects of memristor crossbars including weight quantization, state-drift, aging and stuck-at-faults. PRUNIX utilises a novel Group Sawtooth Regularization intended to improve non-ideality tolerance as well as sparsity, and a novel Adaptive Pruning Algorithm (APA) intended to minimise accuracy loss by considering the sensitivity of different layers of a CNN to pruning. We compare our regularization and pruning methods with other standards on multiple CNN architectures, and observe an improvement of 13% test accuracy when quantization and other non-ideal effects are accounted for with an overall sparsity of 85%, which is similar to other methods  ( 2 min )
    Systems Biology: Identifiability analysis and parameter identification via systems-biology informed neural networks. (arXiv:2202.01723v1 [q-bio.QM])
    The dynamics of systems biological processes are usually modeled by a system of ordinary differential equations (ODEs) with many unknown parameters that need to be inferred from noisy and sparse measurements. Here, we introduce systems-biology informed neural networks for parameter estimation by incorporating the system of ODEs into the neural networks. To complete the workflow of system identification, we also describe structural and practical identifiability analysis to analyze the identifiability of parameters. We use the ultridian endocrine model for glucose-insulin interaction as the example to demonstrate all these methods and their implementation.  ( 2 min )
    Concept Bottleneck Model with Additional Unsupervised Concepts. (arXiv:2202.01459v1 [cs.CV])
    With the increasing demands for accountability, interpretability is becoming an essential capability for real-world AI applications. However, most methods utilize post-hoc approaches rather than training the interpretable model. In this article, we propose a novel interpretable model based on the concept bottleneck model (CBM). CBM uses concept labels to train an intermediate layer as the additional visible layer. However, because the number of concept labels restricts the dimension of this layer, it is difficult to obtain high accuracy with a small number of labels. To address this issue, we integrate supervised concepts with unsupervised ones trained with self-explaining neural networks (SENNs). By seamlessly training these two types of concepts while reducing the amount of computation, we can obtain both supervised and unsupervised concepts simultaneously, even for large-sized images. We refer to the proposed model as the concept bottleneck model with additional unsupervised concepts (CBM-AUC). We experimentally confirmed that the proposed model outperformed CBM and SENN. We also visualized the saliency map of each concept and confirmed that it was consistent with the semantic meanings.  ( 2 min )
    Review of automated time series forecasting pipelines. (arXiv:2202.01712v1 [cs.LG])
    Time series forecasting is fundamental for various use cases in different domains such as energy systems and economics. Creating a forecasting model for a specific use case requires an iterative and complex design process. The typical design process includes the five sections (1) data pre-processing, (2) feature engineering, (3) hyperparameter optimization, (4) forecasting method selection, and (5) forecast ensembling, which are commonly organized in a pipeline structure. One promising approach to handle the ever-growing demand for time series forecasts is automating this design process. The present paper, thus, analyzes the existing literature on automated time series forecasting pipelines to investigate how to automate the design process of forecasting models. Thereby, we consider both Automated Machine Learning (AutoML) and automated statistical forecasting methods in a single forecasting pipeline. For this purpose, we firstly present and compare the proposed automation methods for each pipeline section. Secondly, we analyze the automation methods regarding their interaction, combination, and coverage of the five pipeline sections. For both, we discuss the literature, identify problems, give recommendations, and suggest future research. This review reveals that the majority of papers only cover two or three of the five pipeline sections. We conclude that future research has to holistically consider the automation of the forecasting pipeline to enable the large-scale application of time series forecasting.  ( 2 min )
    Sequential Learning of the Topological Ordering for the Linear Non-Gaussian Acyclic Model with Parametric Noise. (arXiv:2202.01748v1 [stat.ME])
    Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions, it is sometimes possible to identify an exact causal Directed Acyclic Graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is on one such case: a linear structural equation model with non-Gaussian noise, a model known as the Linear Non-Gaussian Acyclic Model (LiNGAM). Given a specified parametric noise model, we develop a novel sequential approach to estimate the causal ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying causal DAG. We provide extensive numerical evidence to demonstrate that our sequential procedure is scalable to cases with possibly thousands of nodes and works well for high-dimensional data. We also conduct an application to a single-cell gene expression dataset to demonstrate our estimation procedure.  ( 2 min )
    Optimized Potential Initialization for Low-latency Spiking Neural Networks. (arXiv:2202.01440v1 [cs.NE])
    Spiking Neural Networks (SNNs) have been attached great importance due to the distinctive properties of low power consumption, biological plausibility, and adversarial robustness. The most effective way to train deep SNNs is through ANN-to-SNN conversion, which have yielded the best performance in deep network structure and large-scale datasets. However, there is a trade-off between accuracy and latency. In order to achieve high precision as original ANNs, a long simulation time is needed to match the firing rate of a spiking neuron with the activation value of an analog neuron, which impedes the practical application of SNN. In this paper, we aim to achieve high-performance converted SNNs with extremely low latency (fewer than 32 time-steps). We start by theoretically analyzing ANN-to-SNN conversion and show that scaling the thresholds does play a similar role as weight normalization. Instead of introducing constraints that facilitate ANN-to-SNN conversion at the cost of model capacity, we applied a more direct way by optimizing the initial membrane potential to reduce the conversion loss in each layer. Besides, we demonstrate that optimal initialization of membrane potentials can implement expected error-free ANN-to-SNN conversion. We evaluate our algorithm on the CIFAR-10, CIFAR-100 and ImageNet datasets and achieve state-of-the-art accuracy, using fewer time-steps. For example, we reach top-1 accuracy of 93.38\% on CIFAR-10 with 16 time-steps. Moreover, our method can be applied to other ANN-SNN conversion methodologies and remarkably promote performance when the time-steps is small.  ( 2 min )
    mSLAM: Massively multilingual joint pre-training for speech and text. (arXiv:2202.01374v1 [cs.CL])
    We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.  ( 2 min )
    Gradient estimators for normalising flows. (arXiv:2202.01314v1 [stat.ML])
    Recently a machine learning approach to Monte-Carlo simulations called Neural Markov Chain Monte-Carlo (NMCMC) is gaining traction. In its most popular form it uses the neural networks to construct normalizing flows which are then trained to approximate the desired target distribution. As this distribution is usually defined via a Hamiltonian or action, the standard learning algorithm requires estimation of the action gradient with respect to the fields. In this contribution we present another gradient estimator (and the corresponding [PyTorch implementation) that avoids this calculation, thus potentially speeding up training for models with more complicated actions. We also study the statistical properties of several gradient estimators and show that our formulation leads to better training results.  ( 2 min )
    Sampling Permutations for Shapley Value Estimation. (arXiv:2104.12199v2 [stat.ML] UPDATED)
    Game-theoretic attribution techniques based on Shapley values are used to interpret black-box machine learning models, but their exact calculation is generally NP-hard, requiring approximation methods for non-trivial models. As the computation of Shapley values can be expressed as a summation over a set of permutations, a common approach is to sample a subset of these permutations for approximation. Unfortunately, standard Monte Carlo sampling methods can exhibit slow convergence, and more sophisticated quasi-Monte Carlo methods have not yet been applied to the space of permutations. To address this, we investigate new approaches based on two classes of approximation methods and compare them empirically. First, we demonstrate quadrature techniques in a RKHS containing functions of permutations, using the Mallows kernel in combination with kernel herding and sequential Bayesian quadrature. The RKHS perspective also leads to quasi-Monte Carlo type error bounds, with a tractable discrepancy measure defined on permutations. Second, we exploit connections between the hypersphere $\mathbb{S}^{d-2}$ and permutations to create practical algorithms for generating permutation samples with good properties. Experiments show the above techniques provide significant improvements for Shapley value estimates over existing methods, converging to a smaller RMSE in the same number of model evaluations.  ( 2 min )
    A unified surrogate-based scheme for black-box and preference-based optimization. (arXiv:2202.01468v1 [math.OC])
    Black-box and preference-based optimization algorithms are global optimization procedures that aim to find the global solutions of an optimization problem using, respectively, the least amount of function evaluations or sample comparisons as possible. In the black-box case, the analytical expression of the objective function is unknown and it can only be evaluated through a (costly) computer simulation or an experiment. In the preference-based case, the objective function is still unknown but it corresponds to the subjective criterion of an individual. So, it is not possible to quantify such criterion in a reliable and consistent way. Therefore, preference-based optimization algorithms seek global solutions using only comparisons between couples of different samples, for which a human decision-maker indicates which of the two is preferred. Quite often, the black-box and preference-based frameworks are covered separately and are handled using different techniques. In this paper, we show that black-box and preference-based optimization problems are closely related and can be solved using the same family of approaches, namely surrogate-based methods. Moreover, we propose the generalized Metric Response Surface (gMRS) algorithm, an optimization scheme that is a generalization of the popular MSRS framework. Finally, we provide a convergence proof for the proposed optimization method.  ( 2 min )
    Efficient Memory Partitioning in Software Defined Hardware. (arXiv:2202.01261v1 [cs.AR])
    As programmers turn to software-defined hardware (SDH) to maintain a high level of productivity while programming hardware to run complex algorithms, heavy-lifting must be done by the compiler to automatically partition on-chip arrays. In this paper, we introduce an automatic memory partitioning system that can quickly compute more efficient partitioning schemes than prior systems. Our system employs a variety of resource-saving optimizations and an ML cost model to select the best partitioning scheme from an array of candidates. We compared our system against various state-of-the-art SDH compilers and FPGAs on a variety of benchmarks and found that our system generates solutions that, on average, consume 40.3% fewer logic resources, 78.3% fewer FFs, 54.9% fewer Block RAMs (BRAMs), and 100% fewer DSPs.  ( 2 min )
    Causal Inference Through the Structural Causal Marginal Problem. (arXiv:2202.01300v1 [cs.AI])
    We introduce an approach to counterfactual inference based on merging information from multiple datasets. We consider a causal reformulation of the statistical marginal problem: given a collection of marginal structural causal models (SCMs) over distinct but overlapping sets of variables, determine the set of joint SCMs that are counterfactually consistent with the marginal ones. We formalise this approach for categorical SCMs using the response function formulation and show that it reduces the space of allowed marginal and joint SCMs. Our work thus highlights a new mode of falsifiability through additional variables, in contrast to the statistical one via additional data.  ( 2 min )
    Directed hypergraph neural network. (arXiv:2008.03626v2 [stat.ML] UPDATED)
    To deal with irregular data structure, graph convolution neural networks have been developed by a lot of data scientists. However, data scientists just have concentrated primarily on developing deep neural network method for un-directed graph. In this paper, we will present the novel neural network method for directed hypergraph. In the other words, we will develop not only the novel directed hypergraph neural network method but also the novel directed hypergraph based semi-supervised learning method. These methods are employed to solve the node classification task. The two datasets that are used in the experiments are the cora and the citeseer datasets. Among the classic directed graph based semi-supervised learning method, the novel directed hypergraph based semi-supervised learning method, the novel directed hypergraph neural network method that are utilized to solve this node classification task, we recognize that the novel directed hypergraph neural network achieves the highest accuracies.  ( 2 min )
    Predicting Dynamic Stability of Power Grids using Graph Neural Networks. (arXiv:2108.08230v2 [physics.soc-ph] UPDATED)
    The prediction of dynamical stability of power grids becomes more important and challenging with increasing shares of renewable energy sources due to their decentralized structure, reduced inertia and volatility. We investigate the feasibility of applying graph neural networks (GNN) to predict dynamic stability of synchronisation in complex power grids using the single-node basin stability (SNBS) as a measure. To do so, we generate two synthetic datasets for grids with 20 and 100 nodes respectively and estimate SNBS using Monte-Carlo sampling. Those datasets are used to train and evaluate the performance of eight different GNN-models. All models use the full graph without simplifications as input and predict SNBS in a nodal-regression-setup. We show that SNBS can be predicted in general and the performance significantly changes using different GNN-models. Furthermore, we observe interesting transfer capabilities of our approach: GNN-models trained on smaller grids can directly be applied on larger grids without the need of retraining.  ( 2 min )
    Minimax rate of consistency for linear models with missing values. (arXiv:2202.01463v1 [stat.ML])
    Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern. This eventually requires to solve a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets. First, we propose a rigorous setting to analyze a least-square type estimator and establish a bound on the excess risk which increases exponentially in the dimension. Consequently, we leverage the missing data distribution to propose a new algorithm, andderive associated adaptive risk bounds that turn out to be minimax optimal. Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values.  ( 2 min )
    Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization. (arXiv:2202.01334v1 [cs.LG])
    Vector Quantization (VQ) is a method for discretizing latent representations and has become a major part of the deep learning toolkit. It has been theoretically and empirically shown that discretization of representations leads to improved generalization, including in reinforcement learning where discretization can be used to bottleneck multi-agent communication to promote agent specialization and robustness. The discretization tightness of most VQ-based methods is defined by the number of discrete codes in the representation vector and the codebook size, which are fixed as hyperparameters. In this work, we propose learning to dynamically select discretization tightness conditioned on inputs, based on the hypothesis that data naturally contains variations in complexity that call for different levels of representational coarseness. We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.  ( 2 min )
    DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization. (arXiv:2202.01268v1 [cs.LG])
    We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems. When the local functions at the nodes have a finite-sum or an expectation form, our new methods, DASHA-PAGE and DASHA-SYNC-MVR, improve the theoretical oracle and communication complexity of the previous state-of-the-art method MARINA by Gorbunov et al. (2020). In particular, to achieve an epsilon-stationary point, and considering the random sparsifier RandK as an example, our methods compute the optimal number of gradients $\mathcal{O}\left(\frac{\sqrt{m}}{\varepsilon\sqrt{n}}\right)$ and $\mathcal{O}\left(\frac{\sigma}{\varepsilon^{3/2}n}\right)$ in finite-sum and expectation form cases, respectively, while maintaining the SOTA communication complexity $\mathcal{O}\left(\frac{d}{\varepsilon \sqrt{n}}\right)$. Furthermore, unlike MARINA, the new methods DASHA, DASHA-PAGE and DASHA-MVR send compressed vectors only and never synchronize the nodes, which makes them more practical for federated learning. We extend our results to the case when the functions satisfy the Polyak-Lojasiewicz condition. Finally, our theory is corroborated in practice: we see a significant improvement in experiments with nonconvex classification and training of deep learning models.  ( 2 min )
    Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference. (arXiv:2202.01243v1 [stat.ML])
    A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (c.f., deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models are more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model. We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model with Gaussian data that the membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior. Finally, we study different methods for mitigating such attacks in the overparameterized regime, such as noise addition and regularization, and conclude that simply reducing the parameters of an overparameterized model is an effective strategy to protect it from membership inference without greatly decreasing its generalization error.  ( 2 min )
    Deep Learning for Epidemiologists: An Introduction to Neural Networks. (arXiv:2202.01319v1 [cs.LG])
    Deep learning methods are increasingly being applied to problems in medicine and healthcare. However, few epidemiologists have received formal training in these methods. To bridge this gap, this article introduces to the fundamentals of deep learning from an epidemiological perspective. Specifically, this article reviews core concepts in machine learning (overfitting, regularization, hyperparameters), explains several fundamental deep learning architectures (convolutional neural networks, recurrent neural networks), and summarizes training, evaluation, and deployment of models. We aim to enable the reader to engage with and critically evaluate medical applications of deep learning, facilitating a dialogue between computer scientists and epidemiologists that will improve the safety and efficacy of applications of this technology.  ( 2 min )
    JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension. (arXiv:2202.01764v1 [cs.CL])
    Question Answering (QA) is a task in which a machine understands a given document and a question to find an answer. Despite impressive progress in the NLP area, QA is still a challenging problem, especially for non-English languages due to the lack of annotated datasets. In this paper, we present the Japanese Question Answering Dataset, JaQuAD, which is annotated by humans. JaQuAD consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles. We finetuned a baseline model which achieves 78.92% for F1 score and 63.38% for EM on test set. The dataset and our experiments are available at https://github.com/SkelterLabsInc/JaQuAD.  ( 2 min )
  • Open

    How many degrees of freedom do we need to train deep networks: a loss landscape perspective. (arXiv:2107.05802v2 [cs.LG] UPDATED)
    A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We analyze this phenomenon for random subspaces by first examining the success probability of hitting a training loss sub-level set when training within a random subspace of a given training dimensionality. We find a sharp phase transition in the success probability from $0$ to $1$ as the training dimension surpasses a threshold. This threshold training dimension increases as the desired final loss decreases, but decreases as the initial loss decreases. We then theoretically explain the origin of this phase transition, and its dependence on initialization and final desired loss, in terms of properties of the high-dimensional geometry of the loss landscape. In particular, we show via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large. In several architectures and datasets, we measure the threshold training dimension as a function of initialization and demonstrate that it is a small fraction of the total parameters, implying by our theory that successful training with so few dimensions is possible precisely because the Gaussian width of low loss sub-level sets is very large. Moreover, we compare this threshold training dimension to more sophisticated ways of reducing training degrees of freedom, including lottery tickets as well as a new, analogous method: lottery subspaces. Code is available at https://github.com/ganguli-lab/degrees-of-freedom.  ( 2 min )
    Gradient estimators for normalising flows. (arXiv:2202.01314v1 [stat.ML])
    Recently a machine learning approach to Monte-Carlo simulations called Neural Markov Chain Monte-Carlo (NMCMC) is gaining traction. In its most popular form it uses the neural networks to construct normalizing flows which are then trained to approximate the desired target distribution. As this distribution is usually defined via a Hamiltonian or action, the standard learning algorithm requires estimation of the action gradient with respect to the fields. In this contribution we present another gradient estimator (and the corresponding [PyTorch implementation) that avoids this calculation, thus potentially speeding up training for models with more complicated actions. We also study the statistical properties of several gradient estimators and show that our formulation leads to better training results.  ( 2 min )
    Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness. (arXiv:2106.08161v2 [stat.ML] UPDATED)
    Learning meaningful representations of data that can address challenges such as batch effect correction and counterfactual inference is a central problem in many domains including computational biology. Adopting a Conditional VAE framework, we show that marginal independence between the representation and a condition variable plays a key role in both of these challenges. We propose the Contrastive Mixture of Posteriors (CoMP) method that uses a novel misalignment penalty defined in terms of mixtures of the variational posteriors to enforce this independence in latent space. We show that CoMP has attractive theoretical properties compared to previous approaches and we prove counterfactual identifiability of CoMP under additional assumptions. We demonstrate state of the art performance on a set of challenging tasks including aligning human tumour samples with cancer cell-lines, predicting transcriptome-level perturbation responses, and batch correction on single-cell RNA sequencing data. We also find parallels to fair representation learning and demonstrate that CoMP is competitive on a common task in the field.  ( 2 min )
    PageRank algorithm for Directed Hypergraph. (arXiv:1909.01132v2 [cs.SI] UPDATED)
    During the last two decades, we easilly see that the World Wide Web's link structure is modeled as the directed graph. In this paper, we will model the World Wide Web's link structure as the directed hypergraph. Moreover, we will develop the PageRank algorithm for this directed hypergraph. Due to the lack of the World Wide Web directed hypergraph datasets, we will apply the PageRank algorithm to the metabolic network which is the directed hypergraph itself. The experiments show that our novel PageRank algorithm is successfully applied to this metabolic network.  ( 2 min )
    A Stochastic Alternating Balance $k$-Means Algorithm for Fair Clustering. (arXiv:2105.14172v2 [cs.LG] UPDATED)
    In the application of data clustering to human-centric decision-making systems, such as loan applications and advertisement recommendations, the clustering outcome might discriminate against people across different demographic groups, leading to unfairness. A natural conflict occurs between the cost of clustering (in terms of distance to cluster centers) and the balance representation of all demographic groups across the clusters, leading to a bi-objective optimization problem that is nonconvex and nonsmooth. To determine the complete trade-off between these two competing goals, we design a novel stochastic alternating balance fair $k$-means (SAfairKM) algorithm, which consists of alternating classical mini-batch $k$-means updates and group swap updates. The number of $k$-means updates and the number of swap updates essentially parameterize the weight put on optimizing each objective function. Our numerical experiments show that the proposed SAfairKM algorithm is robust and computationally efficient in constructing well-spread and high-quality Pareto fronts both on synthetic and real datasets.  ( 2 min )
    Modeling Advection on Directed Graphs using Mat\'ern Gaussian Processes for Traffic Flow. (arXiv:2201.00001v2 [math.NA] UPDATED)
    The transport of traffic flow can be modeled by the advection equation. Finite difference and finite volumes methods have been used to numerically solve this hyperbolic equation on a mesh. Advection has also been modeled discretely on directed graphs using the graph advection operator [4, 18]. In this paper, we first show that we can reformulate this graph advection operator as a finite difference scheme. We then propose the Directed Graph Advection Mat\'ern Gaussian Process (DGAMGP) model that incorporates the dynamics of this graph advection operator into the kernel of a trainable Mat\'ern Gaussian Process to effectively model traffic flow and its uncertainty as an advective process on a directed graph.  ( 2 min )
    Unified theory of atom-centered representations and graph convolutional machine-learning schemes. (arXiv:2202.01566v1 [stat.ML])
    Data-driven schemes that associate molecular and crystal structures with their microscopic properties share the need for a concise, effective description of the arrangement of their atomic constituents. Many types of models rely on descriptions of atom-centered environments, that are associated with an atomic property or with an atomic contribution to an extensive macroscopic quantity. Frameworks in this class can be understood in terms of atom-centered density correlations (ACDC), that are used as a basis for a body-ordered, symmetry-adapted expansion of the targets. Several other schemes, that gather information on the relationship between neighboring atoms using graph-convolutional (or message-passing) ideas, cannot be directly mapped to correlations centered around a single atom. We generalize the ACDC framework to include multi-centered information, generating representations that provide a complete linear basis to regress symmetric functions of atomic coordinates, and form the basis to systematize our understanding of both atom-centered and graph-convolutional machine-learning schemes.  ( 2 min )
    Multirate Training of Neural Networks. (arXiv:2106.10771v2 [cs.LG] UPDATED)
    We propose multirate training of neural networks: partitioning neural network parameters into "fast" and "slow" parts which are trained on different time scales. By choosing appropriate partitionings we can obtain substantial computational speed-up for transfer learning tasks. We show for applications in vision and NLP that we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting models. We analyze the convergence properties of our multirate scheme and draw a comparison with vanilla SGD. We also discuss splitting choices for the neural network parameters which could enhance generalization performance when neural networks are trained from scratch. A multirate approach can be used to learn different features present in the data and as a form of regularization. Our paper unlocks the potential of using multirate techniques for neural network training and provides several starting points for future work in this area.  ( 2 min )
    Improving Screening Processes via Calibrated Subset Selection. (arXiv:2202.01147v1 [cs.LG] CROSS LISTED)
    Many selection processes such as finding patients qualifying for a medical trial or retrieval pipelines in search engines consist of multiple stages, where an initial screening stage focuses the resources on shortlisting the most promising candidates. In this paper, we investigate what guarantees a screening classifier can provide, independently of whether it is constructed manually or trained. We find that current solutions do not enjoy distribution-free theoretical guarantees -- we show that, in general, even for a perfectly calibrated classifier, there always exist specific pools of candidates for which its shortlist is suboptimal. Then, we develop a distribution-free screening algorithm -- called Calibrated Subset Selection (CSS) -- that, given any classifier and some amount of calibration data, finds near-optimal shortlists of candidates that contain a desired number of qualified candidates in expectation. Moreover, we show that a variant of our algorithm that calibrates a given classifier multiple times across specific groups can create shortlists with provable diversity guarantees. Experiments on US Census survey data validate our theoretical results and show that the shortlists provided by our algorithm are superior to those provided by several competitive baselines.  ( 2 min )
    Assessing External Validity Over Worst-case Subpopulations. (arXiv:2007.02411v3 [stat.ML] UPDATED)
    Study populations are typically sampled from limited points in space and time, and marginalized groups are underrepresented. To assess the external validity of randomized and observational studies, we propose and evaluate the worst-case treatment effect (WTE) across all subpopulations of a given size, which guarantees positive findings remain valid over subpopulations. We develop a semiparametrically efficient estimator for the WTE that analyzes the external validity of the augmented inverse propensity weighted estimator for the average treatment effect. Our cross-fitting procedure leverages flexible nonparametric and machine learning-based estimates of nuisance parameters and is a regular root-$n$ estimator even when nuisance estimates converge more slowly. On real examples where external validity is of core concern, our proposed framework guards against brittle findings that are invalidated by unanticipated population shifts.  ( 2 min )
    On Manifold Hypothesis: Hypersurface Submanifold Embedding Using Osculating Hyperspheres. (arXiv:2202.01619v1 [cs.LG])
    Consider a set of $n$ data points in the Euclidean space $\mathbb{R}^d$. This set is called dataset in machine learning and data science. Manifold hypothesis states that the dataset lies on a low-dimensional submanifold with high probability. All dimensionality reduction and manifold learning methods have the assumption of manifold hypothesis. In this paper, we show that the dataset lies on an embedded hypersurface submanifold which is locally $(d-1)$-dimensional. Hence, we show that the manifold hypothesis holds at least for the embedding dimensionality $d-1$. Using an induction in a pyramid structure, we also extend the embedding dimensionality to lower embedding dimensionalities to show the validity of manifold hypothesis for embedding dimensionalities $\{1, 2, \dots, d-1\}$. For embedding the hypersurface, we first construct the $d$ nearest neighbors graph for data. For every point, we fit an osculating hypersphere $S^{d-1}$ using its neighbors where this hypersphere is osculating to a hypothetical hypersurface. Then, using surgery theory, we apply surgery on the osculating hyperspheres to obtain $n$ hyper-caps. We connect the hyper-caps to one another using partial hyper-cylinders. By connecting all parts, the embedded hypersurface is obtained as the disjoint union of these elements. We discuss the geometrical characteristics of the embedded hypersurface, such as having boundary, its topology, smoothness, boundedness, orientability, compactness, and injectivity. Some discussion are also provided for the linearity and structure of data. This paper is the intersection of several fields of science including machine learning, differential geometry, and algebraic topology.  ( 2 min )
    What can linear interpolation of neural network loss landscapes tell us?. (arXiv:2106.16004v2 [cs.LG] UPDATED)
    Studying neural network loss landscapes provides insights into the nature of the underlying optimization problems. Unfortunately, loss landscapes are notoriously difficult to visualize in a human-comprehensible fashion. One common way to address this problem is to plot linear slices of the landscape, for example from the initial state of the network to the final state after optimization. On the basis of this analysis, prior work has drawn broader conclusions about the difficulty of the optimization problem. In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices. Further, we use linear interpolation to study the role played by individual layers and substructures of the network. We find that certain layers are more sensitive to the choice of initialization, but that the shape of the linear path is not indicative of the changes in test accuracy of the model. Our results cast doubt on the broader intuition that the presence or absence of barriers when interpolating necessarily relates to the success of optimization.  ( 2 min )
    Non-Vacuous Generalisation Bounds for Shallow Neural Networks. (arXiv:2202.01627v1 [cs.LG])
    We focus on a specific class of shallow neural networks with a single hidden layer, namely those with $L_2$-normalised data and either a sigmoid-shaped Gaussian error function ("erf") activation or a Gaussian Error Linear Unit (GELU) activation. For these networks, we derive new generalisation bounds through the PAC-Bayesian theory; unlike most existing such bounds they apply to neural networks with deterministic rather than randomised parameters. Our bounds are empirically non-vacuous when the network is trained with vanilla stochastic gradient descent on MNIST and Fashion-MNIST.  ( 2 min )
    Selection in the Presence of Implicit Bias: The Advantage of Intersectional Constraints. (arXiv:2202.01661v1 [cs.CY])
    In selection processes such as hiring, promotion, and college admissions, implicit bias toward socially-salient attributes such as race, gender, or sexual orientation of candidates is known to produce persistent inequality and reduce aggregate utility for the decision maker. Interventions such as the Rooney Rule and its generalizations, which require the decision maker to select at least a specified number of individuals from each affected group, have been proposed to mitigate the adverse effects of implicit bias in selection. Recent works have established that such lower-bound constraints can be very effective in improving aggregate utility in the case when each individual belongs to at most one affected group. However, in several settings, individuals may belong to multiple affected groups and, consequently, face more extreme implicit bias due to this intersectionality. We consider independently drawn utilities and show that, in the intersectional case, the aforementioned non-intersectional constraints can only recover part of the total utility achievable in the absence of implicit bias. On the other hand, we show that if one includes appropriate lower-bound constraints on the intersections, almost all the utility achievable in the absence of implicit bias can be recovered. Thus, intersectional constraints can offer a significant advantage over a reductionist dimension-by-dimension non-intersectional approach to reducing inequality.  ( 2 min )
    NoisyMix: Boosting Robustness by Combining Data Augmentations, Stability Training, and Noise Injections. (arXiv:2202.01263v1 [cs.LG])
    For many real-world applications, obtaining stable and robust statistical performance is more important than simply achieving state-of-the-art predictive test accuracy, and thus robustness of neural networks is an increasingly important topic. Relatedly, data augmentation schemes have been shown to improve robustness with respect to input perturbations and domain shifts. Motivated by this, we introduce NoisyMix, a training scheme that combines data augmentations with stability training and noise injections to improve both model robustness and in-domain accuracy. This combination promotes models that are consistently more robust and that provide well-calibrated estimates of class membership probabilities. We demonstrate the benefits of NoisyMix on a range of benchmark datasets, including ImageNet-C, ImageNet-R, and ImageNet-P. Moreover, we provide theory to understand implicit regularization and robustness of NoisyMix.  ( 2 min )
    Text classification problems via BERT embedding method and graph convolutional neural network. (arXiv:2111.15379v2 [cs.CL] UPDATED)
    This paper presents the novel way combining the BERT embedding method and the graph convolutional neural network. This combination is employed to solve the text classification problem. Initially, we apply the BERT embedding method to the texts (in the BBC news dataset and the IMDB movie reviews dataset) in order to transform all the texts to numerical vector. Then, the graph convolutional neural network will be applied to these numerical vectors to classify these texts into their ap-propriate classes/labels. Experiments show that the performance of the graph convolutional neural network model is better than the perfor-mances of the combination of the BERT embedding method with clas-sical machine learning models.  ( 2 min )
    Efficient learning of hidden state LTI state space models of unknown order. (arXiv:2202.01625v1 [math.ST])
    The aim of this paper is to address two related estimation problems arising in the setup of hidden state linear time invariant (LTI) state space systems when the dimension of the hidden state is unknown. Namely, the estimation of any finite number of the system's Markov parameters and the estimation of a minimal realization for the system, both from the partial observation of a single trajectory. For both problems, we provide statistical guarantees in the form of various estimation error upper bounds, $\rank$ recovery conditions, and sample complexity estimates. Specifically, we first show that the low $\rank$ solution of the Hankel penalized least square estimator satisfies an estimation error in $S_p$-norms for $p \in [1,2]$ that captures the effect of the system order better than the existing operator norm upper bound for the simple least square. We then provide a stability analysis for an estimation procedure based on a variant of the Ho-Kalman algorithm that improves both the dependence on the dimension and the least singular value of the Hankel matrix of the Markov parameters. Finally, we propose an estimation algorithm for the minimal realization that uses both the Hankel penalized least square estimator and the Ho-Kalman based estimation procedure and guarantees with high probability that we recover the correct order of the system and satisfies a new fast rate in the $S_2$-norm with a polynomial reduction in the dependence on the dimension and other parameters of the problem.  ( 2 min )
    Noise-robust classification with hypergraph neural network. (arXiv:2102.01934v2 [stat.ML] UPDATED)
    This paper presents a novel version of the hypergraph neural network method. This method is utilized to solve the noisy label learning problem. First, we apply the PCA dimensional reduction technique to the feature matrices of the image datasets in order to reduce the "noise" and the redundant features in the feature matrices of the image datasets and to reduce the runtime constructing the hypergraph of the hypergraph neural network method. Then, the classic graph-based semi-supervised learning method, the classic hypergraph based semi-supervised learning method, the graph neural network, the hypergraph neural network, and our proposed hypergraph neural network are employed to solve the noisy label learning problem. The accuracies of these five methods are evaluated and compared. Experimental results show that the hypergraph neural network methods achieve the best performance when the noise level increases. Moreover, the hypergraph neural network methods are at least as good as the graph neural network.  ( 2 min )
    Hierarchical Bayesian Mixture Models for Time Series Using Context Trees as State Space Partitions. (arXiv:2106.03023v2 [stat.ME] UPDATED)
    A general Bayesian framework is introduced for mixture modelling and inference with real-valued time series. At the top level, the state space is partitioned via the choice of a discrete context tree, so that the resulting partition depends on the values of some of the most recent samples. At the bottom level, a different model is associated with each region of the partition. This defines a very rich and flexible class of mixture models, for which we provide algorithms that allow for efficient, exact Bayesian inference. In particular, we show that the maximum a posteriori probability (MAP) model (including the relevant MAP context tree partition) can be precisely identified, along with its exact posterior probability. The utility of this general framework is illustrated in detail when a different autoregressive (AR) model is used in each state-space region, resulting in a mixture-of-AR model class. The performance of the associated algorithmic tools is demonstrated in the problems of model selection and forecasting on both simulated and real-world data, where they are found to provide results as good or better than state-of-the-art methods.  ( 2 min )
    Scaling Structured Inference with Randomization. (arXiv:2112.03638v2 [cs.LG] UPDATED)
    Deep discrete structured models have seen considerable progress recently, but traditional inference using dynamic programming (DP) typically works with a small number of states (less than hundreds), which severely limits model capacity. At the same time, across machine learning, there is a recent trend of using randomized truncation techniques to accelerate computations involving large sums. Here, we propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states. Our method is widely applicable to classical DP-based inference (partition, marginal, reparameterization, entropy) and different graph structures (chains, trees, and more general hypergraphs). It is also compatible with automatic differentiation: it can be integrated with neural networks seamlessly and learned with gradient-based optimizers. Our core technique approximates the sum-product by restricting and reweighting DP on a small subset of nodes, which reduces computation by orders of magnitude. We further achieve low bias and variance via Rao-Blackwellization and importance sampling. Experiments over different graphs demonstrate the accuracy and efficiency of our approach. Furthermore, when using RDP for training a structured variational autoencoder with a scaled inference network, we achieve better test likelihood than baselines and successfully prevent posterior collapse  ( 2 min )
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v2 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug in extension to existing methods for GP posterior sampling and Bayesian optimization.  ( 2 min )
    Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements. (arXiv:2108.13657v2 [stat.ME] UPDATED)
    Traditionally, spline or kernel approaches in combination with parametric estimation are used to infer the linear coefficient (fixed effects) in a partially linear mixed-effects model for repeated measurements. Using machine learning algorithms allows us to incorporate complex interaction structures and high-dimensional variables. We employ double machine learning to cope with the nonparametric part of the partially linear mixed-effects model: the nonlinear variables are regressed out nonparametrically from both the linear variables and the response. This adjustment can be performed with any machine learning algorithm, for instance random forests, which allows to take complex interaction terms and nonsmooth structures into account. The adjusted variables satisfy a linear mixed-effects model, where the linear coefficient can be estimated with standard linear mixed-effects techniques. We prove that the estimated fixed effects coefficient converges at the parametric rate, is asymptotically Gaussian distributed, and semiparametrically efficient. Two simulation studies demonstrate that our method outperforms a penalized regression spline approach in terms of coverage. We also illustrate our proposed approach on a longitudinal dataset with HIV-infected individuals. Software code for our method is available in the R-package dmlalg.  ( 2 min )
    Log-Euclidean Signatures for Intrinsic Distances Between Unaligned Datasets. (arXiv:2202.01671v1 [stat.ML])
    The need for efficiently comparing and representing datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets. We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples. Existing methods typically compare such operators in a pointwise manner or assume known data alignment. Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new theoretically-motivated distance based on a lower bound of the log-Euclidean metric. Our framework facilitates comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities. Our log-Euclidean signature (LES) distance recovers meaningful structural differences, outperforming competing methods in various application domains.  ( 2 min )
    Equality Is Not Equity: Proportional Fairness in Federated Learning. (arXiv:2202.01666v1 [cs.LG])
    Ensuring fairness of machine learning (ML) algorithms is becoming an increasingly important mission for ML service providers. This is even more critical and challenging in the federated learning (FL) scenario, given a large number of diverse participating clients. Simply mandating equality across clients could lead to many undesirable consequences, potentially discouraging high-performing clients and resulting in sub-optimal overall performance. In order to achieve better equity rather than equality, in this work, we introduce and study proportional fairness (PF) in FL, which has a deep connection with game theory. By viewing FL from a cooperative game perspective, where the players (clients) collaboratively learn a good model, we formulate PF as Nash bargaining solutions. Based on this concept, we propose PropFair, a novel and easy-to-implement algorithm for effectively finding PF solutions, and we prove its convergence properties. We illustrate through experiments that PropFair consistently improves the worst-case and the overall performances simultaneously over state-of-the-art fair FL algorithms for a wide array of vision and language datasets, thus achieving better equity.  ( 2 min )
    Fenrir: Physics-Enhanced Regression for Initial Value Problems. (arXiv:2202.01287v1 [cs.LG])
    We show how probabilistic numerics can be used to convert an initial value problem into a Gauss--Markov process parametrised by the dynamics of the initial value problem. Consequently, the often difficult problem of parameter estimation in ordinary differential equations is reduced to hyperparameter estimation in Gauss--Markov regression, which tends to be considerably easier. The method's relation and benefits in comparison to classical numerical integration and gradient matching approaches is elucidated. In particular, the method can, in contrast to gradient matching, handle partial observations, and has certain routes for escaping local optima not available to classical numerical integration. Experimental results demonstrate that the method is on par or moderately better than competing approaches.  ( 2 min )
    Generative Flow Networks for Discrete Probabilistic Modeling. (arXiv:2202.01361v1 [cs.LG])
    We present energy-based generative flow networks (EB-GFN), a novel probabilistic modeling algorithm for high-dimensional discrete data. Building upon the theory of generative flow networks (GFlowNets), we model the generation process by a stochastic data construction policy and thus amortize expensive MCMC exploration into a fixed number of actions sampled from a GFlowNet. We show how GFlowNets can approximately perform large-block Gibbs sampling to mix between modes. We propose a framework to jointly train a GFlowNet with an energy function, so that the GFlowNet learns to sample from the energy distribution, while the energy learns with an approximate MLE objective with negative samples from the GFlowNet. We demonstrate EB-GFN's effectiveness on various probabilistic modeling tasks.  ( 2 min )
    Global Optimization Networks. (arXiv:2202.01277v1 [stat.ML])
    We consider the problem of estimating a good maximizer of a black-box function given noisy examples. To solve such problems, we propose to fit a new type of function which we call a global optimization network (GON), defined as any composition of an invertible function and a unimodal function, whose unique global maximizer can be inferred in $\mathcal{O}(D)$ time. In this paper, we show how to construct invertible and unimodal functions by using linear inequality constraints on lattice models. We also extend to \emph{conditional} GONs that find a global maximizer conditioned on specified inputs of other dimensions. Experiments show the GON maximizers are statistically significantly better predictions than those produced by convex fits, GPR, or DNNs, and are more reasonable predictions for real-world problems.  ( 2 min )
    Sharpe Ratio Analysis in High Dimensions: Residual-Based Nodewise Regression in Factor Models. (arXiv:2002.01800v5 [q-fin.PM] UPDATED)
    We provide a new theory for nodewise regression when the residuals from a fitted factor model are used. We apply our results to the analysis of the consistency of Sharpe ratio estimators when there are many assets in a portfolio. We allow for an increasing number of assets as well as time observations of the portfolio. Since the nodewise regression is not feasible due to the unknown nature of idiosyncratic errors, we provide a feasible-residual-based nodewise regression to estimate the precision matrix of errors which is consistent even when number of assets, p, exceeds the time span of the portfolio, n. In another new development, we also show that the precision matrix of returns can be estimated consistently, even with an increasing number of factors and p>n. We show that: (1) with p>n, the Sharpe ratio estimators are consistent in global minimum-variance and mean-variance portfolios; and (2) with p>n, the maximum Sharpe ratio estimator is consistent when the portfolio weights sum to one; and (3) with p<<n, the maximum-out-of-sample Sharpe ratio estimator is consistent.  ( 2 min )
    Accounting for Unobserved Confounding in Domain Generalization. (arXiv:2007.10653v6 [stat.ML] UPDATED)
    This paper investigates the problem of learning robust, generalizable prediction models from a combination of multiple datasets and qualitative assumptions about the underlying data-generating model. Part of the challenge of learning robust models lies in the influence of unobserved confounders that void many of the invariances and principles of minimum error presently used for this problem. Our approach is to define a different invariance property of causal solutions in the presence of unobserved confounders which, through a relaxation of this invariance, can be connected with an explicit distributionally robust optimization problem over a set of affine combination of data distributions. Concretely, our objective takes the form of a standard loss, plus a regularization term that encourages partial equality of error derivatives with respect to model parameters. We demonstrate the empirical performance of our approach on healthcare data from different modalities, including image, speech and tabular data.  ( 2 min )
    The Terminating-Knockoff Filter: Fast High-Dimensional Variable Selection with False Discovery Rate Control. (arXiv:2110.06048v3 [stat.ME] UPDATED)
    We propose the Terminating-Knockoff (T-Knock) filter, a fast variable selection method for high-dimensional data. The T-Knock filter controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated knockoff predictors. A finite sample proof based on martingale theory for the FDR control property is provided. Numerical simulations show that the FDR is controlled at the target level while allowing for a high power. We prove under mild conditions that the knockoffs can be sampled from any univariate probability distribution with existing finite expectation and variance. The computational complexity of the proposed method is derived and it is demonstrated via numerical simulations that the sequential computation time is multiple orders of magnitude lower than that of the strongest benchmark methods in sparse high-dimensional settings. The T-Knock filter outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. An open source R package containing the implementation of the T-Knock filter is available at https://github.com/jasinmachkour/tknock.  ( 2 min )
    Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron: The Haar Orthogonal Case. (arXiv:2103.13466v4 [stat.ML] UPDATED)
    Free Probability Theory (FPT) provides rich knowledge for handling mathematical difficulties caused by random matrices that appear in research related to deep neural networks (DNNs), such as the dynamical isometry, Fisher information matrix, and training dynamics. FPT suits these researches because the DNN's parameter-Jacobian and input-Jacobian are polynomials of layerwise Jacobians. However, the critical assumption of asymptotic freenss of the layerwise Jacobian has not been proven completely so far. The asymptotic freeness assumption plays a fundamental role when propagating spectral distributions through the layers. Haar distributed orthogonal matrices are essential for achieving dynamical isometry. In this work, we prove asymptotic freeness of layerwise Jacobians of multilayer perceptron (MLP) in this case. A key of the proof is an invariance of the MLP. Considering the orthogonal matrices that fix the hidden units in each layer, we replace each layer's parameter matrix with itself multiplied by the orthogonal matrix, and then the MLP does not change. Furthermore, if the original weights are Haar orthogonal, the Jacobian is also unchanged by this replacement. Lastly, we can replace each weight with a Haar orthogonal random matrix independent of the Jacobian of the activation function using this key fact.  ( 2 min )
    Near-Optimal Learning of Extensive-Form Games with Imperfect Information. (arXiv:2202.01752v1 [cs.LG])
    This paper resolves the open question of designing near-optimal algorithms for learning imperfect-information extensive-form games from bandit feedback. We present the first line of algorithms that require only $\widetilde{\mathcal{O}}((XA+YB)/\varepsilon^2)$ episodes of play to find an $\varepsilon$-approximate Nash equilibrium in two-player zero-sum games, where $X,Y$ are the number of information sets and $A,B$ are the number of actions for the two players. This improves upon the best known sample complexity of $\widetilde{\mathcal{O}}((X^2A+Y^2B)/\varepsilon^2)$ by a factor of $\widetilde{\mathcal{O}}(\max\{X, Y\})$, and matches the information-theoretic lower bound up to logarithmic factors. We achieve this sample complexity by two new algorithms: Balanced Online Mirror Descent, and Balanced Counterfactual Regret Minimization. Both algorithms rely on novel approaches of integrating \emph{balanced exploration policies} into their classical counterparts. We also extend our results to learning Coarse Correlated Equilibria in multi-player general-sum games.  ( 2 min )
    Minimax rate of consistency for linear models with missing values. (arXiv:2202.01463v1 [stat.ML])
    Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern. This eventually requires to solve a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets. First, we propose a rigorous setting to analyze a least-square type estimator and establish a bound on the excess risk which increases exponentially in the dimension. Consequently, we leverage the missing data distribution to propose a new algorithm, andderive associated adaptive risk bounds that turn out to be minimax optimal. Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values.  ( 2 min )
    Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model. (arXiv:2202.01562v1 [stat.ML])
    In real-world recommender systems and search engines, optimizing ranking decisions to present a ranked list of relevant items is critical. Off-policy evaluation (OPE) for ranking policies is thus gaining a growing interest because it enables performance estimation of new ranking policies using only logged data. Although OPE in contextual bandits has been studied extensively, its naive application to the ranking setting faces a critical variance issue due to the huge item space. To tackle this problem, previous studies introduce some assumptions on user behavior to make the combinatorial item space tractable. However, an unrealistic assumption may, in turn, cause serious bias. Therefore, appropriately controlling the bias-variance tradeoff by imposing a reasonable assumption is the key for success in OPE of ranking policies. To achieve a well-balanced bias-variance tradeoff, we propose the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking. We show that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions. Furthermore, compared to a previous estimator based on the same cascade assumption, the proposed estimator reduces the variance by leveraging a control variate. Comprehensive experiments on both synthetic and real-world data demonstrate that our estimator leads to more accurate OPE than existing estimators in a variety of settings.  ( 2 min )
    Optimality and Stability in Non-Convex Smooth Games. (arXiv:2002.11875v3 [cs.LG] UPDATED)
    Convergence to a saddle point for convex-concave functions has been studied for decades, while recent years has seen a surge of interest in non-convex (zero-sum) smooth games, motivated by their recent wide applications. It remains an intriguing research challenge how local optimal points are defined and which algorithm can converge to such points. An interesting concept is known as the local minimax point, which strongly correlates with the widely-known gradient descent ascent algorithm. This paper aims to provide a comprehensive analysis of local minimax points, such as their relation with other solution concepts and their optimality conditions. We find that local saddle points can be regarded as a special type of local minimax points, called uniformly local minimax points, under mild continuity assumptions. In (non-convex) quadratic games, we show that local minimax points are (in some sense) equivalent to global minimax points. Finally, we study the stability of gradient algorithms near local minimax points. Although gradient algorithms can converge to local/global minimax points in the non-degenerate case, they would often fail in general cases. This implies the necessity of either novel algorithms or concepts beyond saddle points and minimax points in non-convex smooth games.  ( 2 min )
    FedSpace: An Efficient Federated Learning Framework at Satellites and Ground Stations. (arXiv:2202.01267v1 [cs.LG])
    Large-scale deployments of low Earth orbit (LEO) satellites collect massive amount of Earth imageries and sensor data, which can empower machine learning (ML) to address global challenges such as real-time disaster navigation and mitigation. However, it is often infeasible to download all the high-resolution images and train these ML models on the ground because of limited downlink bandwidth, sparse connectivity, and regularization constraints on the imagery resolution. To address these challenges, we leverage Federated Learning (FL), where ground stations and satellites collaboratively train a global ML model without sharing the captured images on the satellites. We show fundamental challenges in applying existing FL algorithms among satellites and ground stations, and we formulate an optimization problem which captures a unique trade-off between staleness and idleness. We propose a novel FL framework, named FedSpace, which dynamically schedules model aggregation based on the deterministic and time-varying connectivity according to satellite orbits. Extensive numerical evaluations based on real-world satellite images and satellite networks show that FedSpace reduces the training time by 1.7 days (38.6%) over the state-of-the-art FL algorithms.  ( 2 min )
    Neural graphical modelling in continuous-time: consistency guarantees and algorithms. (arXiv:2105.02522v3 [stat.ML] UPDATED)
    The discovery of structure from time series data is a key problem in fields of study working with complex systems. Most identifiability results and learning algorithms assume the underlying dynamics to be discrete in time. Comparatively few, in contrast, explicitly define dependencies in infinitesimal intervals of time, independently of the scale of observation and of the regularity of sampling. In this paper, we consider score-based structure learning for the study of dynamical systems. We prove that for vector fields parameterized in a large class of neural networks, least squares optimization with adaptive regularization schemes consistently recovers directed graphs of local independencies in systems of stochastic differential equations. Using this insight, we propose a score-based learning algorithm based on penalized Neural Ordinary Differential Equations (modelling the mean process) that we show to be applicable to the general setting of irregularly-sampled multivariate time series and to outperform the state of the art across a range of dynamical systems.  ( 2 min )
    Fast and explainable clustering based on sorting. (arXiv:2202.01456v1 [cs.LG])
    We introduce a fast and explainable clustering method called CLASSIX. It consists of two phases, namely a greedy aggregation phase of the sorted data into groups of nearby data points, followed by the merging of groups into clusters. The algorithm is controlled by two scalar parameters, namely a distance parameter for the aggregation and another parameter controlling the minimal cluster size. Extensive experiments are conducted to give a comprehensive evaluation of the clustering performance on synthetic and real-world datasets, with various cluster shapes and low to high feature dimensionality. Our experiments demonstrate that CLASSIX competes with state-of-the-art clustering algorithms. The algorithm has linear space complexity and achieves near linear time complexity on a wide range of problems. Its inherent simplicity allows for the generation of intuitive explanations of the computed clusters.  ( 2 min )
    Variational Nearest Neighbor Gaussian Processes. (arXiv:2202.01694v1 [cs.LG])
    Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of O($K^3$). Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location. We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.  ( 2 min )
    Robust Estimation for Nonparametric Families via Generative Adversarial Networks. (arXiv:2202.01269v1 [cs.LG])
    We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating unknown parameter of the true distribution given adversarially corrupted samples. Prior work focus on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyze depth or scoring rule based GAN losses for the problem. Our work extend these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions. We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation. In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work.  ( 2 min )
    Sampling Permutations for Shapley Value Estimation. (arXiv:2104.12199v2 [stat.ML] UPDATED)
    Game-theoretic attribution techniques based on Shapley values are used to interpret black-box machine learning models, but their exact calculation is generally NP-hard, requiring approximation methods for non-trivial models. As the computation of Shapley values can be expressed as a summation over a set of permutations, a common approach is to sample a subset of these permutations for approximation. Unfortunately, standard Monte Carlo sampling methods can exhibit slow convergence, and more sophisticated quasi-Monte Carlo methods have not yet been applied to the space of permutations. To address this, we investigate new approaches based on two classes of approximation methods and compare them empirically. First, we demonstrate quadrature techniques in a RKHS containing functions of permutations, using the Mallows kernel in combination with kernel herding and sequential Bayesian quadrature. The RKHS perspective also leads to quasi-Monte Carlo type error bounds, with a tractable discrepancy measure defined on permutations. Second, we exploit connections between the hypersphere $\mathbb{S}^{d-2}$ and permutations to create practical algorithms for generating permutation samples with good properties. Experiments show the above techniques provide significant improvements for Shapley value estimates over existing methods, converging to a smaller RMSE in the same number of model evaluations.  ( 2 min )
    Rademacher Random Projections with Tensor Networks. (arXiv:2110.13970v3 [cs.LG] UPDATED)
    Random projection (RP) have recently emerged as popular techniques in the machine learning community for their ability in reducing the dimension of very high-dimensional tensors. Following the work in [30], we consider a tensorized random projection relying on Tensor Train (TT) decomposition where each element of the core tensors is drawn from a Rademacher distribution. Our theoretical results reveal that the Gaussian low-rank tensor represented in compressed form in TT format in [30] can be replaced by a TT tensor with core elements drawn from a Rademacher distribution with the same embedding size. Experiments on synthetic data demonstrate that tensorized Rademacher RP can outperform the tensorized Gaussian RP studied in [30]. In addition, we show both theoretically and experimentally, that the tensorized RP in the Matrix Product Operator (MPO) format is not a Johnson-Lindenstrauss transform (JLT) and therefore not a well-suited random projection map  ( 2 min )
    The RoyalFlush System of Speech Recognition for M2MeT Challenge. (arXiv:2202.01614v1 [cs.SD])
    This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge. We adopted the serialized output training (SOT) based multi-speakers ASR system with large-scale simulation data. Firstly, we investigated a set of front-end methods, including multi-channel weighted predicted error (WPE), beamforming, speech separation, speech enhancement and so on, to process training, validation and test sets. But we only selected WPE and beamforming as our frontend methods according to their experimental results. Secondly, we made great efforts in the data augmentation for multi-speaker ASR, mainly including adding noise and reverberation, overlapped speech simulation, multi-channel speech simulation, speed perturbation, front-end processing, and so on, which brought us a great performance improvement. Finally, in order to make full use of the performance complementary of different model architecture, we trained the standard conformer based joint CTC/Attention (Conformer) and U2++ ASR model with a bidirectional attention decoder, a modification of Conformer, to fuse their results. Comparing with the official baseline system, our system got a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.  ( 2 min )
    Partitioning Cloud-based Microservices (via Deep Learning). (arXiv:2109.14569v2 [cs.LG] UPDATED)
    Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are "brittle"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.  ( 2 min )
    Sequential Learning of the Topological Ordering for the Linear Non-Gaussian Acyclic Model with Parametric Noise. (arXiv:2202.01748v1 [stat.ME])
    Causal discovery, the learning of causality in a data mining scenario, has been of strong scientific and theoretical interest as a starting point to identify "what causes what?" Contingent on assumptions, it is sometimes possible to identify an exact causal Directed Acyclic Graph (DAG), as opposed to a Markov equivalence class of graphs that gives ambiguity of causal directions. The focus of this paper is on one such case: a linear structural equation model with non-Gaussian noise, a model known as the Linear Non-Gaussian Acyclic Model (LiNGAM). Given a specified parametric noise model, we develop a novel sequential approach to estimate the causal ordering of a DAG. At each step of the procedure, only simple likelihood ratio scores are calculated on regression residuals to decide the next node to append to the current partial ordering. Under mild assumptions, the population version of our procedure provably identifies a true ordering of the underlying causal DAG. We provide extensive numerical evidence to demonstrate that our sequential procedure is scalable to cases with possibly thousands of nodes and works well for high-dimensional data. We also conduct an application to a single-cell gene expression dataset to demonstrate our estimation procedure.  ( 2 min )
    Deep Hierarchy in Bandits. (arXiv:2202.01454v1 [cs.LG])
    Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits. Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions. We confirm these theoretical findings empirically, in both synthetic and real-world experiments.  ( 2 min )
    Deep Layer-wise Networks Have Closed-Form Weights. (arXiv:2202.01210v1 [stat.ML])
    There is currently a debate within the neuroscience community over the likelihood of the brain performing backpropagation (BP). To better mimic the brain, training a network \textit{one layer at a time} with only a "single forward pass" has been proposed as an alternative to bypass BP; we refer to these networks as "layer-wise" networks. We continue the work on layer-wise networks by answering two outstanding questions. First, $\textit{do they have a closed-form solution?}$ Second, $\textit{how do we know when to stop adding more layers?}$ This work proves that the Kernel Mean Embedding is the closed-form weight that achieves the network global optimum while driving these networks to converge towards a highly desirable kernel for classification; we call it the $\textit{Neural Indicator Kernel}$.  ( 2 min )
    Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference. (arXiv:2202.01243v1 [stat.ML])
    A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (c.f., deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models are more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model. We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model with Gaussian data that the membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior. Finally, we study different methods for mitigating such attacks in the overparameterized regime, such as noise addition and regularization, and conclude that simply reducing the parameters of an overparameterized model is an effective strategy to protect it from membership inference without greatly decreasing its generalization error.  ( 2 min )
    Directed hypergraph neural network. (arXiv:2008.03626v2 [stat.ML] UPDATED)
    To deal with irregular data structure, graph convolution neural networks have been developed by a lot of data scientists. However, data scientists just have concentrated primarily on developing deep neural network method for un-directed graph. In this paper, we will present the novel neural network method for directed hypergraph. In the other words, we will develop not only the novel directed hypergraph neural network method but also the novel directed hypergraph based semi-supervised learning method. These methods are employed to solve the node classification task. The two datasets that are used in the experiments are the cora and the citeseer datasets. Among the classic directed graph based semi-supervised learning method, the novel directed hypergraph based semi-supervised learning method, the novel directed hypergraph neural network method that are utilized to solve this node classification task, we recognize that the novel directed hypergraph neural network achieves the highest accuracies.  ( 2 min )
    Adaptive Clustering and Personalization in Multi-Agent Stochastic Linear Bandits. (arXiv:2106.08902v2 [stat.ML] UPDATED)
    We consider the problem of minimizing regret in an $N$ agent heterogeneous stochastic linear bandits framework, where the agents (users) are similar but not all identical. We model user heterogeneity using two popularly used ideas in practice; (i) A clustering framework where users are partitioned into groups with users in the same group being identical to each other, but different across groups, and (ii) a personalization framework where no two users are necessarily identical, but a user's parameters are close to that of the population average. In the clustered users' setup, we propose a novel algorithm, based on successive refinement of cluster identities and regret minimization. We show that, for any agent, the regret scales as $\mathcal{O}(\sqrt{T/N})$, if the agent is in a `well separated' cluster, or scales as $\mathcal{O}(T^{\frac{1}{2} + \varepsilon}/(N)^{\frac{1}{2} -\varepsilon})$ if its cluster is not well separated, where $\varepsilon$ is positive and arbitrarily close to $0$. Our algorithm is adaptive to the cluster separation, and is parameter free -- it does not need to know the number of clusters, separation and cluster size, yet the regret guarantee adapts to the inherent complexity. In the personalization framework, we introduce a natural algorithm where, the personal bandit instances are initialized with the estimates of the global average model. We show that, an agent $i$ whose parameter deviates from the population average by $\epsilon_i$, attains a regret scaling of $\widetilde{O}(\epsilon_i\sqrt{T})$. This demonstrates that if the user representations are close (small $\epsilon_i)$, the resulting regret is low, and vice-versa. The results are empirically validated and we observe superior performance of our adaptive algorithms over non-adaptive baselines.  ( 2 min )
    Byzantine-Robust Decentralized Learning via Self-Centered Clipping. (arXiv:2202.01545v1 [cs.LG])
    In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus. We identify a novel dissensus attack in which few malicious nodes can take advantage of information bottlenecks in the topology to poison the collaboration. To address these issues, we propose a Self-Centered Clipping (SCClip) algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to a $O(\delta_{\max}\zeta^2/\gamma^2)$ neighborhood of the stationary point for non-convex objectives under standard assumptions. Finally, we demonstrate the encouraging empirical performance of SCClip under a large number of attacks.  ( 2 min )
    Multiclass learning with margin: exponential rates with no bias-variance trade-off. (arXiv:2202.01773v1 [stat.ML])
    We study the behavior of error bounds for multiclass classification under suitable margin conditions. For a wide variety of methods we prove that the classification error under a hard-margin condition decreases exponentially fast without any bias-variance trade-off. Different convergence rates can be obtained in correspondence of different margin assumptions. With a self-contained and instructive analysis we are able to generalize known results from the binary to the multiclass setting.  ( 2 min )

  • Open

    [R] "Unified Scaling Laws for Routed Language Models", Clark et al 2022 Deepmind (detailed MoE scaling analysis; MoE advantage currently disappears at ~900b dense-parameters)
    Paper: https://arxiv.org/abs/2202.01169#deepmind Abstract: "The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters. " https://preview.redd.it/23p0dih3awf81.jpg?width=821&format=pjpg&auto=webp&s=5c86dc18387d1ad1af02a58847dc4f18ada80f8f submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
    [D] What does the perfect GPU host look like?
    I am starting up a hosting business and I want to focus on AI and HPC applications. Mostly just because I love the technology and I want to find ways to better serve the audience (you). ​ What are some things you would like a GPU hosting provider to offer that is currently hard to find? What is most annoying thing about billing when it comes to hosting providers (like aws, azure, lambda cloud). Is cost really the only metric you look for? Any cool ideas that would help you get more out of what you pay for? ​ I would love to hear what you guys think. Personally I think the whole industry is not trying hard enough when it comes to innovation. submitted by /u/RedmondHosting [link] [comments]  ( 1 min )
    [R] Datamodels: Predicting Predictions from Training Data
    Hi r/MachineLearning. We just released a paper on "Datamodeling" a framework for approximating the end-to-end training and evaluation pipeline as a direct (closed-form) function of the training data. We trained *hundreds of thousands* of models on random subsets of computer vision datasets using our library FFCV (http://ffcv.io, mentioned in a post last week!). We then used this data to fit *linear* models that can successfully predict model outputs from the presence of training points. These linear models turn out to enable a bunch of really cool applications like counterfactual prediction, brittleness tests, clustering, train-test leakage detection, and more---let us know what you think! Paper: https://arxiv.org/abs/2202.00622 Blog (Part 1/4): https://gradientscience.org/datamodels-1/ Data release: https://github.com/MadryLab/datamodels-data submitted by /u/andrew_ilyas [link] [comments]  ( 1 min )
    [D] Connor Leahy on EleutherAI, Replicating GPT-2/GPT-3, AI Risk and Alignment
    Hey, just want to share the new Gradient interview with Connor Leahy. It has a lot of fun stories about getting into the field of AI from as a sort of hacker and how EleutherAI got going and still works as a group passion project of a bunch of hackers, so it's a fun listen! Things discussed: Replicating GPT2–1.5B , GPT2, Counting Consciousness and the Curious Hacker The Hacker Learns to Trust The Pile GPT-Neo GPT-J Why Release a Large Language Model? What A Long, Strange Trip It's Been: EleutherAI One Year Retrospective GPT-NeoX Sections: (00:00:00) Intro (00:01:20) Start in AI (00:08:00) Being excited about GPT-2 (00:18:00) Discovering AI safety and alignment (00:21:10) Replicating GPT-2 (00:27:30) Deciding whether to relese GPT-2 weights (00:36:15) Life after GPT-2 (00:40:05) GPT-3 and Start of Eleuther AI (00:44:40) Early days of Eleuther AI (00:47:30) Creating the Pile, GPT-Neo, Hacker Culture (00:55:10) Growth of Eleuther AI, Cultivating Community (01:02:22) Why release a large language model (01:08:50) AI Risk and Alignment (01:21:30) Worrying (or not) about Superhuman AI (01:25:20) AI alignment and releasing powerful models (01:32:08) AI risk and research norms (01:37:10) Work on GPT-3 replication, GPT-NeoX (01:38:48) Joining Eleuther AI (01:43:28) Personal interests / hobbies (01:47:20) Outro submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    [D] GPT-NeoX-20B - Interview w/ EleutherAI co-founder Connor Leahy (Video)
    https://youtu.be/AJwnbSP_rq8 EleutherAI announces GPT-NeoX-20B, a 20 billion parameter open-source language model, inspired by GPT-3. Connor joins me to discuss the process of training, how the group got their hands on the necessary hardware, what the new model can do, and how anyone can try it out! ​ OUTLINE: 0:00 - Intro 1:00 - Start of interview 2:00 - How did you get all the hardware? 3:50 - What's the scale of this model? 6:00 - A look into the experimental results 11:15 - Why are there GPT-Neo, GPT-J, and GPT-NeoX? 14:15 - How difficult is training these big models? 17:00 - Try out the model on GooseAI 19:00 - Final thoughts ​ Read the announcement: https://blog.eleuther.ai/announcing-20b/ Try out the model: https://goose.ai/ Check out EleutherAI: https://www.eleuther.ai/ Read the code: https://github.com/EleutherAI/gpt-neox Hardware sponsor: https://www.coreweave.com/ submitted by /u/ykilcher [link] [comments]  ( 1 min )
    Holy $#!t: Are popular toxicity models simply profanity detectors? [D]
    One of the problems with real world machine learning is that engineers often treat models as pure black boxes to be optimized, ignoring the datasets behind them. I've often worked with ML engineers who can't give you any examples of false positives they want their models to fix! Perhaps this is okay when your datasets are high-quality and representative of the real world, but they're usually not. For example, many toxicity and hate speech datasets mistakenly flag texts like "this is fucking awesome!" as toxic, even though they're actually quite positive -- because NLP datasets are often labeled by non-fluent speakers who pattern match on profanity. (So is 99% accuracy or 99% precision actually a good thing? Not if your test sets are inaccurate as well!) Many of the new, massive scale language models use the Perspective API to measure their safety. But we've noticed a number of Perspective API mistakes on texts containing positive profanity, so I wrote a blog post in an attempt to explain the problem and quantify it. Note: I work for Surge AI / this is OC. submitted by /u/BB4evaTB12 [link] [comments]  ( 3 min )
    [D] How should machine learning practitioners prioritise their study time?
    Machine learning and data science are vast fields. Some subjects useful to know about are: Maths and statistics fundamentals Machine learning fundamentals Software engineering principles, proficiency in programming language of choice (e.g. python), Learning new frameworks/languages Keeping up with latest methods and reading research papers Project organisation and ML in practice. Let's say you are a professional data scientist / MLE but you're little rusty in some of these areas (which many of us probably are), and wanted to brush up on your knowledge. How would you go about prioritising your study? Should you: focus on one area at a time, or pick two or three areas to work on simultaneously? take a "bottom-up" approach by studying fundamentals first, and high-level implementation second, or an opposite "top-down" approach? What other considerations do you think are relevant? Would love to hear your thoughts. submitted by /u/Hgat [link] [comments]  ( 2 min )
    [R] [2010.00406] Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization
    submitted by /u/fasttosmile [link] [comments]
    [P] CNN with hyper-spectral images (HSI)
    I am working on an object detection project and the input images changed from RGB to Hyper-spectral (HSI). HSI have sizes of e.g. (224,244,220) . Is there a framework where I can use Faster R-CNN, Mask R-CNN or YOLOv4 for that task using HSI? submitted by /u/giakou4 [link] [comments]  ( 2 min )
    [D] Issue calculating consistency cost in mean teacher setup for token classification problem
    I am planning to apply mean-teacher for my problem of token classification. Since adding different noise for teacher and student is really important for the approach, i am confused about how to calculate consistency cost as length of active logits would differ. for e.g. if i use synonym noise then it can happen that it increases the length of the sentence (some tokens maybe replaces by synonym of len 2/3) when given to teacher model and same augmentation/Noise(different ratio) may generate augmented sentence of different length when given to student model. How should i calculate this loss, i am quite stuck, please help! Paper link for ref. : https://arxiv.org/abs/1703.01780 submitted by /u/cr7tub [link] [comments]  ( 1 min )
  • Open

    Steps to Bridge AI Talent Shortage
    Artificial Intelligence’s fame in the enterprises continues to develop, yet practices and developments remain stagnant as associations run into deterrents while deploying AI systems. O’Reilly’s 2021 AI Adoption in the Enterprise report, which reviewed in excess of 3,500 business pioneers, observed that an absence of gifted individuals and trouble hiring bested the rundown of difficulties… Read More »Steps to Bridge AI Talent Shortage The post Steps to Bridge AI Talent Shortage appeared first on Data Science Central.  ( 4 min )
    Javascript Tips For Conditional Expressions
    The Javascript language has, over the last few years, evolved dramatically, to the extent that even those who are regular Javascript developers are likely to not fully appreciate all that can be done with the web’s most ubiquitous language. In this particular article, I hope to concentrate on the realm of conditional expressions. Most Javascript… Read More »Javascript Tips For Conditional Expressions The post Javascript Tips For Conditional Expressions appeared first on Data Science Central.  ( 4 min )
    Natural Interfaces Fuel Expansion of Mixed Reality Market
    Mixed reality technology finds application in a wide range of industries such as e-commerce & retail, healthcare, entertainment, automotive & aerospace. Of them, the concept is being increasingly adopted in the healthcare sector globally. This factor is resulting in revenue-generation opportunities in the global mixed reality market in the upcoming years. The rising adoption of… Read More »Natural Interfaces Fuel Expansion of Mixed Reality Market The post Natural Interfaces Fuel Expansion of Mixed Reality Market appeared first on Data Science Central.  ( 2 min )
    Cyber Security as a Service: Vital Growth-Multiplying Trends of 2022
    Cyber attacks are tightening their grip across various government and non-government firms. In this grim situation, technology can only be the answer to overcome it. A layman cannot overcome these threats. A full-fledged service is needed to tackle such threats. This is where cyber security as a service (CSaaS) steps in. Cybersecurity has emerged not… Read More »Cyber Security as a Service: Vital Growth-Multiplying Trends of 2022 The post Cyber Security as a Service: Vital Growth-Multiplying Trends of 2022 appeared first on Data Science Central.  ( 3 min )
    How to Build a Product Development Roadmap?
    The product roadmap envisions the step-by-step developmental process of a product. The advantage of having a roadmap is that it defines everything expected from the Product. It communicates the requirements to all the stakeholders. With a roadmap in place, it becomes comprehensive to explain the strategy, particularly to different internal stakeholders. When do you require… Read More »How to Build a Product Development Roadmap? The post How to Build a Product Development Roadmap? appeared first on Data Science Central.  ( 5 min )
    How Augmented Reality Can Boost Social Media Videos
    Augmented Reality has done so much to improve the lives of people. In this day and age, AR is making a huge impact on how people are communicating on social communities online. But what is it and what is its role in social media videos? How will it affect brand marketing and user experience? What… Read More »How Augmented Reality Can Boost Social Media Videos The post How Augmented Reality Can Boost Social Media Videos appeared first on Data Science Central.  ( 4 min )
    Learn how Digital Transformation has Impacted Accounting
    In the last few years, the role of an accountant has emerged significantly, and crucial technologies have played a major impact on the accounting and finance professionals’ business during those years. Accounting is currently going through a period of major changes due to the immense popularity of cloud-based accounting that has impacted and altered even… Read More »Learn how Digital Transformation has Impacted Accounting The post Learn how Digital Transformation has Impacted Accounting appeared first on Data Science Central.  ( 3 min )
  • Open

    Connor Leahy on EleutherAI, Replicating GPT-2/GPT-3, AI Risk and Alignment
    submitted by /u/regalalgorithm [link] [comments]
    How AI Is Deciding Who Gets Hired
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Using Physics to Teach Computers to Speak! Unity3D + GPT Project Overview
    submitted by /u/itsallshit-eatup [link] [comments]
    [R] DeepMind’s AlphaCode Generates Code at a Level Competitive With Human Programmers
    A DeepMind research team presents AlphaCode, an automated code-generation system that can create novel solutions for programming problems that require deep reasoning and achieves a top 54.3% ranking in programming competitions. Here is a quick read: DeepMind’s AlphaCode Generates Code at a Level Competitive With Human Programmers. DeepMind has released its CodeContests dataset of competitive programming problems and solutions on GitHub. The paper Competition-Level Code Generation with AlphaCode is on Google Cloud Storage. submitted by /u/Yuqing7 [link] [comments]  ( 1 min )
    Virtual staff members powered by artificial intelligence have increasingly become a fixture at South Korea’s retail banks.. AI bankers would be deployed to its banking activity centers where customers can learn and experience AI-powered banking services.
    submitted by /u/dannylenwinn [link] [comments]  ( 1 min )
    Image to hand drawn
    Hi, I'm looking for more projects that will turn an image into a "hand drawn" image. These are the ones I've found so far. They are all based on the same dataset from APDrawingGAN. This is a scaled down image. The originals were generated at 1200px width (512 for APDrawGAN) Steve Mcqueen ​ Medy Lamarr APDrawingGAN is optimized for portraits. I used it with generic settings as I want the model to be able to work on any image. Sources: U2Net, ArtLine, Pix2PixHD, APDrawingGAN What other Projects are out there that already do similar things or which one should be able to be trained like that? submitted by /u/krummrey [link] [comments]  ( 1 min )
    Automatised search of hyperparameters for SARIMA
    Wrote hopefully interesting blog for automation of AR, I and MA parameter for SARIMA on C02 data. https://vitomirj.medium.com/auto-arima-hyperparameter-search-ab991a21c2bd submitted by /u/Toica_Rasta [link] [comments]
    AI, the brain, and cognitive plausibility
    submitted by /u/bendee983 [link] [comments]
    Real-time Food Quality Prediction. Detect spoiled products using the Tiny Machine Learning approach.
    Things used in this project Hardware components: Arduino Mega 2560 Software apps and online services: Neuton Tiny ML https://i.redd.it/eguem17vktf81.gif Real-time Food Quality Prediction. Story With each passing year, the issue of food waste becomes more acute for the environment. A recent Food Waste Index Report by the United Nations Environment Program (UNEP) showed that, on average, consumers waste almost a billion tons of food per year (or 17 percent of all food purchased): https://www.unep.org/resources/report/unep-food-waste-index-report-2021 The fact that people produce more food than they consume has significant negative consequences. For example, an estimated 8-10% of global greenhouse gas emissions come from unused food. On the contrary, reducing food waste will help to…  ( 4 min )
    Top 10 Sentiment Analysis APIs
    submitted by /u/tah_zem [link] [comments]
    What AI is there, which I can use to make extreme realistic high quality images?
    It can generate or edit already existing ones, I only want to get the copyright of the image. submitted by /u/xXLisa28Xx [link] [comments]
    Artificial Nightmares : The Emerald Dragons || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    How to scrape Google Local Results with Artificial Intelligence?
    submitted by /u/Kagermanov [link] [comments]  ( 1 min )
    Google DeepMind's AlphaCode is a Game Changer for AI Coding
    submitted by /u/Beautiful-Credit-868 [link] [comments]  ( 2 min )
  • Open

    Neural network experiment. Want also?
    submitted by /u/vlad_ma [link] [comments]
    I have to simulate a neural network with some data from a .csv file. Anyone know where I should start as a programming newbie?
    I've been told to use the scikit learn library submitted by /u/basedphrog [link] [comments]  ( 1 min )
  • Open

    A closer look at zero spacing
    A few days ago I wrote about Kneser’s theorem. This theorem tells whether the differential equation u ″(x) + h(x) u(x) = 0 will oscillate indefinitely, i.e. whether it will have an infinite number of zeros. Today’s post will look at another theorem that gives more specific information about the spacing of the zeros. Kneser’s theorem […] A closer look at zero spacing first appeared on John D. Cook.  ( 3 min )
  • Open

    DIAMBRA a New Ecosystem for Reinforcement Learning Research and Experimentation
    We are excited to announce that the New DIAMBRA Universe🤖🎮 is finally out! This upgrade is HUGE (all links in the comments below): ⚔️🛡️𝗗𝗜𝗔𝗠𝗕𝗥𝗔 𝗔𝗿𝗲𝗻𝗮 𝘃𝟭.𝟬 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝗱. It features an exclusive collection of high-quality #Reinforcement Learning environments, exposed through a Python API fully compliant with OpenAI Gym standard and super easy to use. Supports all major OSs (Linux, Windows and MacOS), and the GitHub repository provides a collection of examples ready for execution. 📙𝗗𝗜𝗔𝗠𝗕𝗥𝗔 𝗔𝗿𝗲𝗻𝗮 𝗗𝗼𝗰𝘂𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻. A comprehensive guide covering from package step by step installation to environments details and usage. 🔥𝗔 𝗱𝗲𝗱𝗶𝗰𝗮𝘁𝗲𝗱 𝗱𝗼𝗺𝗮𝗶𝗻 𝗮𝗻𝗱 𝗮𝗻 𝗮𝗺𝗮𝘇𝗶𝗻𝗴 𝗻𝗲𝘄 𝘄𝗲𝗯𝘀𝗶𝘁𝗲, the foundations of our soon-to-be-released AI Tournaments Platform. Our new home (link in comments): simple, short, clear. AI is what we love. Here is where you'll find us. 🖥️Together with our Twitch Channel and Discord Server, we also created our 𝗱𝗲𝗱𝗶𝗰𝗮𝘁𝗲𝗱 𝗽𝗮𝗴𝗲𝘀 𝗼𝗻 𝗟𝗶𝗻𝗸𝗲𝗱𝗶𝗻 𝗮𝗻𝗱 𝗬𝗼𝘂𝗧𝘂𝗯𝗲. Follow us and subscribe to stay tuned with all the great updates we are going to share in the coming weeks! submitted by /u/DIAMBRA_AIArena [link] [comments]  ( 1 min )
    Why does MADDPG use action log prob for Q (Critic) instead of sampled action?
    In MADDPG paper it is stated Note also that, in the above equation, we input the action log probabilities of each agent directly into Q, rather than sampling. What is the benefit of feeding log prob instead of sampled action like in DDPG? Edit: Also I dont quite understand replay buffer in MADDPG. In paper they used one buffer to store the tuples (x, x′, a1, . . . , aN , r1, . . . , rN ) for simple MADDPG, but when adding Policy Ensembles they state: Since different sub-policies will be executed in different episodes, we maintain a replay buffer D(k)i for each sub-policy μ(k)i of agent i. Why does each policy require separate replay buffer instead of single D from which we can sample? ​ submitted by /u/TheGuy839 [link] [comments]  ( 1 min )
    [H] agent learns to stay in the same place and not go any where (in a racing env)
    hi there guys, i'm trying to train an agent to play the game SuperTuxKart using this python interface called pystk. i tried training the agent on raw RGB pixels, turns out it usually doesn't work, and just as expected that didn't seem to learn anything (i'm totally new to RL and this is my first project so don't judge what i'm doing here). so i trained a VAE for representation learning (using grayscale images). it seems to work pretty well. (i used beta-VAE here) i'm currently using LSTM to which i pass the previous 5 latent space representations produced by the VAE. then i pass the outputs of the LSTM network to 2 heads (actor and critic) and take actions based on that. i'm using the PPO algorithm to train the agent. the implementation is my own. (although i looked at it like 10 times and everything seems to be fine there) the action space is multi discrete, so i'm using a MultiCategorical class (the same as the one from stable_baselines3). but then the agent still doesn't learn to do anything. after 1500 steps of training the agent has learned to stay in the same place and rotate. so any further training is not affecting the agent (cause it's staying in the same place). also as a side note i give the agent -ve rewards for staying in the same place, so i don't know why it doesn't want to move. here is the code if anyone wants to take a look at it - https://github.com/jedi2610/tuxkart-ai i also read andyljones's rl debugging post but i can't figure out where i'm going wrong. if anyone can help me figure out where i'm going wrong it would be supercool. ty submitted by /u/jedi1026 [link] [comments]  ( 2 min )
    Can Rainbow learn stochastic policy?
    policy gradient methods are used when one wants to learn stochastic policy. I.e optimal action has certain probability let’s say 35.5%. Can Rainbow algorithm also do that? It’s strongly based on DQN and DQN picks action based on its maximal Q value. I’m pretty sure DQN can’t, but about Rainbow I’m not so sure. As I’m not familiar with the details. submitted by /u/basic_r_user [link] [comments]  ( 1 min )
    Can you learn a discrete action using a continuous action-space?
    You're controlling a simple 6-DOF simulated arm, and want to give it the ability to grab objects, but don't want to deal with simulating grip mechanics. If you threshold the 7th action to signify [open vs close] can the continuous action-space network (e.g. SAC), learn to use it's continuous 7th value as a discrete action? State space include a one-hot vector indicating [open vs close], joint rotations, and linear distance to goal. Any thoughts? submitted by /u/RoamBear [link] [comments]  ( 1 min )
    Promising options for super-computer scale RL parallelism with very fast environment.
    Hi guys, I'm trying to implement a super-computer-scale RL system that can train very high quality agents very quickly. I just tried combining ES as as scalable alternative to reinforcement learning with PPO (PPO training on workers before evaluation) and it failed. So I looked to more conventional parallelism methods, but they often seem to bank on the idea that environment evaluation is slow (e.g. parallelism is strictly for env evaluation). But my environment evaluates almost instantly. Given this situation are there actually promising multi-node parallelism options for me to use? P.S. I think the GA thing failed because individual agents diverged too much before weight averaging step (but I'm not sure). submitted by /u/Udon_noodles [link] [comments]  ( 1 min )
    Centralized-Learning Distributed-Execution for Multi Agent RL using SB3
    Hello there :D I am about to get into using Stable Baselines3 for my project, which deals with multi agent RL. I aim to use the concept of Centralized Learning and Distributed execution with PPO (centralized critic, distributed actors). Does SB3 support implementations for this? submitted by /u/AhmedNizam_ [link] [comments]  ( 1 min )
  • Open

    AI-generated Valentine's Cards
    I remember having to give out Valentine cards in elementary school, a mandatory expression of not-actual-valentine-feeling in which you could express your individuality and warm regard for your best friends and the classroom bully by giving them small store-bought cards with cartoon characters and generic messages on them. But what  ( 3 min )
    Bonus: more AI-generated valentine cards!
    AI Weirdness: the strange side of machine learning  ( 1 min )
  • Open

    Accelerating non-LTE synthesis and inversions with graph networks. (arXiv:2111.10552v2 [astro-ph.SR] UPDATED)
    The computational cost of fast non-LTE synthesis is one of the challenges that limits the development of 2D and 3D inversion codes. It also makes the interpretation of observations of lines formed in the chromosphere and transition region a slow and computationally costly process, which limits the inference of the physical properties on rather small fields of view. Having access to a fast way of computing the deviation from the LTE regime through the departure coefficients could largely alleviate this problem. We propose to build and train a graph network that quickly predicts the atomic level populations without solving the non-LTE problem. We find an optimal architecture for the graph network for predicting the departure coefficients of the levels of an atom from the physical conditions of a model atmosphere. A suitable dataset with a representative sample of potential model atmospheres is used for training. This dataset has been computed using existing non-LTE synthesis codes. The graph network has been integrated into existing synthesis and inversion codes for the particular case of \caii. We demonstrate orders of magnitude gain in computing speed. We analyze the generalization capabilities of the graph network and demonstrate that it produces good predicted departure coefficients for unseen models. We implement this approach in \hazel\ and show how the inversions nicely compare with those obtained with standard non-LTE inversion codes. Our approximate method opens up the possibility of extracting physical information from the chromosphere on large fields-of-view with time evolution.  ( 2 min )
    Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization. (arXiv:2112.04459v2 [eess.AS] UPDATED)
    Training speaker-discriminative and robust speaker verification systems without speaker labels is still challenging and worthwhile to explore. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Different from contrastive learning-based self-supervised learning methods, the proposed self-supervised regularization (SSReg) focuses exclusively on the similarity between the latent representations of positive data pairs. We also explore the effectiveness of alternative online data augmentation strategies on both the time domain and frequency domain. With our strong online data augmentation strategy, the proposed SSReg shows the potential of self-supervised learning without using negative pairs and it can significantly improve the performance of self-supervised speaker representation learning with a simple Siamese network architecture. Comprehensive experiments on the VoxCeleb datasets demonstrate that our proposed self-supervised approach obtains a 23.4% relative improvement by adding the effective self-supervised regularization and outperforms other previous works.  ( 2 min )
    Normalise for Fairness: A Simple Normalisation Technique for Fairness in Regression Machine Learning Problems. (arXiv:2202.00993v1 [cs.LG])
    Algorithms and Machine Learning (ML) are increasingly affecting everyday life and several decision-making processes, where ML has an advantage due to scalability or superior performance. Fairness in such applications is crucial, where models should not discriminate their results based on race, gender, or other protected groups. This is especially crucial for models affecting very sensitive topics, like interview hiring or recidivism prediction. Fairness is not commonly studied for regression problems compared to binary classification problems; hence, we present a simple, yet effective method based on normalisation (FaiReg), which minimises the impact of unfairness in regression problems, especially due to labelling bias. We present a theoretical analysis of the method, in addition to an empirical comparison against two standard methods for fairness, namely data balancing and adversarial training. We also include a hybrid formulation (FaiRegH), merging the presented method with data balancing, in an attempt to face labelling and sample biases simultaneously. The experiments are conducted on the multimodal dataset First Impressions (FI) with various labels, namely personality prediction and interview screening score. The results show the superior performance of diminishing the effects of unfairness better than data balancing, also without deteriorating the performance of the original problem as much as adversarial training.
    Make Some Noise: Reliable and Efficient Single-Step Adversarial Training. (arXiv:2202.01181v1 [cs.LG])
    Recently, Wong et al. showed that adversarial training with single-step FGSM leads to a characteristic failure mode named catastrophic overfitting (CO), in which a model becomes suddenly vulnerable to multi-step attacks. They showed that adding a random perturbation prior to FGSM (RS-FGSM) seemed to be sufficient to prevent CO. However, Andriushchenko and Flammarion observed that RS-FGSM still leads to CO for larger perturbations, and proposed an expensive regularizer (GradAlign) to avoid CO. In this work, we methodically revisit the role of noise and clipping in single-step adversarial training. Contrary to previous intuitions, we find that using a stronger noise around the clean sample combined with not clipping is highly effective in avoiding CO for large perturbation radii. Based on these observations, we then propose Noise-FGSM (N-FGSM) that, while providing the benefits of single-step adversarial training, does not suffer from CO. Empirical analyses on a large suite of experiments show that N-FGSM is able to match or surpass the performance of previous single-step methods while achieving a 3$\times$ speed-up.
    Sequential- and Parallel- Constrained Max-value Entropy Search via Information Lower Bound. (arXiv:2102.09788v4 [cs.LG] UPDATED)
    Max-value entropy search (MES) is one of the state-of-the-art approaches in Bayesian optimization (BO). In this paper, we propose a novel variant of MES for constrained problems, called Constrained MES via Information lower BOund (CMES-IBO), that is based on a Monte Carlo (MC) estimator of a lower bound of a mutual information (MI). Unlike existing studies, our MI is defined so that uncertainty with respect to feasibility can be incorporated. We derive a lower bound of the MI that guarantees non-negativity, while a constrained counterpart of conventional MES can be negative. We further provide theoretical analysis that assures the low-variability of our estimator which has never been investigated for any existing information-theoretic BO. Moreover, using the conditional MI, we extend CMES-IBO to the parallel setting while maintaining the desirable properties. We demonstrate the effectiveness of CMES-IBO by several benchmark functions and real-world problems.  ( 2 min )
    Pop Quiz! Can a Large Language Model Help With Reverse Engineering?. (arXiv:2202.01142v1 [cs.SE])
    Large language models (such as OpenAI's Codex) have demonstrated impressive zero-shot multi-task capabilities in the software domain, including code explanation. In this work, we examine if this ability can be used to help with reverse engineering. Specifically, we investigate prompting Codex to identify the purpose, capabilities, and important variable names or values from code, even when the code is produced through decompilation. Alongside an examination of the model's responses in answering open-ended questions, we devise a true/false quiz framework to characterize the performance of the language model. We present an extensive quantitative analysis of the measured performance of the language model on a set of program purpose identification and information extraction tasks: of the 136,260 questions we posed, it answered 72,754 correctly. A key takeaway is that while promising, LLMs are not yet ready for zero-shot reverse engineering.
    Scientific multi-agent reinforcement learning for wall-models of turbulent flows. (arXiv:2106.11144v2 [physics.flu-dyn] UPDATED)
    The predictive capabilities of turbulent flow simulations, critical for aerodynamic design and weather prediction, hinge on the choice of turbulence models. The abundance of data from experiments and simulations and the advent of machine learning have provided a boost to turbulence modeling efforts. However, simulations of turbulent flows remain hindered by the inability of heuristics and supervised learning to model the near-wall dynamics. We address this challenge by introducing scientific multi-agent reinforcement learning (SciMARL) for the discovery of wall models for large-eddy simulations (LES). In SciMARL, discretization points act also as cooperating agents that learn to supply the LES closure model. The agents self-learn using limited data and generalize to extreme Reynolds numbers and previously unseen geometries. The present simulations reduce by several orders of magnitude the computational cost over fully-resolved simulations while reproducing key flow quantities. We believe that SciMARL creates unprecedented capabilities for the simulation of turbulent flows.  ( 2 min )
    Federated Reinforcement Learning for Collective Navigation of Robotic Swarms. (arXiv:2202.01141v1 [cs.RO])
    The recent advancement of Deep Reinforcement Learning (DRL) contributed to robotics by allowing automatic controller design. Automatic controller design is a crucial approach for designing swarm robotic systems, which require more complex controller than a single robot system to lead a desired collective behaviour. Although DRL-based controller design method showed its effectiveness, the reliance on the central training server is a critical problem in the real-world environments where the robot-server communication is unstable or limited. We propose a novel Federated Learning (FL) based DRL training strategy for use in swarm robotic applications. As FL reduces the number of robot-server communication by only sharing neural network model weights, not local data samples, the proposed strategy reduces the reliance on the central server during controller training with DRL. The experimental results from the collective learning scenario showed that the proposed FL-based strategy dramatically reduced the number of communication by minimum 1600 times and even increased the success rate of navigation with the trained controller by 2.8 times compared to the baseline strategies that share a central server. The results suggest that our proposed strategy can efficiently train swarm robotic systems in the real-world environments with the limited robot-server communication, e.g. agri-robotics, underwater and damaged nuclear facilities.  ( 2 min )
    Multilingual training for Software Engineering. (arXiv:2112.02043v3 [cs.SE] UPDATED)
    Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have all been subject to this approach, with performance gradually improving over the past several years with better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data maybe more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which performs the same function), is rather similar, and particularly preserving of identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for 3 different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.  ( 2 min )
    Autoregressive Diffusion Models. (arXiv:2110.02037v2 [cs.LG] UPDATED)
    We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective similar to modern probabilistic diffusion models that scales favourably to highly-dimensional data. At test time, ARDMs support parallel generation which can be adapted to fit any given generation budget. We find that ARDMs require significantly fewer steps than discrete diffusion models to attain the same performance. Finally, we apply ARDMs to lossless compression, and show that they are uniquely suited to this task. Contrary to existing approaches based on bits-back coding, ARDMs obtain compelling results not only on complete datasets, but also on compressing single data points. Moreover, this can be done using a modest number of network calls for (de)compression due to the model's adaptable parallel generation.  ( 2 min )
    LocUNet: Fast Urban Positioning Using Radio Maps and Deep Learning. (arXiv:2202.00738v1 [cs.LG])
    This paper deals with the problem of localization in a cellular network in a dense urban scenario. Global Navigation Satellite Systems (GNSS) typically perform poorly in urban environments, where the likelihood of line-of-sight conditions is low, and thus alternative localization methods are required for good accuracy. We present LocUNet: A deep learning method for localization, based merely on Received Signal Strength (RSS) from Base Stations (BSs), which does not require any increase in computation complexity at the user devices with respect to the device standard operations, unlike methods that rely on time of arrival or angle of arrival information. In the proposed method, the user to be localized reports the RSS from BSs to a Central Processing Unit (CPU), which may be located in the cloud. Alternatively, the localization can be performed locally at the user. Using estimated pathloss radio maps of the BSs, LocUNet can localize users with state-of-the-art accuracy and enjoys high robustness to inaccuracies in the radio maps. The proposed method does not require pre-sampling of the environment; and is suitable for real-time applications, thanks to the RadioUNet, a neural network-based radio map estimator. We also introduce two datasets that allow numerical comparisons of RSS and Time of Arrival (ToA) methods in realistic urban environments.  ( 2 min )
    Giga-scale Kernel Matrix Vector Multiplication on GPU. (arXiv:2202.01085v1 [math.NA])
    Kernel matrix vector multiplication (KMVM) is a ubiquitous operation in machine learning and scientific computing, spanning from the kernel literature to signal processing. As kernel matrix vector multiplication tends to scale quadratically in both memory and time, applications are often limited by these computational scaling constraints. We propose a novel approximation procedure coined Faster-Fast and Free Memory Method ($\text{F}^3$M) to address these scaling issues for KMVM. Extensive experiments demonstrate that $\text{F}^3$M has empirical \emph{linear time and memory} complexity with a relative error of order $10^{-3}$ and can compute a full KMVM for a billion points \emph{in under one minute} on a high-end GPU, leading to a significant speed-up in comparison to existing CPU methods. We further demonstrate the utility of our procedure by applying it as a drop-in for the state-of-the-art GPU-based linear solver FALKON, \emph{improving speed 3-5 times} at the cost of $<$1\% drop in accuracy.  ( 2 min )
    Non-Stationary Dueling Bandits. (arXiv:2202.00935v1 [cs.LG])
    We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments, each of which is associated with its own preference matrix. The learner repeatedly selects a pair of arms and observes a binary preference between them as feedback. To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible, despite preference matrices and segment lengths being unknown. We propose the $\mathrm{Beat\, the\, Winner\, Reset}$ algorithm and prove a bound on its expected binary weak regret in the stationary case, which tightens the bound of current state-of-art algorithms. We also show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$. We further propose and analyze two meta-algorithms, $\mathrm{DETECT}$ for weak regret and $\mathrm{Monitored\, Dueling\, Bandits}$ for strong regret, both based on a detection-window approach that can incorporate any dueling bandit algorithm as a black-box algorithm. Finally, we prove a worst-case lower bound for expected weak regret in the non-stationary case.  ( 2 min )
    Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Care Domain. (arXiv:2202.00916v1 [cs.LG])
    This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features. The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions. However, prior works often learn the model by maximizing the predictive accuracy instead of final RMAB solution quality, causing a mismatch between training and evaluation objectives. To address this shortcoming we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality. We present three key contributions: (i) we establish the differentiability of the Whittle index policy to support decision-focused learning; (ii) we significantly improve the scalability of previous decision-focused learning approaches in sequential problems; (iii) we apply our algorithm to the service call scheduling problem on a real-world maternal and child health domain. Our algorithm is the first for decision-focused learning in RMAB that scales to large-scale real-world problems. \end{abstract}  ( 2 min )
    Gradient-free optimization of chaotic acoustics with reservoir computing. (arXiv:2106.09780v2 [physics.flu-dyn] UPDATED)
    We develop a versatile optimization method, which finds the design parameters that minimize time-averaged acoustic cost functionals. The method is gradient-free, model-informed, and data-driven with reservoir computing based on echo state networks. First, we analyse the predictive capabilities of echo state networks both in the short- and long-time prediction of the dynamics. We find that both fully data-driven and model-informed architectures learn the chaotic acoustic dynamics, both time-accurately and statistically. Informing the training with a physical reduced-order model with one acoustic mode markedly improves the accuracy and robustness of the echo state networks, whilst keeping the computational cost low. Echo state networks offer accurate predictions of the long-time dynamics, which would be otherwise expensive by integrating the governing equations to evaluate the time-averaged quantity to optimize. Second, we couple echo state networks with a Bayesian technique to explore the design thermoacoustic parameter space. The computational method is minimally intrusive. Third, we find the set of flame parameters that minimize the time-averaged acoustic energy of chaotic oscillations, which are caused by the positive feedback with a heat source, such as a flame in gas turbines or rocket motors. These oscillations are known as thermoacoustic oscillations. The optimal set of flame parameters is found with the same accuracy as brute-force grid search, but with a convergence rate that is more than one order of magnitude faster. This work opens up new possibilities for non-intrusive ("hands-off") optimization of chaotic systems, in which the cost of generating data, for example from high-fidelity simulations and experiments, is high.  ( 2 min )
    Co-training Improves Prompt-based Learning for Large Language Models. (arXiv:2202.00828v1 [cs.CL])
    We demonstrate that co-training (Blum & Mitchell, 1998) can improve the performance of prompt-based learning by using unlabeled data. While prompting has emerged as a promising paradigm for few-shot and zero-shot learning, it is often brittle and requires much larger models compared to the standard supervised setup. We find that co-training makes it possible to improve the original prompt model and at the same time learn a smaller, downstream task-specific model. In the case where we only have partial access to a prompt model (e.g., output probabilities from GPT-3 (Brown et al., 2020)) we learn a calibration model over the prompt outputs. When we have full access to the prompt model's gradients but full finetuning remains prohibitively expensive (e.g., T0 (Sanh et al., 2021)), we learn a set of soft prompt continuous vectors to iteratively update the prompt model. We find that models trained in this manner can significantly improve performance on challenging datasets where there is currently a large gap between prompt-based learning and fully-supervised models.  ( 2 min )
    Matching Learned Causal Effects of Neural Networks with Domain Priors. (arXiv:2111.12490v2 [cs.LG] UPDATED)
    A trained neural network can be interpreted as a structural causal model (SCM) that provides the effect of changing input variables on the model's output. However, if training data contains both causal and correlational relationships, a model that optimizes prediction accuracy may not necessarily learn the true causal relationships between input and output variables. On the other hand, expert users often have prior knowledge of the causal relationship between certain input variables and output from domain knowledge. Therefore, we propose a regularization method that aligns the learned causal effects of a neural network with domain priors, including both direct and total causal effects. We show that this approach can generalize to different kinds of domain priors, including monotonicity of causal effect of an input variable on output or zero causal effect of a variable on output for purposes of fairness. Our experiments on twelve benchmark datasets show its utility in regularizing a neural network model to maintain desired causal effects, without compromising on accuracy. Importantly, we also show that a model thus trained is robust and gets improved accuracy on noisy inputs.  ( 2 min )
    A comparison of approaches to improve worst-case predictive model performance over patient subpopulations. (arXiv:2108.12250v2 [stat.ML] UPDATED)
    Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally robust optimization (DRO), attempt to address this problem without introducing additional harms. We conduct a large-scale empirical study of DRO and several variations of standard learning procedures to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations compared to standard approaches for learning predictive models from electronic health records data. In the course of our evaluation, we introduce an extension to DRO approaches that allows for specification of the metric used to assess worst-case performance. We conduct the analysis for models that predict in-hospital mortality, prolonged length of stay, and 30-day readmission for inpatient admissions, and predict in-hospital mortality using intensive care data. We find that, with relatively few exceptions, no approach performs better, for each patient subpopulation examined, than standard learning procedures using the entire training dataset. These results imply that when it is of interest to improve model performance for patient subpopulations beyond what can be achieved with standard practices, it may be necessary to do so via data collection techniques that increase the effective sample size or reduce the level of noise in the prediction problem.
    Relational Graph Neural Network Design via Progressive Neural Architecture Search. (arXiv:2105.14490v3 [cs.LG] UPDATED)
    We propose a novel solution to addressing a long-standing dilemma in the representation learning of graph neural networks (GNNs): how to effectively capture and represent useful information embedded in long-distance nodes to improve the performance of nodes with low homophily without leading to performance degradation in nodes with high homophily. This dilemma limits the generalization capability of existing GNNs. Intuitively, interactions with distant nodes introduce more noise for a node than those with close neighbors. However, in most existing works, messages being passed among nodes are mingled together, which is inefficient from a communication perspective. Our solution is based on a novel, simple, yet effective aggregation scheme, resulting in a ladder-style GNN architecture, namely LADDER-GNN. Specifically, we separate messages from different hops, assign different dimensions for them, and then concatenate them to obtain node representations. Such disentangled representations facilitate improving the information-to-noise ratio of messages passed from different hops. To explore an effective hop-dimension relationship, we develop a conditionally progressive neural architecture search strategy. Based on the searching results, we further propose an efficient approximate hop-dimension relation function to facilitate the rapid configuration of the proposed LADDER-GNN. We verify the proposed LADDER-GNN on seven diverse semi-supervised node classification datasets. Experimental results show that our solution achieves better performance than most existing GNNs. We further analyze our aggregation scheme with two commonly used GNN architectures, and the results corroborate that our scheme outperforms existing schemes in classifying low homophily nodes by a large margin.  ( 2 min )
    Interpreting Distributional Reinforcement Learning: Regularization and Optimization Perspectives. (arXiv:2110.03155v2 [cs.LG] UPDATED)
    Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the whole distribution of the total return rather than only its expectation. Despite the remarkable performance of distributional RL, a theoretical understanding of its advantages over expectation-based RL remains elusive. In this paper, we illuminate the superiority of distributional RL from both regularization and optimization perspectives. Firstly, by applying an expectation decomposition, the additional impact of distributional RL compared with expectation-based RL is interpreted as a \textit{risk-aware entropy regularization} in the \textit{neural Z-fitted iteration} framework. We also provide a rigorous comparison between the resulting entropy regularization and the vanilla one in maximum entropy RL. Through the lens of optimization, we shed light on the stability-promoting distributional loss with desirable smoothness properties in distributional RL. Moreover, the acceleration effect of distributional RL owing to the risk-aware entropy regularization is also provided. Finally, rigorous experiments reveal the different regularization effects as well as the mutual impact of vanilla entropy and risk-aware entropy regularization in distributional RL, focusing specifically on actor-critic algorithms. We also empirically verify that the distributional RL algorithm enjoys a more stable gradient behavior, contributing to its stable optimization and acceleration effect as opposed to classical RL. Our research paves a way towards better interpreting the superiority of distributional RL algorithms.  ( 2 min )
    Drug repurposing for COVID-19 using graph neural network and harmonizing multiple evidence. (arXiv:2009.10931v3 [q-bio.QM] UPDATED)
    Amid the pandemic of 2019 novel coronavirus disease (COVID-19) infected by SARS-CoV-2, a vast amount of drug research for prevention and treatment has been quickly conducted, but these efforts have been unsuccessful thus far. Our objective is to prioritize repurposable drugs using a drug repurposing pipeline that systematically integrates multiple SARS-CoV-2 and drug interactions, deep graph neural networks, and in-vitro/population-based validations. We first collected all the available drugs (n= 3,635) involved in COVID-19 patient treatment through CTDbase. We built a SARS-CoV-2 knowledge graph based on the interactions among virus baits, host genes, pathways, drugs, and phenotypes. A deep graph neural network approach was used to derive the candidate representation based on the biological interactions. We prioritized the candidate drugs using clinical trial history, and then validated them with their genetic profiles, in vitro experimental efficacy, and electronic health records. We highlight the top 22 drugs including Azithromycin, Atorvastatin, Aspirin, Acetaminophen, and Albuterol. We further pinpointed drug combinations that may synergistically target COVID-19. In summary, we demonstrated that the integration of extensive interactions, deep neural networks, and rigorous validation can facilitate the rapid identification of candidate drugs for COVID-19 treatment. This is a post-peer-review, pre-copyedit version of an article published in Scientific Reports The final authenticated version is available online at: https://www.nature.com/articles/s41598-021-02353-5  ( 2 min )
    Active Multi-Task Representation Learning. (arXiv:2202.00911v1 [cs.LG])
    To leverage the power of big data from source tasks and overcome the scarcity of the target task samples, representation learning based on multi-task pretraining has become a standard approach in many applications. However, up until now, choosing which source tasks to include in the multi-task learning has been more art than science. In this paper, we give the first formal study on resource task sampling by leveraging the techniques from active learning. We propose an algorithm that iteratively estimates the relevance of each source task to the target task and samples from each source task based on the estimated relevance. Theoretically, we show that for the linear representation class, to achieve the same error rate, our algorithm can save up to a \textit{number of source tasks} factor in the source task sample complexity, compared with the naive uniform sampling from all source tasks. We also provide experiments on real-world computer vision datasets to illustrate the effectiveness of our proposed method on both linear and convolutional neural network representation classes. We believe our paper serves as an important initial step to bring techniques from active learning to representation learning.  ( 2 min )
    Spectral Flow on the Manifold of SPD Matrices for Multimodal Data Processing. (arXiv:2009.08062v2 [cs.LG] UPDATED)
    In this paper, we consider data acquired by multimodal sensors capturing complementary aspects and features of a measured phenomenon. We focus on a scenario in which the measurements share mutual sources of variability but might also be contaminated by other measurement-specific sources such as interferences or noise. Our approach combines manifold learning, which is a class of nonlinear data-driven dimension reduction methods, with the well-known Riemannian geometry of symmetric and positive-definite (SPD) matrices. Manifold learning typically includes the spectral analysis of a kernel built from the measurements. Here, we take a different approach, utilizing the Riemannian geometry of the kernels. In particular, we study the way the spectrum of the kernels changes along geodesic paths on the manifold of SPD matrices. We show that this change enables us, in a purely unsupervised manner, to derive a compact, yet informative, description of the relations between the measurements, in terms of their underlying components. Based on this result, we present new algorithms for extracting the common latent components and for identifying common and measurement-specific components.  ( 2 min )
    EEGminer: Discovering Interpretable Features of Brain Activity with Learnable Filters. (arXiv:2110.10009v2 [cs.LG] UPDATED)
    Patterns of brain activity are associated with different brain processes and can be used to identify different brain states and make behavioral predictions. However, the relevant features are not readily apparent and accessible. To mine informative latent representations from multichannel recordings of ongoing EEG activity, we propose a novel differentiable decoding pipeline consisting of learnable filters and a pre-determined feature extraction module. Specifically, we introduce filters parameterized by generalized Gaussian functions that offer a smooth derivative for stable end-to-end model training and allow for learning interpretable features. For the feature module, we use signal magnitude and functional connectivity estimates. We demonstrate the utility of our model towards emotion recognition from EEG signals on the SEED dataset, as well as on a new EEG dataset of unprecedented size (i.e., 761 subjects), where we identify consistent trends of music perception and related individual differences. The discovered features align with previous neuroscience studies and offer new insights, such as marked differences in the functional connectivity profile between left and right temporal areas during music listening. This agrees with the respective specialisation of the temporal lobes regarding music perception proposed in the literature.  ( 2 min )
    Error Correction in ASR using Sequence-to-Sequence Models. (arXiv:2202.01157v1 [cs.CL])
    Post-editing in Automatic Speech Recognition (ASR) entails automatically correcting common and systematic errors produced by the ASR system. The outputs of an ASR system are largely prone to phonetic and spelling errors. In this paper, we propose to use a powerful pre-trained sequence-to-sequence model, BART, further adaptively trained to serve as a denoising model, to correct errors of such types. The adaptive training is performed on an augmented dataset obtained by synthetically inducing errors as well as by incorporating actual errors from an existing ASR system. We also propose a simple approach to rescore the outputs using word level alignments. Experimental results on accented speech data demonstrate that our strategy effectively rectifies a significant number of ASR errors and produces improved WER results when compared against a competitive baseline.  ( 2 min )
    Can I use this publicly available dataset to build commercial AI software? -- A Case Study on Publicly Available Image Datasets. (arXiv:2111.02374v3 [cs.LG] UPDATED)
    Publicly available datasets are one of the key drivers for commercial AI software. The use of publicly available datasets is governed by dataset licenses. These dataset licenses outline the rights one is entitled to on a given dataset and the obligations that one must fulfil to enjoy such rights without any license compliance violations. Unlike standardized Open Source Software (OSS) licenses, existing dataset licenses are defined in an ad-hoc manner and do not clearly outline the rights and obligations associated with their usage. Further, a public dataset may be hosted in multiple locations and created from multiple data sources each of which may have different licenses. Hence, existing approaches on checking OSS license compliance cannot be used. In this paper, we propose a new approach to assessing the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software. We conduct a case study with our approach on 6 commonly used publicly available image datasets. Our results show that there exists potential risks of license violations associated with 5 of these 6 studied datasets if they were used for commercial purposes.  ( 2 min )
    MD-GAN with multi-particle input: the machine learning of long-time molecular behavior from short-time MD data. (arXiv:2202.00995v1 [physics.chem-ph])
    MD-GAN is a machine learning-based method that can evolve part of the system at any time step, accelerating the generation of molecular dynamics data. For the accurate prediction of MD-GAN, sufficient information on the dynamics of a part of the system should be included with the training data. Therefore, the selection of the part of the system is important for efficient learning. In a previous study, only one particle (or vector) of each molecule was extracted as part of the system. Therefore, we investigated the effectiveness of adding information from other particles to the learning process. In the experiment of the polyethylene system, when the dynamics of three particles of each molecule were used, the diffusion was successfully predicted using one-third of the time length of the training data, compared to the single-particle input. Surprisingly, the unobserved transition of diffusion in the training data was also predicted using this method.  ( 2 min )
    DCSAU-Net: A Deeper and More Compact Split-Attention U-Net for Medical Image Segmentation. (arXiv:2202.00972v1 [eess.IV])
    Image segmentation is a key step for medical image analysis. Approaches based on deep neural networks have been introduced and performed more reliable results than traditional image processing methods. However, many models focus on one medical image application and still show limited abilities to work with complex images. In this paper, we propose a novel deeper and more compact split-attention u-shape network (DCSAU-Net) that extracts useful features using multi-scale combined split-attention and deeper depthwise convolution. We evaluate the proposed model on CVC-ClinicDB, 2018 Data Science Bowl, ISIC-2018 and SegPC-2021 datasets. As a result, DCSAU-Net displays better performance than other state-of-the-art (SOTA) methods in terms of the mean Intersection over Union (mIoU) and F1-socre. More significantly, the proposed model demonstrate better segmentation performance on challenging images.  ( 2 min )
    Quantitative analysis of phase transitions in two-dimensional XY models using persistent homology. (arXiv:2109.10960v2 [cond-mat.stat-mech] UPDATED)
    We use persistent homology and persistence images as an observable of three different variants of the two-dimensional XY model in order to identify and study their phase transitions. We examine models with the classical XY action, a topological lattice action, and an action with an additional nematic term. In particular, we introduce a new way of computing the persistent homology of lattice spin model configurations and, by considering the fluctuations in the output of logistic regression and k-nearest neighbours models trained on persistence images, we develop a methodology to extract estimates of the critical temperature and the critical exponent of the correlation length. We put particular emphasis on finite-size scaling behaviour and producing estimates with quantifiable error. For each model we successfully identify its phase transition(s) and are able to get an accurate determination of the critical temperatures and critical exponents of the correlation length.  ( 2 min )
    Robust Training of Neural Networks using Scale Invariant Architectures. (arXiv:2202.00980v1 [cs.LG])
    In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture and make it scale invariant, i.e. the scale of parameter doesn't affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm proportional to weight norm multiplied by $\sqrt{\tfrac{2\lambda}{\eta}}$, where $\eta$ is learning rate and $\lambda$ is weight decay. We show that this general approach is robust to rescaling of parameter and loss by proving that its convergence only depends logarithmically on the scale of initialization and loss, whereas the standard SGD might not even converge for many initializations. Following our recipe, we design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam on downstream tasks.  ( 2 min )
    Asynchronous Decentralized Learning over Unreliable Wireless Networks. (arXiv:2202.00955v1 [cs.IT])
    Decentralized learning enables edge users to collaboratively train models by exchanging information via device-to-device communication, yet prior works have been limited to wireless networks with fixed topologies and reliable workers. In this work, we propose an asynchronous decentralized stochastic gradient descent (DSGD) algorithm, which is robust to the inherent computation and communication failures occurring at the wireless network edge. We theoretically analyze its performance and establish a non-asymptotic convergence guarantee. Experimental results corroborate our analysis, demonstrating the benefits of asynchronicity and outdated gradient information reuse in decentralized learning over unreliable wireless networks.  ( 2 min )
    Active Audio-Visual Separation of Dynamic Sound Sources. (arXiv:2202.00850v1 [cs.CV])
    We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple time-varying audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, improving its own estimates for past timesteps via self-attention. Using highly realistic acoustic SoundSpaces simulations in real-world scanned Matterport3D environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a time-varying audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.  ( 2 min )
    Invariant Ancestry Search. (arXiv:2202.00913v1 [stat.ME])
    Recently, methods have been proposed that exploit the invariance of prediction models with respect to changing environments to infer subsets of the causal parents of a response variable. If the environments influence only few of the underlying mechanisms, the subset identified by invariant causal prediction, for example, may be small, or even empty. We introduce the concept of minimal invariance and propose invariant ancestry search (IAS). In its population version, IAS outputs a set which contains only ancestors of the response and is a superset of the output of ICP. When applied to data, corresponding guarantees hold asymptotically if the underlying test for invariance has asymptotic level and power. We develop scalable algorithms and perform experiments on simulated and real data.  ( 2 min )
    3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation. (arXiv:2202.00998v1 [cs.LG])
    We propose and study a new class of gradient communication mechanisms for communication-efficient training -- three point compressors (3PC) -- as well as efficient distributed nonconvex optimization algorithms that can take advantage of them. Unlike most established approaches, which rely on a static compressor choice (e.g., Top-$K$), our class allows the compressors to {\em evolve} throughout the training process, with the aim of improving the theoretical communication complexity and practical efficiency of the underlying methods. We show that our general approach can recover the recently proposed state-of-the-art error feedback mechanism EF21 (Richt\'arik et al., 2021) and its theoretical properties as a special case, but also leads to a number of new efficient methods. Notably, our approach allows us to improve upon the state of the art in the algorithmic and theoretical foundations of the {\em lazy aggregation} literature (Chen et al., 2018). As a by-product that may be of independent interest, we provide a new and fundamental link between the lazy aggregation and error feedback literature. A special feature of our work is that we do not require the compressors to be unbiased.
    N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting. (arXiv:2201.12886v2 [cs.LG] UPDATED)
    Recent progress in neural forecasting accelerated improvements in the performance of large-scale forecasting systems. Yet, long-horizon forecasting remains a very difficult task. Two common challenges afflicting long-horizon forecasting are the volatility of the predictions and their computational complexity. In this paper, we introduce N-HiTS, a model which addresses both challenges by incorporating novel hierarchical interpolation and multi-rate data sampling techniques. These techniques enable the proposed method to assemble its predictions sequentially, selectively emphasizing components with different frequencies and scales, while decomposing the input signal and synthesizing the forecast. We conduct an extensive empirical evaluation demonstrating the advantages of N-HiTS over the state-of-the-art long-horizon forecasting methods. On an array of multivariate forecasting tasks, the proposed method provides an average accuracy improvement of 25% over the latest Transformer architectures while reducing the computation time by an order of magnitude. Our code is available at https://github.com/cchallu/n-hits.
    A Longitudinal Dataset of Twitter ISIS Users. (arXiv:2202.00878v1 [cs.SI])
    We present a large longitudinal dataset of tweets from two sets of users that are suspected to be affiliated with ISIS. These sets of users are identified based on a prior study and a campaign aimed at shutting down ISIS Twitter accounts. These users have engaged with known ISIS accounts at least once during 2014-2015 and are still active as of 2021. Some of them have directly supported the ISIS users and their tweets by retweeting them, and some of the users that have quoted tweets of ISIS, have uncertain connections to ISIS seed accounts. This study and the dataset represent a unique approach to analyzing ISIS data. Although much research exists on ISIS online activities, few studies have focused on individual accounts. Our approach to validating accounts as well as developing a framework for differentiating accounts' functionality (e.g., propaganda versus operational planning) offers a foundation for future research. We perform some descriptive statistics and preliminary analyses on our collected data to provide deeper insight and highlight the significance and practicality of such analyses. We further discuss several cross-disciplinary potential use cases and research directions.
    TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic Music. (arXiv:2202.00951v1 [eess.AS])
    Singing melody extraction is an important problem in the field of music information retrieval. Existing methods typically rely on frequency-domain representations to estimate the sung frequencies. However, this design does not lead to human-level performance in the perception of melody information for both tone (pitch-class) and octave. In this paper, we propose TONet, a plug-and-play model that improves both tone and octave perceptions by leveraging a novel input representation and a novel network architecture. First, we present an improved input representation, the Tone-CFP, that explicitly groups harmonics via a rearrangement of frequency-bins. Second, we introduce an encoder-decoder architecture that is designed to obtain a salience feature map, a tone feature map, and an octave feature map. Third, we propose a tone-octave fusion mechanism to improve the final salience feature map. Experiments are done to verify the capability of TONet with various baseline backbone models. Our results show that tone-octave fusion with Tone-CFP can significantly improve the singing voice extraction performance across various datasets -- with substantial gains in octave and tone accuracy.
    Algorithms for Efficiently Learning Low-Rank Neural Networks. (arXiv:2202.00834v1 [cs.LG])
    We study algorithms for learning low-rank neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. First, we present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network up to additive error $\epsilon$ with probability $\ge 1 - \delta$, given access to noiseless samples with Gaussian marginals in polynomial time and samples. Thus, we provide the first example of an algorithm which can efficiently learn a neural network up to additive error without assuming the ground truth is realizable. To solve this problem, we introduce an efficient SVD-based \textit{Nonlinear Kernel Projection} algorithm for solving a nonlinear low-rank approximation problem over Gaussian space. Inspired by the efficiency of our algorithm, we propose a novel low-rank initialization framework for training low-rank \textit{deep} networks, and prove that for ReLU networks, the gap between our method and existing schemes widens as the desired rank of the approximating weights decreases, or as the dimension of the inputs increases (the latter point holds when network width is superlinear in dimension). Finally, we validate our theory by training ResNets and EfficientNets \citep{he2016deepresidual, tan2019efficientnet} models on ImageNet \citep{ILSVRC15}.
    Compiler-Driven Simulation of Reconfigurable Hardware Accelerators. (arXiv:2202.00739v1 [cs.PL])
    As customized accelerator design has become increasingly popular to keep up with the demand for high performance computing, it poses challenges for modern simulator design to adapt to such a large variety of accelerators. Existing simulators tend to two extremes: low-level and general approaches, such as RTL simulation, that can model any hardware but require substantial effort and long execution times; and higher-level application-specific models that can be much faster and easier to use but require one-off engineering effort. This work proposes a compiler-driven simulation workflow that can model configurable hardware accelerator. The key idea is to separate structure representation from simulation by developing an intermediate language that can flexibly represent a wide variety of hardware constructs. We design the Event Queue (EQueue) dialect of MLIR, a dialect that can model arbitrary hardware accelerators with explicit data movement and distributed event-based control; we also implement a generic simulation engine to model EQueue programs with hybrid MLIR dialects representing different abstraction levels. We demonstrate two case studies of EQueue-implemented accelerators: the systolic array of convolution and SIMD processors in a modern FPGA. In the former we show EQueue simulation is as accurate as a state-of-the-art simulator, while offering higher extensibility and lower iteration cost via compiler passes. In the latter we demonstrate our simulation flow can guide designer efficiently improve their design using visualizable simulation outputs.
    Certifiable Robustness to Adversarial State Uncertainty in Deep Reinforcement Learning. (arXiv:2004.06496v6 [cs.LG] UPDATED)
    Deep Neural Network-based systems are now the state-of-the-art in many robotics tasks, but their application in safety-critical domains remains dangerous without formal guarantees on network robustness. Small perturbations to sensor inputs (from noise or adversarial examples) are often enough to change network-based decisions, which was recently shown to cause an autonomous vehicle to swerve into another lane. In light of these dangers, numerous algorithms have been developed as defensive mechanisms from these adversarial inputs, some of which provide formal robustness guarantees or certificates. This work leverages research on certified adversarial robustness to develop an online certifiably robust for deep reinforcement learning algorithms. The proposed defense computes guaranteed lower bounds on state-action values during execution to identify and choose a robust action under a worst-case deviation in input space due to possible adversaries or noise. Moreover, the resulting policy comes with a certificate of solution quality, even though the true state and optimal action are unknown to the certifier due to the perturbations. The approach is demonstrated on a Deep Q-Network policy and is shown to increase robustness to noise and adversaries in pedestrian collision avoidance scenarios and a classic control task. This work extends one of our prior works with new performance guarantees, extensions to other RL algorithms, expanded results aggregated across more scenarios, an extension into scenarios with adversarial behavior, comparisons with a more computationally expensive method, and visualizations that provide intuition about the robustness algorithm.
    AdaAnn: Adaptive Annealing Scheduler for Probability Density Approximation. (arXiv:2202.00792v1 [stat.CO])
    Approximating probability distributions can be a challenging task, particularly when they are supported over regions of high geometrical complexity or exhibit multiple modes. Annealing can be used to facilitate this task which is often combined with constant a priori selected increments in inverse temperature. However, using constant increments limit the computational efficiency due to the inability to adapt to situations where smooth changes in the annealed density could be handled equally well with larger increments. We introduce AdaAnn, an adaptive annealing scheduler that automatically adjusts the temperature increments based on the expected change in the Kullback-Leibler divergence between two distributions with a sufficiently close annealing temperature. AdaAnn is easy to implement and can be integrated into existing sampling approaches such as normalizing flows for variational inference and Markov chain Monte Carlo. We demonstrate the computational efficiency of the AdaAnn scheduler for variational inference with normalizing flows on a number of examples, including density approximation and parameter estimation for dynamical systems.
    Investigating Transfer Learning in Graph Neural Networks. (arXiv:2202.00740v1 [cs.LG])
    Graph neural networks (GNNs) build on the success of deep learning models by extending them for use in graph spaces. Transfer learning has proven extremely successful for traditional deep learning problems: resulting in faster training and improved performance. Despite the increasing interest in GNNs and their use cases, there is little research on their transferability. This research demonstrates that transfer learning is effective with GNNs, and describes how source tasks and the choice of GNN impact the ability to learn generalisable knowledge. We perform experiments using real-world and synthetic data within the contexts of node classification and graph classification. To this end, we also provide a general methodology for transfer learning experimentation and present a novel algorithm for generating synthetic graph classification tasks. We compare the performance of GCN, GraphSAGE and GIN across both the synthetic and real-world datasets. Our results demonstrate empirically that GNNs with inductive operations yield statistically significantly improved transfer. Further we show that similarity in community structure between source and target tasks support statistically significant improvements in transfer over and above the use of only the node attributes.
    Fairness of Machine Learning Algorithms in Demography. (arXiv:2202.01013v1 [cs.LG])
    The paper is devoted to the study of the model fairness and process fairness of the Russian demographic dataset by making predictions of divorce of the 1st marriage, religiosity, 1st employment and completion of education. Our goal was to make classifiers more equitable by reducing their reliance on sensitive features while increasing or at least maintaining their accuracy. We took inspiration from "dropout" techniques in neural-based approaches and suggested a model that uses "feature drop-out" to address process fairness. To evaluate a classifier's fairness and decide the sensitive features to eliminate, we used "LIME Explanations". This results in a pool of classifiers due to feature dropout whose ensemble has been shown to be less reliant on sensitive features and to have improved or no effect on accuracy. Our empirical study was performed on four families of classifiers (Logistic Regression, Random Forest, Bagging, and Adaboost) and carried out on real-life dataset (Russian demographic data derived from Generations and Gender Survey), and it showed that all of the models became less dependent on sensitive features (such as gender, breakup of the 1st partnership, 1st partnership, etc.) and showed improvements or no impact on accuracy
    Accelerating DNN Training with Structured Data Gradient Pruning. (arXiv:2202.00774v1 [cs.LG])
    Weight pruning is a technique to make Deep Neural Network (DNN) inference more computationally efficient by reducing the number of model parameters over the course of training. However, most weight pruning techniques generally does not speed up DNN training and can even require more iterations to reach model convergence. In this work, we propose a novel Structured Data Gradient Pruning (SDGP) method that can speed up training without impacting model convergence. This approach enforces a specific sparsity structure, where only N out of every M elements in a matrix can be nonzero, making it amenable to hardware acceleration. Modern accelerators such as the Nvidia A100 GPU support this type of structured sparsity for 2 nonzeros per 4 elements in a reduction. Assuming hardware support for 2:4 sparsity, our approach can achieve a 15-25\% reduction in total training time without significant impact to performance. Source code and pre-trained models are available at \url{https://github.com/BradMcDanel/sdgp}.
    Mars Terrain Segmentation with Less Labels. (arXiv:2202.00791v1 [cs.CV])
    Planetary rover systems need to perform terrain segmentation to identify drivable areas as well as identify specific types of soil for sample collection. The latest Martian terrain segmentation methods rely on supervised learning which is very data hungry and difficult to train where only a small number of labeled samples are available. Moreover, the semantic classes are defined differently for different applications (e.g., rover traversal vs. geological) and as a result the network has to be trained from scratch each time, which is an inefficient use of resources. This research proposes a semi-supervised learning framework for Mars terrain segmentation where a deep segmentation network trained in an unsupervised manner on unlabeled images is transferred to the task of terrain segmentation trained on few labeled images. The network incorporates a backbone module which is trained using a contrastive loss function and an output atrous convolution module which is trained using a pixel-wise cross-entropy loss function. Evaluation results using the metric of segmentation accuracy show that the proposed method with contrastive pretraining outperforms plain supervised learning by 2%-10%. Moreover, the proposed model is able to achieve a segmentation accuracy of 91.1% using only 161 training images (1% of the original dataset) compared to 81.9% with plain supervised learning.
    Identifying Suitable Tasks for Inductive Transfer Through the Analysis of Feature Attributions. (arXiv:2202.01096v1 [cs.LG])
    Transfer learning approaches have shown to significantly improve performance on downstream tasks. However, it is common for prior works to only report where transfer learning was beneficial, ignoring the significant trial-and-error required to find effective settings for transfer. Indeed, not all task combinations lead to performance benefits, and brute-force searching rapidly becomes computationally infeasible. Hence the question arises, can we predict whether transfer between two tasks will be beneficial without actually performing the experiment? In this paper, we leverage explainability techniques to effectively predict whether task pairs will be complementary, through comparison of neural network activation between single-task models. In this way, we can avoid grid-searches over all task and hyperparameter combinations, dramatically reducing the time needed to find effective task pairs. Our results show that, through this approach, it is possible to reduce training time by up to 83.5% at a cost of only 0.034 reduction in positive-class F1 on the TREC-IS 2020-A dataset.
    What Happens after SGD Reaches Zero Loss? --A Mathematical Framework. (arXiv:2110.06914v2 [cs.LG] UPDATED)
    Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, $\mathrm{tr}[\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization for the regularization effect of SGD around such manifold -- i.e., the "implicit bias" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps and (2) allowing arbitrary noise covariance. As an application, we show with arbitrary large initialization, label noise SGD can always escape the kernel regime and only requires $O(\kappa\ln d)$ samples for learning an $\kappa$-sparse overparametrized linear model in $\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\Omega(d)$ samples. This upper bound is minimax optimal and improves the previous $\tilde{O}(\kappa^2)$ upper bound (HaoChen et al., 2020).
    ColloSSL: Collaborative Self-Supervised Learning for Human Activity Recognition. (arXiv:2202.00758v1 [cs.LG])
    A major bottleneck in training robust Human-Activity Recognition models (HAR) is the need for large-scale labeled sensor datasets. Because labeling large amounts of sensor data is an expensive task, unsupervised and semi-supervised learning techniques have emerged that can learn good features from the data without requiring any labels. In this paper, we extend this line of research and present a novel technique called Collaborative Self-Supervised Learning (ColloSSL) which leverages unlabeled data collected from multiple devices worn by a user to learn high-quality features of the data. A key insight that underpins the design of ColloSSL is that unlabeled sensor datasets simultaneously captured by multiple devices can be viewed as natural transformations of each other, and leveraged to generate a supervisory signal for representation learning. We present three technical innovations to extend conventional self-supervised learning algorithms to a multi-device setting: a Device Selection approach which selects positive and negative devices to enable contrastive learning, a Contrastive Sampling algorithm which samples positive and negative examples in a multi-device setting, and a loss function called Multi-view Contrastive Loss which extends standard contrastive loss to a multi-device setting. Our experimental results on three multi-device datasets show that ColloSSL outperforms both fully-supervised and semi-supervised learning techniques in majority of the experiment settings, resulting in an absolute increase of upto 7.9% in F_1 score compared to the best performing baselines. We also show that ColloSSL outperforms the fully-supervised methods in a low-data regime, by just using one-tenth of the available labeled data in the best case.
    Automatic event detection in football using tracking data. (arXiv:2202.00804v1 [cs.LG])
    One of the main shortcomings of event data in football, which has been extensively used for analytics in the recent years, is that it still requires manual collection, thus limiting its availability to a reduced number of tournaments. In this work, we propose a computational framework to automatically extract football events using tracking data, namely the coordinates of all players and the ball. Our approach consists of two models: (1) the possession model evaluates which player was in possession of the ball at each time, as well as the distinct player configurations in the time intervals where the ball is not in play; (2) the event detection model relies on the changes in ball possession to determine in-game events, namely passes, shots, crosses, saves, receptions and interceptions, as well as set pieces. First, analyze the accuracy of tracking data for determining ball possession, as well as the accuracy of the time annotations for the manually collected events. Then, we benchmark the auto-detected events with a dataset of manually annotated events to show that in most categories the proposed method achieves $+90\%$ detection rate. Lastly, we demonstrate how the contextual information offered by tracking data can be leveraged to increase the granularity of auto-detected events, and exhibit how the proposed framework may be used to conduct a myriad of data analyses in football.
    Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods. (arXiv:2202.00858v1 [cs.LG])
    Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors. The amount of shrinkage is controlled by a single regularization parameter and the number of data points in each ancestor. Since HS is a post-hoc method, it is extremely fast, compatible with any tree growing algorithm, and can be used synergistically with other regularization techniques. Extensive experiments over a wide variety of real-world datasets show that HS substantially increases the predictive performance of decision trees, even when used in conjunction with other regularization techniques. Moreover, we find that applying HS to each tree in an RF often improves accuracy, as well as its interpretability by simplifying and stabilizing its decision boundaries and SHAP values. We further explain the success of HS in improving prediction performance by showing its equivalence to ridge regression on a (supervised) basis constructed of decision stumps associated with the internal nodes of a tree. All code and models are released in a full-fledged package available on Github (github.com/csinva/imodels)
    A Patch-based Image Denoising Method Using Eigenvectors of the Geodesics' Gramian Matrix. (arXiv:2010.07769v2 [eess.IV] UPDATED)
    With the proliferation of sophisticated cameras in modern society, the demand for accurate and visually pleasing images is increasing. However, the quality of an image captured by a camera may be degraded by noise. Thus, some processing of images is required to filter out the noise without losing vital image features. Even though the current literature offers a variety of denoising methods, the fidelity and efficacy of their denoising are sometimes uncertain. Thus, here we propose a novel and computationally efficient image denoising method that is capable of producing accurate images. To preserve image smoothness, this method inputs patches partitioned from the image rather than pixels. Then, it performs denoising on the manifold underlying the patch-space rather than that in the image domain to better preserve the features across the whole image. We validate the performance of this method against benchmark image processing methods.
    Specifying and Interpreting Reinforcement Learning Policies through Simulatable Machine Learning. (arXiv:2101.07140v3 [cs.LG] UPDATED)
    Human-AI collaborative policy synthesis is a procedure in which (1) a human initializes an autonomous agent's behavior, (2) Reinforcement Learning improves the human specified behavior, and (3) the agent can explain the final optimized policy to the user. This paradigm leverages human expertise and facilitates a greater insight into the learned behaviors of an agent. Existing approaches to enabling collaborative policy specification involve black box methods which are unintelligible and are not catered towards non-expert end-users. In this paper, we develop a novel collaborative framework to enable humans to initialize and interpret an autonomous agent's behavior, rooted in principles of human-centered design. Through our framework, we enable humans to specify an initial behavior model in the form of unstructured, natural language, which we then convert to lexical decision trees. Next, we are able to leverage these human-specified policies, to warm-start reinforcement learning and further allow the agent to optimize the policies through reinforcement learning. Finally, to close the loop on human-specification, we produce explanations of the final learned policy, in multiple modalities, to provide the user a final depiction about the learned policy of the agent. We validate our approach by showing that our model can produce >80% accuracy, and that human-initialized policies are able to successfully warm-start RL. We then conduct a novel human-subjects study quantifying the relative subjective and objective benefits of varying XAI modalities(e.g., Tree, Language, and Program) for explaining learned policies to end-users, in terms of usability and interpretability and identify the circumstances that influence these measures. Our findings emphasize the need for personalized explainable systems that can facilitate user-centric policy explanations for a variety of end-users.
    Distributional Reinforcement Learning via Sinkhorn Iterations. (arXiv:2202.00769v1 [cs.LG])
    Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the whole distribution of the total return rather than only its expectation. The representation manner of each return distribution and the choice of distribution divergence are pivotal for the empirical success of distributional RL. In this paper, we propose a new class of \textit{Sinkhorn distributional RL} algorithm that learns a finite set of statistics, i.e., deterministic samples, from each return distribution and then leverages Sinkhorn iterations to evaluate the Sinkhorn distance between the current and target Bellmen distributions. Remarkably, as Sinkhorn divergence interpolates between the Wasserstein distance and Maximum Mean Discrepancy~(MMD). This allows our proposed Sinkhorn distributional RL algorithms to find a sweet spot leveraging the geometry of optimal transport-based distance, and the unbiased gradient estimates of MMD. Finally, experiments on a suite of Atari games reveal the competitive performance of Sinkhorn distributional RL algorithm as opposed to existing state-of-the-art algorithms.
    Personalized Federated Learning via Convex Clustering. (arXiv:2202.00718v1 [cs.LG])
    We propose a parametric family of algorithms for personalized federated learning with locally convex user costs. The proposed framework is based on a generalization of convex clustering in which the differences between different users' models are penalized via a sum-of-norms penalty, weighted by a penalty parameter $\lambda$. The proposed approach enables "automatic" model clustering, without prior knowledge of the hidden cluster structure, nor the number of clusters. Analytical bounds on the weight parameter, that lead to simultaneous personalization, generalization and automatic model clustering are provided. The solution to the formulated problem enables personalization, by providing different models across different clusters, and generalization, by providing models different than the per-user models computed in isolation. We then provide an efficient algorithm based on the Parallel Direction Method of Multipliers (PDMM) to solve the proposed formulation in a federated server-users setting. Numerical experiments corroborate our findings. As an interesting byproduct, our results provide several generalizations to convex clustering.
    Multi-Task Learning as a Bargaining Game. (arXiv:2202.01017v1 [cs.LG])
    In Multi-task learning (MTL), a joint model is trained to simultaneously make predictions for several tasks. Joint training reduces computation costs and improves data efficiency; however, since the gradients of these different tasks may conflict, training a joint model for MTL often yields lower performance than its corresponding single-task counterparts. A common method for alleviating this issue is to combine per-task gradients into a joint update direction using a particular heuristic. In this paper, we propose viewing the gradients combination step as a bargaining game, where tasks negotiate to reach an agreement on a joint direction of parameter update. Under certain assumptions, the bargaining problem has a unique solution, known as the Nash Bargaining Solution, which we propose to use as a principled approach to multi-task learning. We describe a new MTL optimization procedure, Nash-MTL, and derive theoretical guarantees for its convergence. Empirically, we show that Nash-MTL achieves state-of-the-art results on multiple MTL benchmarks in various domains.
    Achieving Fairness at No Utility Cost via Data Reweighing. (arXiv:2202.00787v1 [cs.LG])
    With the fast development of algorithmic governance, fairness has become a compulsory property for machine learning models to suppress unintentional discrimination. In this paper, we focus on the pre-processing aspect for achieving fairness, and propose a data reweighing approach that only adjusts the weight for samples in the training phase. Different from most previous reweighing methods which assign a uniform weight for each (sub)group, we granularly model the influence from each training sample with regard to fairness and predictive utility, and compute individual weights based on the influence with constraints of both fairness and utility. Experimental results reveal that previous methods achieve fairness at a non-negligible cost of utility, while as a significant advantage, our approach can empirically release the tradeoff and obtain cost-free fairness. We demonstrate the cost-free fairness through vanilla classifiers and standard training processes on different fairness notions, compared to baseline methods on multiple tabular datasets.
    SPAGHETTI: Editing Implicit Shapes Through Part Aware Generation. (arXiv:2201.13168v1 [cs.GR] CROSS LISTED)
    Neural implicit fields are quickly emerging as an attractive representation for learning based techniques. However, adopting them for 3D shape modeling and editing is challenging. We introduce a method for $\mathbf{E}$diting $\mathbf{I}$mplicit $\mathbf{S}$hapes $\mathbf{T}$hrough $\mathbf{P}$art $\mathbf{A}$ware $\mathbf{G}$enera$\mathbf{T}$ion, permuted in short as SPAGHETTI. Our architecture allows for manipulation of implicit shapes by means of transforming, interpolating and combining shape segments together, without requiring explicit part supervision. SPAGHETTI disentangles shape part representation into extrinsic and intrinsic geometric information. This characteristic enables a generative framework with part-level control. The modeling capabilities of SPAGHETTI are demonstrated using an interactive graphical interface, where users can directly edit neural implicit shapes.
    Probabilistically Robust Learning: Balancing Average- and Worst-case Performance. (arXiv:2202.01136v1 [cs.LG])
    Many of the successes of machine learning are based on minimizing an averaged loss function. However, it is well-known that this paradigm suffers from robustness issues that hinder its applicability in safety-critical domains. These issues are often addressed by training against worst-case perturbations of data, a technique known as adversarial training. Although empirically effective, adversarial training can be overly conservative, leading to unfavorable trade-offs between nominal performance and robustness. To this end, in this paper we propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case by enforcing robustness to most rather than to all perturbations. From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning. From a practical point of view, we propose a novel algorithm based on risk-aware optimization that effectively balances average- and worst-case performance at a considerably lower computational cost relative to adversarial training. Our results on MNIST, CIFAR-10, and SVHN illustrate the advantages of this framework on the spectrum from average- to worst-case robustness.
    A Note on Rough Set Algebra and Core Regular Double Stone Algebras. (arXiv:2101.02313v3 [math.RA] UPDATED)
    Rough Set Theory (RST), first introduced by Pawlak in 1982, is an approach for dealing with information systems where knowledge is uncertain or incomplete.\cite{Pawlak} It is of fundamental importance in many subfields of artificial intelligence and cognitive science.\cite{RSTppf} Given a universe $U$ with an equivalence relation $\theta$, the pair $\langle U,\theta\rangle$ is referred to as an information system and we denote its collection of rough sets $R_\theta$. In our main Theorem we show $R_\theta$ with $|\theta_u| > 1\ \forall\ u \in U$ to be isomorphic to core regular double Stone algebras, CRDSA, that are complete and atomic, and that the crisp, or definable, sets form a complete atomistic Boolean algebra. These guarantees of infimum/supremeum for arbitrary subsets and formulations in terms of fundamental elements are likely useful if dealing with equivalence relations with an infinite number of partitions, such as projective Hilbert spaces. We further derive that every CRDSA is isomorphic to a subalgebra of a principal rough set algebra, $R_\theta$, for some approximation space $\langle U,\theta \rangle$. In our main Corollary we show explicitly how to embed $R_\theta$ into the CRDSA and first demonstrate by extending the culminating finite example of \cite{RCRDSA}. As our capstone, we consider the projective Hilbert space of complex numbers, $\mathbb{C}$ and show, among other things, the power set of the set of pure states is a complete, atomistic Boolean algebra. In closing, we suggest other Quantum relevant applications that may be useful, such as Hilbert spaces of operators
    Reinforcement learning of optimal active particle navigation. (arXiv:2202.00812v1 [cond-mat.soft])
    The development of self-propelled particles at the micro- and the nanoscale has sparked a huge potential for future applications in active matter physics, microsurgery, and targeted drug delivery. However, while the latter applications provoke the quest on how to optimally navigate towards a target, such as e.g. a cancer cell, there is still no simple way known to determine the optimal route in sufficiently complex environments. Here we develop a machine learning-based approach that allows us, for the first time, to determine the asymptotically optimal path of a self-propelled agent which can freely steer in complex environments. Our method hinges on policy gradient-based deep reinforcement learning techniques and, crucially, does not require any reward shaping or heuristics. The presented method provides a powerful alternative to current analytical methods to calculate optimal trajectories and opens a route towards a universal path planner for future intelligent active particles.
    Lagrangian Manifold Monte Carlo on Monge Patches. (arXiv:2202.00755v1 [stat.ME])
    The efficiency of Markov Chain Monte Carlo (MCMC) depends on how the underlying geometry of the problem is taken into account. For distributions with strongly varying curvature, Riemannian metrics help in efficient exploration of the target distribution. Unfortunately, they have significant computational overhead due to e.g. repeated inversion of the metric tensor, and current geometric MCMC methods using the Fisher information matrix to induce the manifold are in practice slow. We propose a new alternative Riemannian metric for MCMC, by embedding the target distribution into a higher-dimensional Euclidean space as a Monge patch and using the induced metric determined by direct geometric reasoning. Our metric only requires first-order gradient information and has fast inverse and determinants, and allows reducing the computational complexity of individual iterations from cubic to quadratic in the problem dimensionality. We demonstrate how Lagrangian Monte Carlo in this metric efficiently explores the target distributions.
    ADG-Pose: Automated Dataset Generation for Real-World Human Pose Estimation. (arXiv:2202.00753v1 [cs.CV])
    Recent advancements in computer vision have seen a rise in the prominence of applications using neural networks to understand human poses. However, while accuracy has been steadily increasing on State-of-the-Art datasets, these datasets often do not address the challenges seen in real-world applications. These challenges are dealing with people distant from the camera, people in crowds, and heavily occluded people. As a result, many real-world applications have trained on data that does not reflect the data present in deployment, leading to significant underperformance. This article presents ADG-Pose, a method for automatically generating datasets for real-world human pose estimation. These datasets can be customized to determine person distances, crowdedness, and occlusion distributions. Models trained with our method are able to perform in the presence of these challenges where those trained on other datasets fail. Using ADG-Pose, end-to-end accuracy for real-world skeleton-based action recognition sees a 20% increase on scenes with moderate distance and occlusion levels, and a 4X increase on distant scenes where other models failed to perform better than random.
    RescoreBERT: Discriminative Speech Recognition Rescoring with BERT. (arXiv:2202.01094v1 [eess.AS])
    Second-pass rescoring is an important component in automatic speech recognition (ASR) systems that is used to improve the outputs from a first-pass decoder by implementing a lattice rescoring or $n$-best re-ranking. While pretraining with a masked language model (MLM) objective has received great success in various natural language understanding (NLU) tasks, it has not gained traction as a rescoring model for ASR. Specifically, training a bidirectional model like BERT on a discriminative objective such as minimum WER (MWER) has not been explored. Here we where show how to train a BERT-based rescoring model with MWER loss, to incorporate the improvements of a discriminative loss into fine-tuning of deep bidirectional pretrained models for ASR. We propose a fusion strategy that incorporates the MLM into the discriminative training process to effectively distill the knowledge from a pretrained model. We further propose an alternative discriminative loss. We name this approach RescoreBERT, and evaluate it on the LibriSpeech corpus, and it reduces WER by 6.6%/3.4% relative on clean/other test sets over a BERT baseline without discriminative objective. We also evaluate our method on an internal dataset from a conversational agent and find that it reduces both latency and WER (by 3-8% relative) over an LSTM rescoring model.
    From Predictions to Decisions: The Importance of Joint Predictive Distributions. (arXiv:2107.09224v2 [cs.LG] UPDATED)
    A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.
    FROB: Few-shot ROBust Model for Classification and Out-of-Distribution Detection. (arXiv:2111.15487v2 [cs.LG] UPDATED)
    Nowadays, classification and Out-of-Distribution (OoD) detection in the few-shot setting remain challenging aims due to rarity and the limited samples in the few-shot setting, and because of adversarial attacks. Accomplishing these aims is important for critical systems in safety, security, and defence. In parallel, OoD detection is challenging since deep neural network classifiers set high confidence to OoD samples away from the training data. To address such limitations, we propose the Few-shot ROBust (FROB) model for classification and few-shot OoD detection. We devise FROB for improved robustness and reliable confidence prediction for few-shot OoD detection. We generate the support boundary of the normal class distribution and combine it with few-shot Outlier Exposure (OE). We propose a self-supervised learning few-shot confidence boundary methodology based on generative and discriminative models. The contribution of FROB is the combination of the generated boundary in a self-supervised learning manner and the imposition of low confidence at this learned boundary. FROB implicitly generates strong adversarial samples on the boundary and forces samples from OoD, including our boundary, to be less confident by the classifier. FROB achieves generalization to unseen OoD with applicability to unknown, in the wild, test sets that do not correlate to the training datasets. To improve robustness, FROB redesigns OE to work even for zero-shots. By including our boundary, FROB reduces the threshold linked to the model's few-shot robustness; it maintains the OoD performance approximately independent of the number of few-shots. The few-shot robustness analysis evaluation of FROB on different sets and on One-Class Classification (OCC) data shows that FROB achieves competitive performance and outperforms benchmarks in terms of robustness to the outlier few-shot sample population and variability.
    A deep residual learning implementation of Metamorphosis. (arXiv:2202.00676v1 [eess.IV])
    In medical imaging, most of the image registration methods implicitly assume a one-to-one correspondence between the source and target images (i.e., diffeomorphism). However, this is not necessarily the case when dealing with pathological medical images (e.g., presence of a tumor, lesion, etc.). To cope with this issue, the Metamorphosis model has been proposed. It modifies both the shape and the appearance of an image to deal with the geometrical and topological differences. However, the high computational time and load have hampered its applications so far. Here, we propose a deep residual learning implementation of Metamorphosis that drastically reduces the computational time at inference. Furthermore, we also show that the proposed framework can easily integrate prior knowledge of the localization of topological changes (e.g., segmentation masks) that can act as spatial regularization to correctly disentangle appearance and shape changes. We test our method on the BraTS 2021 dataset, showing that it outperforms current state-of-the-art methods in the alignment of images with brain tumors.
    Methodology for forecasting and optimization in IEEE-CIS 3rd Technical Challenge. (arXiv:2202.00894v1 [eess.SY])
    This report provides a description of the methodology I used in the IEEE-CIS 3rd Technical Challenge. For the forecast, I used a quantile regression forest approach using the solar variables provided by the Bureau of Meterology of Australia (BOM) and many of the weather variables from the European Centre for Medium-Range Weather Forecasting (ECMWF). Groups of buildings and all of the solar instances were trained together as they were observed to be closely correlated over time. Other variables used included Fourier values based on hour of day and day of year, and binary variables for combinations of days of the week. The start dates for the time series were carefully tuned based on phase 1 and cleaning and thresholding was used to reduce the observed error rate for each time series. For the optimization, a four-step approach was used using the forecast developed. First, a mixed-integer program (MIP) was solved for the recurring and recurring plus once-off activities, then each of these was extended using a mixed-integer quadratic program (MIQP). The general strategy was chosen from one of two ("array" from the "array" and "tuples" approaches) while the specific step improvement strategy was chosen from one of five ("no forced discharge").
    Exit Time Analysis for Approximations of Gradient Descent Trajectories Around Saddle Points. (arXiv:2006.01106v2 [math.OC] UPDATED)
    This paper considers the problem of understanding the exit time for trajectories of gradient-related first-order methods from saddle neighborhoods under some initial boundary conditions. Given the `flat' geometry around saddle points, first-order methods can struggle to escape these regions in a fast manner due to the small magnitudes of gradients encountered. In particular, while it is known that gradient-related first-order methods escape strict-saddle neighborhoods, existing analytic techniques do not explicitly leverage the local geometry around saddle points in order to control behavior of gradient trajectories. It is in this context that this paper puts forth a rigorous geometric analysis of the gradient-descent method around strict-saddle neighborhoods using matrix perturbation theory. In doing so, it provides a key result that can be used to generate an approximate gradient trajectory for any given initial conditions. In addition, the analysis leads to a linear exit-time solution for gradient-descent method under certain necessary initial conditions, which explicitly bring out the dependence on problem dimension, conditioning of the saddle neighborhood, and more, for a class of strict-saddle functions.
    Improving Uncertainty Calibration of Deep Neural Networks via Truth Discovery and Geometric Optimization. (arXiv:2106.14662v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs), despite their tremendous success in recent years, could still cast doubts on their predictions due to the intrinsic uncertainty associated with their learning process. Ensemble techniques and post-hoc calibrations are two types of approaches that have individually shown promise in improving the uncertainty calibration of DNNs. However, the synergistic effect of the two types of methods has not been well explored. In this paper, we propose a truth discovery framework to integrate ensemble-based and post-hoc calibration methods. Using the geometric variance of the ensemble candidates as a good indicator for sample uncertainty, we design an accuracy-preserving truth estimator with provably no accuracy drop. Furthermore, we show that post-hoc calibration can also be enhanced by truth discovery-regularized optimization. On large-scale datasets including CIFAR and ImageNet, our method shows consistent improvement against state-of-the-art calibration approaches on both histogram-based and kernel density-based evaluation metrics. Our codes are available at https://github.com/horsepurve/truly-uncertain.
    A Graph Based Neural Network Approach to Immune Profiling of Multiplexed Tissue Samples. (arXiv:2202.00813v1 [cs.LG])
    Multiplexed immunofluorescence provides an unprecedented opportunity for studying specific cell-to-cell and cell microenvironment interactions. We employ graph neural networks to combine features obtained from tissue morphology with measurements of protein expression to profile the tumour microenvironment associated with different tumour stages. Our framework presents a new approach to analysing and processing these complex multi-dimensional datasets that overcomes some of the key challenges in analysing these data and opens up the opportunity to abstract biologically meaningful interactions.
    Signal processing on simplicial complexes. (arXiv:2106.07471v2 [eess.SP] UPDATED)
    Higher-order networks have so far been considered primarily in the context of studying the structure of complex systems, i.e., the higher-order or multi-way relations connecting the constituent entities. More recently, a number of studies have considered dynamical processes that explicitly account for such higher-order dependencies, e.g., in the context of epidemic spreading processes or opinion formation. In this chapter, we focus on a closely related, but distinct third perspective: how can we use higher-order relationships to process signals and data supported on higher-order network structures. In particular, we survey how ideas from signal processing of data supported on regular domains, such as time series or images, can be extended to graphs and simplicial complexes. We discuss Fourier analysis, signal denoising, signal interpolation, and nonlinear processing through neural networks based on simplicial complexes. Key to our developments is the Hodge Laplacian matrix, a multi-relational operator that leverages the special structure of simplicial complexes and generalizes desirable properties of the Laplacian matrix in graph signal processing.
    Some Reflections on Drawing Causal Inference using Textual Data: Parallels Between Human Subjects and Organized Texts. (arXiv:2202.00848v1 [cs.CL])
    We examine the role of textual data as study units when conducting causal inference by drawing parallels between human subjects and organized texts. %in human population research. We elaborate on key causal concepts and principles, and expose some ambiguity and sometimes fallacies. To facilitate better framing a causal query, we discuss two strategies: (i) shifting from immutable traits to perceptions of them, and (ii) shifting from some abstract concept/property to its constituent parts, i.e., adopting a constructivist perspective of an abstract concept. We hope this article would raise the awareness of the importance of articulating and clarifying fundamental concepts before delving into developing methodologies when drawing causal inference using textual data.
    Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields. (arXiv:2106.05187v3 [cs.CV] UPDATED)
    We present implicit displacement fields, a novel representation for detailed 3D geometry. Inspired by a classic surface deformation technique, displacement mapping, our method represents a complex surface as a smooth base surface plus a displacement along the base's normal directions, resulting in a frequency-based shape decomposition, where the high frequency signal is constrained geometrically by the low frequency signal. Importantly, this disentanglement is unsupervised thanks to a tailored architectural design that has an innate frequency hierarchy by construction. We explore implicit displacement field surface reconstruction and detail transfer and demonstrate superior representational power, training stability and generalizability.
    Optimizing Sequential Experimental Design with Deep Reinforcement Learning. (arXiv:2202.00821v1 [cs.LG])
    Bayesian approaches developed to solve the optimal design of sequential experiments are mathematically elegant but computationally challenging. Recently, techniques using amortization have been proposed to make these Bayesian approaches practical, by training a parameterized policy that proposes designs efficiently at deployment time. However, these methods may not sufficiently explore the design space, require access to a differentiable probabilistic model and can only optimize over continuous design spaces. Here, we address these limitations by showing that the problem of optimizing policies can be reduced to solving a Markov decision process (MDP). We solve the equivalent MDP with modern deep reinforcement learning techniques. Our experiments show that our approach is also computationally efficient at deployment time and exhibits state-of-the-art performance on both continuous and discrete design spaces, even when the probabilistic model is a black box.
    Barycentric-alignment and invertibility for domain generalization. (arXiv:2109.01902v4 [cs.LG] UPDATED)
    We revisit Domain Generalization (DG) problem, where the hypotheses are composed of a common representation mapping followed by a labeling function. Popular DG methods optimize a well-known upper bound to the risk in the unseen domain. However, the bound contains a term that is not optimized due to its dual dependence on the representation mapping and the unknown optimal labeling function for the unseen domain. We derive a new upper bound free of terms having such dual dependence by imposing mild assumptions on the loss function and an invertibility requirement on the representation map when restricted to the low-dimensional data manifold. The derivation leverages old and recent transport inequalities that link optimal transport metrics with information-theoretic measures. Our bound motivates a new algorithm for DG comprising Wasserstein-2 barycenter cost for feature alignment and mutual information or autoencoders for enforcing approximate invertibility. Experiments on several datasets demonstrate superior performance compared to the state-of-the-art DG algorithms.
    NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs. (arXiv:2106.12144v2 [cs.CL] UPDATED)
    Conventional representation learning algorithms for knowledge graphs (KG) map each entity to a unique embedding vector. Such a shallow lookup results in a linear growth of memory consumption for storing the embedding matrix and incurs high computational costs when working with real-world KGs. Drawing parallels with subword tokenization commonly used in NLP, we explore the landscape of more parameter-efficient node embedding strategies with possibly sublinear memory requirements. To this end, we propose NodePiece, an anchor-based approach to learn a fixed-size entity vocabulary. In NodePiece, a vocabulary of subword/sub-entity units is constructed from anchor nodes in a graph with known relation types. Given such a fixed-size vocabulary, it is possible to bootstrap an encoding and embedding for any entity, including those unseen during training. Experiments show that NodePiece performs competitively in node classification, link prediction, and relation prediction tasks while retaining less than 10% of explicit nodes in a graph as anchors and often having 10x fewer parameters. To this end, we show that a NodePiece-enabled model outperforms existing shallow models on a large OGB WikiKG 2 graph having 70x fewer parameters.
    Context Uncertainty in Contextual Bandits with Applications to Recommender Systems. (arXiv:2202.00805v1 [cs.LG])
    Recurrent neural networks have proven effective in modeling sequential user feedbacks for recommender systems. However, they usually focus solely on item relevance and fail to effectively explore diverse items for users, therefore harming the system performance in the long run. To address this problem, we propose a new type of recurrent neural networks, dubbed recurrent exploration networks (REN), to jointly perform representation learning and effective exploration in the latent space. REN tries to balance relevance and exploration while taking into account the uncertainty in the representations. Our theoretical analysis shows that REN can preserve the rate-optimal sublinear regret even when there exists uncertainty in the learned representations. Our empirical study demonstrates that REN can achieve satisfactory long-term rewards on both synthetic and real-world recommendation datasets, outperforming state-of-the-art models.
    Learning to Guide and to Be Guided in the Architect-Builder Problem. (arXiv:2112.07342v3 [cs.LG] UPDATED)
    We are interested in interactive agents that learn to coordinate, namely, a $builder$ -- which performs actions but ignores the goal of the task, i.e. has no access to rewards -- and an $architect$ which guides the builder towards the goal of the task. We define and explore a formal setting where artificial agents are equipped with mechanisms that allow them to simultaneously learn a task while at the same time evolving a shared communication protocol. Ideally, such learning should only rely on high-level communication priors and be able to handle a large variety of tasks and meanings while deriving communication protocols that can be reused across tasks. We present the Architect-Builder Problem (ABP): an asymmetrical setting in which an architect must learn to guide a builder towards constructing a specific structure. The architect knows the target structure but cannot act in the environment and can only send arbitrary messages to the builder. The builder on the other hand can act in the environment, but receives no rewards nor has any knowledge about the task, and must learn to solve it relying only on the messages sent by the architect. Crucially, the meaning of messages is initially not defined nor shared between the agents but must be negotiated throughout learning. Under these constraints, we propose Architect-Builder Iterated Guiding (ABIG), a solution to ABP where the architect leverages a learned model of the builder to guide it while the builder uses self-imitation learning to reinforce its guided behavior. We analyze the key learning mechanisms of ABIG and test it in 2D tasks involving grasping cubes, placing them at a given location, or building various shapes. ABIG results in a low-level, high-frequency, guiding communication protocol that not only enables an architect-builder pair to solve the task at hand, but that can also generalize to unseen tasks.
    Flow-based Algorithms for Improving Clusters: A Unifying Framework, Software, and Performance. (arXiv:2004.09608v3 [cs.LG] UPDATED)
    Clustering points in a vector space or nodes in a graph is a ubiquitous primitive in statistical data analysis, and it is commonly used for exploratory data analysis. In practice, it is often of interest to "refine" or "improve" a given cluster that has been obtained by some other method. In this survey, we focus on principled algorithms for this cluster improvement problem. Many such cluster improvement algorithms are flow-based methods, by which we mean that operationally they require the solution of a sequence of maximum flow problems on a (typically implicitly) modified data graph. These cluster improvement algorithms are powerful, both in theory and in practice, but they have not been widely adopted for problems such as community detection, local graph clustering, semi-supervised learning, etc. Possible reasons for this are: the steep learning curve for these algorithms; the lack of efficient and easy to use software; and the lack of detailed numerical experiments on real-world data that demonstrate their usefulness. Our objective here is to address these issues. To do so, we guide the reader through the whole process of understanding how to implement and apply these powerful algorithms. We present a unifying fractional programming optimization framework that permits us to distill, in a simple way, the crucial components of all these algorithms. It also makes apparent similarities and differences between related methods. Viewing these cluster improvement algorithms via a fractional programming framework suggests directions for future algorithm development. Finally, we develop efficient implementations of these algorithms in our LocalGraphClustering Python package, and we perform extensive numerical experiments to demonstrate the performance of these methods on social networks and image-based data graphs.
    Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead. (arXiv:2105.09121v2 [cs.LG] UPDATED)
    Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early exiting. In this paper, we introduce a novel architecture for early exiting based on the vision transformer architecture, as well as a fine-tuning strategy that significantly increase the accuracy of early exit branches compared to conventional approaches while introducing less overhead. Through extensive experiments on image and audio classification as well as audiovisual crowd counting, we show that our method works for both classification and regression problems, and in both single- and multi-modal settings. Additionally, we introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis, that can lead to a more fine-grained dynamic inference.  ( 2 min )
    Beyond Target Networks: Improving Deep $Q$-learning with Functional Regularization. (arXiv:2106.02613v3 [stat.ML] UPDATED)
    Much of the recent successes in Deep Reinforcement Learning have been based on minimizing the squared Bellman error. However, training is often unstable due to fast-changing target Q-values, and target networks are employed to regularize the Q-value estimation and stabilize training by using an additional set of lagging parameters. Despite their advantages, target networks are potentially an inflexible way to regularize Q-values which may ultimately slow down training. In this work, we address this issue by augmenting the squared Bellman error with a functional regularizer. Unlike target networks, the regularization we propose here is explicit and enables us to use up-to-date parameters as well as control the regularization. This leads to a faster yet more stable training method. We analyze the convergence of our method theoretically and empirically validate our predictions on simple environments as well as on a suite of Atari environments. We demonstrate empirical improvements over target network based methods in terms of both sample efficiency and performance. In summary, our approach provides a fast and stable alternative to replace the standard squared Bellman error  ( 2 min )
    Automotive Parts Assessment: Applying Real-time Instance-Segmentation Models to Identify Vehicle Parts. (arXiv:2202.00884v1 [cs.CV])
    The problem of automated car damage assessment presents a major challenge in the auto repair and damage assessment industry. The domain has several application areas ranging from car assessment companies such as car rentals and body shops to accidental damage assessment for car insurance companies. In vehicle assessment, the damage can take any form including scratches, minor and major dents to missing parts. More often, the assessment area has a significant level of noise such as dirt, grease, oil or rush that makes an accurate identification challenging. Moreover, the identification of a particular part is the first step in the repair industry to have an accurate labour and part assessment where the presence of different car models, shapes and sizes makes the task even more challenging for a machine-learning model to perform well. To address these challenges, this research explores and applies various instance segmentation methodologies to evaluate the best performing models. The scope of this work focusses on two genres of real-time instance segmentation models due to their industrial significance, namely SipMask and Yolact. These methodologies are evaluated against a previously reported car parts dataset (DSMLR) and an internally curated dataset extracted from local car repair workshops. The Yolact-based part localization and segmentation method performed well when compared to other real-time instance mechanisms with a mAP of 66.5. For the workshop repair dataset, SipMask++ reported better accuracies for object detection with a mAP of 57.0 with outcomes for AP_IoU=.50and AP_IoU=.75 reporting 72.0 and 67.0 respectively while Yolact was found to be a better performer for AP_s with 44.0 and 2.6 for object detection and segmentation categories respectively.  ( 2 min )
    Disaster Tweets Classification using BERT-Based Language Model. (arXiv:2202.00795v1 [cs.CL])
    Social networking services have became an important communication channel in time of emergency. The aim of this study is to create a machine learning language model that is able to investigate if a person or area was in danger or not. The ubiquitousness of smartphones enables people to announce an emergency they are observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies). Design a language model that is able to understand and acknowledge when a disaster is happening based on the social network posts will become more and more necessary over time.
    Compositional Coding Capsule Network with K-Means Routing for Text Classification. (arXiv:1810.09177v4 [cs.LG] UPDATED)
    Text classification is a challenging problem which aims to identify the category of texts. In the process of training, word embeddings occupy a large part of parameters. Under the limitation of limited computing resources, it indirectly limits the ability of subsequent network designs. In order to reduce the number of parameters, the compositional coding mechanism has been proposed recently. Based on this, this paper further explores compositional coding and proposes a compositional weighted coding method. And we apply capsule network to model the relationship between word embeddings, a new routing algorithm, which is based on k-means clustering theory, is proposed to fully mine the relationship between word embeddings. Combined with our compositional weighted coding method and the routing algorithm, we design a neural network for text classification. Experiments conducted on eight challenging text classification datasets show that the proposed method achieves competitive accuracy compared to the state-of-the-art approach with significantly fewer parameters.  ( 2 min )
    The Role of Linear Layers in Nonlinear Interpolating Networks. (arXiv:2202.00856v1 [cs.LG])
    This paper explores the implicit bias of overparameterized neural networks of depth greater than two layers. Our framework considers a family of networks of varying depth that all have the same capacity but different implicitly defined representation costs. The representation cost of a function induced by a neural network architecture is the minimum sum of squared weights needed for the network to represent the function; it reflects the function space bias associated with the architecture. Our results show that adding linear layers to a ReLU network yields a representation cost that reflects a complex interplay between the alignment and sparsity of ReLU units. Specifically, using a neural network to fit training data with minimum representation cost yields an interpolating function that is constant in directions perpendicular to a low-dimensional subspace on which a parsimonious interpolant exists.  ( 2 min )
    HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection. (arXiv:2202.00874v1 [cs.SD])
    Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memories and long training time, meanwhile relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class featuremaps, thus enabling the model for the audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35% model parameters and 15% training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.  ( 2 min )
    On Linear Separability under Linear Compression with Applications to Hard Support Vector Machine. (arXiv:2202.01118v1 [cs.LG])
    This paper investigates the theoretical problem of maintaining linear separability of the data-generating distribution under linear compression. While it has been long known that linear separability may be maintained by linear transformations that approximately preserve the inner products between the domain points, the limit to which the inner products are preserved in order to maintain linear separability was unknown. In this paper, we show that linear separability is maintained as long as the distortion of the inner products is smaller than the squared margin of the original data-generating distribution. The proof is mainly based on the geometry of hard support vector machines (SVM) extended from the finite set of training examples to the (possibly) infinite domain of the data-generating distribution. As applications, we derive bounds on the (i) compression length of random sub-Gaussian matrices; and (ii) generalization error for compressive learning with hard-SVM.  ( 2 min )
    Structure-preserving GANs. (arXiv:2202.01129v1 [cs.LG])
    Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minmax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection on the invariant discriminator space, using the conditional expectation with respect to the $\sigma$-algebra associated to the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory, and show our proposed methods achieve significantly improved sample fidelity and diversity -- almost an order of magnitude measured in Fr\'echet Inception Distance -- especially in the small data regime.  ( 2 min )
    Posterior temperature optimized Bayesian models for inverse problems in medical imaging. (arXiv:2202.00986v1 [eess.IV])
    We present Posterior Temperature Optimized Bayesian Inverse Models (POTOBIM), an unsupervised Bayesian approach to inverse problems in medical imaging using mean-field variational inference with a fully tempered posterior. Bayesian methods exhibit useful properties for approaching inverse tasks, such as tomographic reconstruction or image denoising. A suitable prior distribution introduces regularization, which is needed to solve the ill-posed problem and reduces overfitting the data. In practice, however, this often results in a suboptimal posterior temperature, and the full potential of the Bayesian approach is not being exploited. In POTOBIM, we optimize both the parameters of the prior distribution and the posterior temperature with respect to reconstruction accuracy using Bayesian optimization with Gaussian process regression. Our method is extensively evaluated on four different inverse tasks on a variety of modalities with images from public data sets and we demonstrate that an optimized posterior temperature outperforms both non-Bayesian and Bayesian approaches without temperature optimization. The use of an optimized prior distribution and posterior temperature leads to improved accuracy and uncertainty estimation and we show that it is sufficient to find these hyperparameters per task domain. Well-tempered posteriors yield calibrated uncertainty, which increases the reliability in the predictions. Our source code is publicly available at github.com/Cardio-AI/mfvi-dip-mia.
    AlphaDesign: A graph protein design method and benchmark on AlphaFoldDB. (arXiv:2202.01079v1 [q-bio.QM])
    While DeepMind has tentatively solved protein folding, its inverse problem -- protein design which predicts protein sequences from their 3D structures -- still faces significant challenges. Particularly, the lack of large-scale standardized benchmark and poor accuray hinder the research progress. In order to standardize comparisons and draw more research interest, we use AlphaFold DB, one of the world's largest protein structure databases, to establish a new graph-based benchmark -- AlphaDesign. Based on AlphaDesign, we propose a new method called ADesign to improve accuracy by introducing protein angles as new features, using a simplified graph transformer encoder (SGT), and proposing a confidence-aware protein decoder (CPD). Meanwhile, SGT and CPD also improve model efficiency by simplifying the training and testing procedures. Experiments show that ADesign significantly outperforms previous graph models, e.g., the average accuracy is improved by 8\%, and the inference speed is 40+ times faster than before.  ( 2 min )
    A Semi-Supervised Deep Clustering Pipeline for Mining Intentions From Texts. (arXiv:2202.00802v1 [cs.CL])
    Mining the latent intentions from large volumes of natural language inputs is a key step to help data analysts design and refine Intelligent Virtual Assistants (IVAs) for customer service. To aid data analysts in this task we present Verint Intent Manager (VIM), an analysis platform that combines unsupervised and semi-supervised approaches to help analysts quickly surface and organize relevant user intentions from conversational texts. For the initial exploration of data we make use of a novel unsupervised and semi-supervised pipeline that integrates the fine-tuning of high performing language models, a distributed k-NN graph building method and community detection techniques for mining the intentions and topics from texts. The fine-tuning step is necessary because pre-trained language models cannot encode texts to efficiently surface particular clustering structures when the target texts are from an unseen domain or the clustering task is not topic detection. For flexibility we deploy two clustering approaches: where the number of clusters must be specified and where the number of clusters is detected automatically with comparable clustering quality but at the expense of additional computation time. We describe the application and deployment and demonstrate its performance using BERT on three text mining tasks. Our experiments show that BERT begins to produce better task-aware representations using a labeled subset as small as 0.5% of the task data. The clustering quality exceeds the state-of-the-art results when BERT is fine-tuned with labeled subsets of only 2.5% of the task data. As deployed in the VIM application, this flexible clustering pipeline produces high quality results, improving the performance of data analysts and reducing the time it takes to surface intentions from customer service data, thereby reducing the time it takes to build and deploy IVAs in new domains.
    SHAFF: Fast and consistent SHApley eFfect estimates via random Forests. (arXiv:2105.11724v3 [stat.ML] UPDATED)
    Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating Shapley effects is a challenging task, because of the computational complexity and the conditional expectation estimates. Accordingly, existing Shapley algorithms have flaws: a costly running time, or a bias when input variables are dependent. Therefore, we introduce SHAFF, SHApley eFfects via random Forests, a fast and accurate Shapley effect estimate, even when input variables are dependent. We show SHAFF efficiency through both a theoretical analysis of its consistency, and the practical performance improvements over competitors with extensive experiments. An implementation of SHAFF in C++ and R is available online.  ( 2 min )
    Federated Learning Challenges and Opportunities: An Outlook. (arXiv:2202.00807v1 [cs.LG])
    Federated learning (FL) has been developed as a promising framework to leverage the resources of edge devices, enhance customers' privacy, comply with regulations, and reduce development costs. Although many methods and applications have been developed for FL, several critical challenges for practical FL systems remain unaddressed. This paper provides an outlook on FL development, categorized into five emerging directions of FL, namely algorithm foundation, personalization, hardware and security constraints, lifelong learning, and nonstandard data. Our unique perspectives are backed by practical observations from large-scale federated systems for edge devices.  ( 2 min )
    Reachability Analysis of Neural Feedback Loops. (arXiv:2108.04140v2 [eess.SY] UPDATED)
    Neural Networks (NNs) can provide major empirical performance improvements for closed-loop systems, but they also introduce challenges in formally analyzing those systems' safety properties. In particular, this work focuses on estimating the forward reachable set of \textit{neural feedback loops} (closed-loop systems with NN controllers). Recent work provides bounds on these reachable sets, but the computationally tractable approaches yield overly conservative bounds (thus cannot be used to verify useful properties), and the methods that yield tighter bounds are too intensive for online computation. This work bridges the gap by formulating a convex optimization problem for the reachability analysis of closed-loop systems with NN controllers. While the solutions are less tight than previous (semidefinite program-based) methods, they are substantially faster to compute, and some of those computational time savings can be used to refine the bounds through new input set partitioning techniques, which is shown to dramatically reduce the tightness gap. The new framework is developed for systems with uncertainty (e.g., measurement and process noise) and nonlinearities (e.g., polynomial dynamics), and thus is shown to be applicable to real-world systems. To inform the design of an initial state set when only the target state set is known/specified, a novel algorithm for backward reachability analysis is also provided, which computes the set of states that are guaranteed to lead to the target set. The numerical experiments show that our approach (based on linear relaxations and partitioning) gives a $5\times$ reduction in conservatism in $150\times$ less computation time compared to the state-of-the-art. Furthermore, experiments on quadrotor, 270-state, and polynomial systems demonstrate the method's ability to handle uncertainty sources, high dimensionality, and nonlinear dynamics, respectively.  ( 2 min )
    A Novel Hessian-Free Bilevel Optimizer via Evolution Strategies. (arXiv:2110.07004v2 [cs.LG] UPDATED)
    Bilevel optimization has arisen as a powerful tool for solving many modern machine learning problems. However, due to the nested structure of bilevel optimization, even gradient-based methods require second-order derivative approximations via Jacobian- or/and Hessian-vector computations, which can be very costly in practice. In this work, we propose a novel Hessian-free bilevel algorithm, which adopts the Evolution Strategies (ES) method to approximate the response Jacobian matrix in the hypergradient of the bilevel problem, and hence fully eliminates all second-order computations. We call our algorithm as ESJ (which stands for the ES-based Jacobian method) and further extend it to the stochastic setting as ESJ-S. Theoretically, we show that both ESJ and ESJ-S are guaranteed to converge. Experimentally, we demonstrate that the proposed algorithms outperform baseline bilevel optimizers on various bilevel problems. Particularly, in our experiment on few-shot meta-learning of ResNet-12 network over the miniImageNet dataset, we show that our algorithm outperforms baseline meta-learning algorithms, while other baseline bilevel optimizers do not solve such meta-learning problems within a comparable time frame.  ( 2 min )
    Can Stochastic Gradient Langevin Dynamics Provide Differential Privacy for Deep Learning?. (arXiv:2110.05057v3 [cs.LG] UPDATED)
    Bayesian learning via Stochastic Gradient Langevin Dynamics (SGLD) has been suggested for differentially private learning. While previous research provides differential privacy bounds for SGLD at the initial steps of the algorithm or when close to convergence, the question of what differential privacy guarantees can be made in between remains unanswered. This interim region is of great importance, especially for Bayesian neural networks, as it is hard to guarantee convergence to the posterior. This paper shows that using SGLD might result in unbounded privacy loss for this interim region, even when sampling from the posterior is as differentially private as desired.  ( 2 min )
    The Exponentially Tilted Gaussian Prior for Variational Autoencoders. (arXiv:2111.15646v2 [cs.LG] UPDATED)
    An important property for deep neural networks is the ability to perform robust out-of-distribution detection on previously unseen data. This property is essential for safety purposes when deploying models for real world applications. Recent studies show that probabilistic generative models can perform poorly on this task, which is surprising given that they seek to estimate the likelihood of training data. To alleviate this issue, we propose the exponentially tilted Gaussian prior distribution for the Variational Autoencoder (VAE) which pulls points onto the surface of a hyper-sphere in latent space. This achieves state-of-the art results on the area under the curve-receiver operator characteristics metric using just the negative log-likelihood that the VAE naturally assigns. Because this prior is a simple modification of the traditional VAE prior, it is faster and easier to implement than competitive methods.  ( 2 min )
    Using Ballistocardiography for Sleep Stage Classification. (arXiv:2202.01038v1 [cs.HC])
    A practical way of detecting sleep stages has become more necessary as we begin to learn about the vast effects that sleep has on people's lives. The current methods of sleep stage detection are expensive, invasive to a person's sleep, and not practical in a modern home setting. While the method of detecting sleep stages via the monitoring of brain activity, muscle activity, and eye movement, through electroencephalogram in a lab setting, provide the gold standard for detection, this paper aims to investigate a new method that will allow a person to gain similar insight and results with no obtrusion to their normal sleeping habits. Ballistocardiography (BCG) is a non-invasive sensing technology that collects information by measuring the ballistic forces generated by the heart. Using features extracted from BCG such as time of usage, heart rate, respiration rate, relative stroke volume, and heart rate variability, we propose to implement a sleep stage detection algorithm and compare it against sleep stages extracted from a Fitbit Sense Smart Watch. The accessibility, ease of use, and relatively-low cost of the BCG offers many applications and advantages for using this device. By standardizing this device, people will be able to benefit from the BCG in analyzing their own sleep patterns and draw conclusions on their sleep efficiency. This work demonstrates the feasibility of using BCG for an accurate and non-invasive sleep monitoring method that can be set up in the comfort of a one's personal sleep environment.  ( 2 min )
    Improving Screening Processes via Calibrated Subset Selection. (arXiv:2202.01147v1 [cs.LG])
    Many selection processes such as finding patients qualifying for a medical trial or retrieval pipelines in search engines consist of multiple stages, where an initial screening stage focuses the resources on shortlisting the most promising candidates. In this paper, we investigate what guarantees a screening classifier can provide, independently of whether it is constructed manually or trained. We find that current solutions do not enjoy distribution-free theoretical guarantees -- we show that, in general, even for a perfectly calibrated classifier, there always exist specific pools of candidates for which its shortlist is suboptimal. Then, we develop a distribution-free screening algorithm -- called Calibrated Subset Selection (CSS) -- that, given any classifier and some amount of calibration data, finds near-optimal shortlists of candidates that contain a desired number of qualified candidates in expectation. Moreover, we show that a variant of our algorithm that calibrates a given classifier multiple times across specific groups can create shortlists with provable diversity guarantees. Experiments on US Census survey data validate our theoretical results and show that the shortlists provided by our algorithm are superior to those provided by several competitive baselines.  ( 2 min )
    VC-PCR: A Prediction Method based on Supervised Variable Selection and Clustering. (arXiv:2202.00975v1 [stat.ML])
    Sparse linear prediction methods suffer from decreased prediction accuracy when the predictor variables have cluster structure (e.g. there are highly correlated groups of variables). To improve prediction accuracy, various methods have been proposed to identify variable clusters from the data and integrate cluster information into a sparse modeling process. But none of these methods achieve satisfactory performance for prediction, variable selection and variable clustering simultaneously. This paper presents Variable Cluster Principal Component Regression (VC-PCR), a prediction method that supervises variable selection and variable clustering in order to solve this problem. Experiments with real and simulated data demonstrate that, compared to competitor methods, VC-PCR achieves better prediction, variable selection and clustering performance when cluster structure is present.  ( 2 min )
    FriendlyCore: Practical Differentially Private Aggregation. (arXiv:2110.10132v2 [cs.LG] UPDATED)
    Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or a large number of data points that is required for accurate results. We propose a simple and practical tool, $\mathsf{FriendlyCore}$, that takes a set of points ${\cal D}$ from an unrestricted (pseudo) metric space as input. When ${\cal D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a ``stable'' subset ${\cal C} \subseteq {\cal D}$ that includes all points, except possibly few outliers, and is {\em certified} to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering tasks such as $k$-means and $k$-GMM, outperforming tailored methods.  ( 2 min )
    Chaining Value Functions for Off-Policy Learning. (arXiv:2201.06468v2 [cs.LG] UPDATED)
    To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore it can be interpreted as estimating a novel objective -- that we call a `k-step expedition' -- of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird's counter example and observe favourable results.  ( 2 min )
    Deep reinforced learning enables solving rich discrete-choice life cycle models to analyze social security reforms. (arXiv:2010.13471v3 [econ.GN] UPDATED)
    Discrete-choice life cycle models of labor supply can be used to estimate how social security reforms influence employment rate. In a life cycle model, optimal employment choices during the life course of an individual must be solved. Mostly, life cycle models have been solved with dynamic programming, which is not feasible when the state space is large, as often is the case in a realistic life cycle model. Solving a complex life cycle model requires the use of approximate methods, such as reinforced learning algorithms. We compare how well a deep reinforced learning algorithm ACKTR and dynamic programming solve a relatively simple life cycle model. To analyze results, we use a selection of statistics and also compare the resulting optimal employment choices at various states. The statistics demonstrate that ACKTR yields almost as good results as dynamic programming. Qualitatively, dynamic programming yields more spiked aggregate employment profiles than ACKTR. The results obtained with ACKTR provide a good, yet not perfect, approximation to the results of dynamic programming. In addition to the baseline case, we analyze two social security reforms: (1) an increase of retirement age, and (2) universal basic income. Our results suggest that reinforced learning algorithms can be of significant value in developing social security reforms.  ( 2 min )
    Lipschitz-constrained Unsupervised Skill Discovery. (arXiv:2202.00914v1 [cs.LG])
    We study the problem of unsupervised skill discovery, whose goal is to learn a set of diverse and useful skills with no external reward. There have been a number of skill discovery methods based on maximizing the mutual information (MI) between skills and states. However, we point out that their MI objectives usually prefer static skills to dynamic ones, which may hinder the application for downstream tasks. To address this issue, we propose Lipschitz-constrained Skill Discovery (LSD), which encourages the agent to discover more diverse, dynamic, and far-reaching skills. Another benefit of LSD is that its learned representation function can be utilized for solving goal-following downstream tasks even in a zero-shot manner - i.e., without further training or complex planning. Through experiments on various MuJoCo robotic locomotion and manipulation environments, we demonstrate that LSD outperforms previous approaches in terms of skill diversity, state space coverage, and performance on seven downstream tasks including the challenging task of following multiple goals on Humanoid. Our code and videos are available at https://shpark.me/projects/lsd/.  ( 2 min )
    Boosting Video Captioning with Dynamic Loss Network. (arXiv:2107.11707v3 [cs.CV] UPDATED)
    Video captioning is one of the challenging problems at the intersection of vision and language, having many real-life applications in video retrieval, video surveillance, assisting visually challenged people, Human-machine interface, and many more. Recent deep learning based methods have shown promising results but are still on the lower side than other vision tasks (such as image classification, object detection). A significant drawback with existing video captioning methods is that they are optimized over cross-entropy loss function, which is uncorrelated to the de facto evaluation metrics (BLEU, METEOR, CIDER, ROUGE). In other words, cross-entropy is not a proper surrogate of the true loss function for video captioning. To mitigate this, methods like REINFORCE, Actor-Critic, and Minimum Risk Training (MRT) have been applied but have limitations and are not very effective. This paper proposes an alternate solution by introducing a dynamic loss network (DLN), providing an additional feedback signal that reflects the evaluation metrics directly. Our solution proves to be more efficient than other solutions and can be easily adapted to similar tasks. Our results on Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSRVTT) datasets outperform previous methods.  ( 2 min )
    Auto-Transfer: Learning to Route Transferrable Representations. (arXiv:2202.01011v1 [cs.LG])
    Knowledge transfer between heterogeneous source and target networks and tasks has received a lot of attention in recent times as large amounts of quality labelled data can be difficult to obtain in many applications. Existing approaches typically constrain the target deep neural network (DNN) feature representations to be close to the source DNNs feature representations, which can be limiting. We, in this paper, propose a novel adversarial multi-armed bandit approach which automatically learns to route source representations to appropriate target representations following which they are combined in meaningful ways to produce accurate target models. We see upwards of 5% accuracy improvements compared with the state-of-the-art knowledge transfer methods on four benchmark (target) image datasets CUB200, Stanford Dogs, MIT67, and Stanford40 where the source dataset is ImageNet. We qualitatively analyze the goodness of our transfer scheme by showing individual examples of the important features our target network focuses on in different layers compared with the (closest) competitors. We also observe that our improvement over other methods is higher for smaller target datasets making it an effective tool for small data applications that may benefit from transfer learning.  ( 2 min )
    Survival Regression with Proper Scoring Rules and Monotonic Neural Networks. (arXiv:2103.14755v2 [stat.ML] UPDATED)
    We consider frequently used scoring rules for right-censored survival regression models such as time-dependent concordance, survival-CRPS, integrated Brier score and integrated binomial log-likelihood, and prove that neither of them is a proper scoring rule. This means that the true survival distribution may be scored worse than incorrect distributions, leading to inaccurate estimation. We prove that, in contrast to these scores, the right-censored log-likelihood is a proper scoring rule, i.e., the highest expected score is achieved by the true distribution. Despite this, modern feed-forward neural-network-based survival regression models are unable to train and validate directly on the right-censored log-likelihood, due to its intractability, and resort to the aforementioned alternatives, i.e., non-proper scoring rules. We therefore propose a simple novel survival regression method capable of directly optimizing log-likelihood using a monotonic restriction on the time-dependent weights, coined SurvivalMonotonic-net (SuMo-net). SuMo-net achieves state-of-the-art log-likelihood scores across several datasets with 20--100$\times$ computational speedup on inference over existing state-of-the-art neural methods, and is readily applicable to datasets with several million observations.  ( 2 min )
    A Systematic Comparison of Architectures for Document-Level Sentiment Classification. (arXiv:2002.08131v2 [cs.CL] UPDATED)
    Documents are composed of smaller pieces - paragraphs, sentences, and tokens - that have complex relationships between one another. Sentiment classification models that take into account the structure inherent in these documents have a theoretical advantage over those that do not. At the same time, transfer learning models based on language model pretraining have shown promise for document classification. However, these two paradigms have not been systematically compared and it is not clear under which circumstances one approach is better than the other. In this work we empirically compare hierarchical models and transfer learning for document-level sentiment classification. We show that non-trivial hierarchical models outperform previous baselines and transfer learning on document-level sentiment classification in five languages.  ( 2 min )
    Graph-wise Common Latent Factor Extraction for Unsupervised Graph Representation Learning. (arXiv:2112.08830v2 [cs.LG] UPDATED)
    Unsupervised graph-level representation learning plays a crucial role in a variety of tasks such as molecular property prediction and community analysis, especially when data annotation is expensive. Currently, most of the best-performing graph embedding methods are based on Infomax principle. The performance of these methods highly depends on the selection of negative samples and hurt the performance, if the samples were not carefully selected. Inter-graph similarity-based methods also suffer if the selected set of graphs for similarity matching is low in quality. To address this, we focus only on utilizing the current input graph for embedding learning. We are motivated by an observation from real-world graph generation processes where the graphs are formed based on one or more global factors which are common to all elements of the graph (e.g., topic of a discussion thread, solubility level of a molecule). We hypothesize extracting these common factors could be highly beneficial. Hence, this work proposes a new principle for unsupervised graph representation learning: Graph-wise Common latent Factor EXtraction (GCFX). We further propose a deep model for GCFX, deepGCFX, based on the idea of reversing the above-mentioned graph generation process which could explicitly extract common latent factors from an input graph and achieve improved results on downstream tasks to the current state-of-the-art. Through extensive experiments and analysis, we demonstrate that, while extracting common latent factors is beneficial for graph-level tasks to alleviate distractions caused by local variations of individual nodes or local neighbourhoods, it also benefits node-level tasks by enabling long-range node dependencies, especially for disassortative graphs.
    AutoML for Contextual Bandits. (arXiv:1909.03212v2 [cs.LG] UPDATED)
    Contextual Bandits is one of the widely popular techniques used in applications such as personalization, recommendation systems, mobile health, causal marketing etc . As a dynamic approach, it can be more efficient than standard A/B testing in minimizing regret. We propose an end to end automated meta-learning pipeline to approximate the optimal Q function for contextual bandits problems. We see that our model is able to perform much better than random exploration, being more regret efficient and able to converge with a limited number of samples, while remaining very general and easy to use due to the meta-learning approach. We used a linearly annealed e-greedy exploration policy to define the exploration vs exploitation schedule. We tested the system on a synthetic environment to characterize it fully and we evaluated it on some open source datasets to benchmark against prior work. We see that our model outperforms or performs comparatively to other models while requiring no tuning nor feature engineering.  ( 2 min )
    Using Deep Learning to Bootstrap Abstractions for Hierarchical Robot Planning. (arXiv:2202.00907v1 [cs.RO])
    This paper addresses the problem of learning abstractions that boost robot planning performance while providing strong guarantees of reliability. Although state-of-the-art hierarchical robot planning algorithms allow robots to efficiently compute long-horizon motion plans for achieving user desired tasks, these methods typically rely upon environment-dependent state and action abstractions that need to be hand-designed by experts. We present a new approach for bootstrapping the entire hierarchical planning process. It shows how abstract states and actions for new environments can be computed automatically using the critical regions predicted by a deep neural-network with an auto-generated robot specific architecture. It uses the learned abstractions in a novel multi-source bi-directional hierarchical robot planning algorithm that is sound and probabilistically complete. An extensive empirical evaluation on twenty different settings using holonomic and non-holonomic robots shows that (a) the learned abstractions provide the information necessary for efficient multi-source hierarchical planning; and that (b) this approach of learning abstraction and planning outperforms state-of-the-art baselines by nearly a factor of ten in terms of planning time on test environments not seen during training.  ( 2 min )
    MPVNN: Mutated Pathway Visible Neural Network Architecture for Interpretable Prediction of Cancer-specific Survival Risk. (arXiv:2202.00882v1 [q-bio.QM])
    Survival risk prediction using gene expression data is important in making treatment decisions in cancer. Standard neural network (NN) survival analysis models are black boxes with lack of interpretability. More interpretable visible neural network (VNN) architectures are designed using biological pathway knowledge. But they do not model how pathway structures can change for particular cancer types. We propose a novel Mutated Pathway VNN or MPVNN architecture, designed using prior signaling pathway knowledge and gene mutation data-based edge randomization simulating signal flow disruption. As a case study, we use the PI3K-Akt pathway and demonstrate overall improved cancer-specific survival risk prediction results of MPVNN over standard non-NN and other similar sized NN survival analysis methods. We show that trained MPVNN architecture interpretation, which points to smaller sets of genes connected by signal flow within the PI3K-Akt pathway that are important in risk prediction for particular cancer types, is reliable.  ( 2 min )
    Interpretability for Multimodal Emotion Recognition using Concept Activation Vectors. (arXiv:2202.01072v1 [cs.LG])
    Multimodal Emotion Recognition refers to the classification of input video sequences into emotion labels based on multiple input modalities (usually video, audio and text). In recent years, Deep Neural networks have shown remarkable performance in recognizing human emotions, and are on par with human-level performance on this task. Despite the recent advancements in this field, emotion recognition systems are yet to be accepted for real world setups due to the obscure nature of their reasoning and decision-making process. Most of the research in this field deals with novel architectures to improve the performance for this task, with a few attempts at providing explanations for these models' decisions. In this paper, we address the issue of interpretability for neural networks in the context of emotion recognition using Concept Activation Vectors (CAVs). To analyse the model's latent space, we define human-understandable concepts specific to Emotion AI and map them to the widely-used IEMOCAP multimodal database. We then evaluate the influence of our proposed concepts at multiple layers of the Bi-directional Contextual LSTM (BC-LSTM) network to show that the reasoning process of neural networks for emotion recognition can be represented using human-understandable concepts. Finally, we perform hypothesis testing on our proposed concepts to show that they are significant for interpretability of this task.  ( 2 min )
    L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources. (arXiv:2202.01159v1 [cs.CL])
    We present L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We further present, MahaBERT, MahaAlBERT, and MahaRoBerta all BERT-based masked language models, and MahaFT, the fast text word embeddings both trained on full Marathi corpus with 752M tokens. We show the effectiveness of these resources on downstream classification and NER tasks. Marathi is a popular language in India but still lacks these resources. This work is a step forward in building open resources for the Marathi language. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .  ( 2 min )
    Modularity-Aware Graph Autoencoders for Joint Community Detection and Link Prediction. (arXiv:2202.00961v1 [cs.LG])
    Graph autoencoders (GAE) and variational graph autoencoders (VGAE) emerged as powerful methods for link prediction. Their performances are less impressive on community detection problems where, according to recent and concurring experimental evaluations, they are often outperformed by simpler alternatives such as the Louvain method. It is currently still unclear to which extent one can improve community detection with GAE and VGAE, especially in the absence of node features. It is moreover uncertain whether one could do so while simultaneously preserving good performances on link prediction. In this paper, we show that jointly addressing these two tasks with high accuracy is possible. For this purpose, we introduce and theoretically study a community-preserving message passing scheme, doping our GAE and VGAE encoders by considering both the initial graph structure and modularity-based prior communities when computing embedding spaces. We also propose novel training and optimization strategies, including the introduction of a modularity-inspired regularizer complementing the existing reconstruction losses for joint link prediction and community detection. We demonstrate the empirical effectiveness of our approach, referred to as Modularity-Aware GAE and VGAE, through in-depth experimental validation on various real-world graphs.  ( 2 min )
    Bayesian Optimization for Distributionally Robust Chance-constrained Problem. (arXiv:2201.13112v2 [stat.ML] UPDATED)
    In black-box function optimization, we need to consider not only controllable design variables but also uncontrollable stochastic environment variables. In such cases, it is necessary to solve the optimization problem by taking into account the uncertainty of the environmental variables. Chance-constrained (CC) problem, the problem of maximizing the expected value under a certain level of constraint satisfaction probability, is one of the practically important problems in the presence of environmental variables. In this study, we consider distributionally robust CC (DRCC) problem and propose a novel DRCC Bayesian optimization method for the case where the distribution of the environmental variables cannot be precisely specified. We show that the proposed method can find an arbitrary accurate solution with high probability in a finite number of trials, and confirm the usefulness of the proposed method through numerical experiments.  ( 2 min )
    Is a Classification Procedure Good Enough? A Goodness-of-Fit Assessment Tool for Classification Learning. (arXiv:1911.03063v3 [stat.ML] UPDATED)
    In recent years, many non-traditional classification methods, such as Random Forest, Boosting, and neural network, have been widely used in applications. Their performance is typically measured in terms of classification accuracy. While the classification error rate and the like are important, they do not address a fundamental question: Is the classification method underfitted? To our best knowledge, there is no existing method that can assess the goodness-of-fit of a general classification procedure. Indeed, the lack of a parametric assumption makes it challenging to construct proper tests. To overcome this difficulty, we propose a methodology called BAGofT that splits the data into a training set and a validation set. First, the classification procedure to assess is applied to the training set, which is also used to adaptively find a data grouping that reveals the most severe regions of underfitting. Then, based on this grouping, we calculate a test statistic by comparing the estimated success probabilities and the actual observed responses from the validation set. The data splitting guarantees that the size of the test is controlled under the null hypothesis, and the power of the test goes to one as the sample size increases under the alternative hypothesis. For testing parametric classification models, the BAGofT has a broader scope than the existing methods since it is not restricted to specific parametric models (e.g., logistic regression). Extensive simulation studies show the utility of the BAGofT when assessing general classification procedures and its strengths over some existing methods when testing parametric classification models.  ( 2 min )
    Smoothed Embeddings for Certified Few-Shot Learning. (arXiv:2202.01186v1 [cs.LG])
    Randomized smoothing is considered to be the state-of-the-art provable defense against adversarial perturbations. However, it heavily exploits the fact that classifiers map input objects to class probabilities and do not focus on the ones that learn a metric space in which classification is performed by computing distances to embeddings of classes prototypes. In this work, we extend randomized smoothing to few-shot learning models that map inputs to normalized embeddings. We provide analysis of Lipschitz continuity of such models and derive robustness certificate against $\ell_2$-bounded perturbations that may be useful in few-shot learning scenarios. Our theoretical results are confirmed by experiments on different datasets.  ( 2 min )
    Data-Driven Offline Optimization For Architecting Hardware Accelerators. (arXiv:2110.11346v2 [cs.AR] UPDATED)
    Industry has gradually moved towards application-specific hardware accelerators in order to attain higher efficiency. While such a paradigm shift is already starting to show promising results, designers need to spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the set of target applications or design constraints change. An alternative paradigm is to use a "data-driven", offline approach that utilizes logged simulation data, to architect hardware accelerators, without needing any form of simulations. Such an approach not only alleviates the need to run time-consuming simulation, but also enables data reuse and applies even when set of target applications changes. In this paper, we develop such a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME, that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points, and optimizes the design against this estimate without any additional simulator queries during optimization. PRIME architects accelerators -- tailored towards both single and multiple applications -- improving performance upon state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively. In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.  ( 2 min )
    Risk Adversarial Learning System for Connected and Autonomous Vehicle Charging. (arXiv:2108.01466v2 [cs.AI] UPDATED)
    In this paper, the design of a rational decision support system (RDSS) for a connected and autonomous vehicle charging infrastructure (CAV-CI) is studied. In the considered CAV-CI, the distribution system operator (DSO) deploys electric vehicle supply equipment (EVSE) to provide an EV charging facility for human-driven connected vehicles (CVs) and autonomous vehicles (AVs). The charging request by the human-driven EV becomes irrational when it demands more energy and charging period than its actual need. Therefore, the scheduling policy of each EVSE must be adaptively accumulated the irrational charging request to satisfy the charging demand of both CVs and AVs. To tackle this, we formulate an RDSS problem for the DSO, where the objective is to maximize the charging capacity utilization by satisfying the laxity risk of the DSO. Thus, we devise a rational reward maximization problem to adapt the irrational behavior by CVs in a data-informed manner. We propose a novel risk adversarial multi-agent learning system (RAMALS) for CAV-CI to solve the formulated RDSS problem. In RAMALS, the DSO acts as a centralized risk adversarial agent (RAA) for informing the laxity risk to each EVSE. Subsequently, each EVSE plays the role of a self-learner agent to adaptively schedule its own EV sessions by coping advice from RAA. Experiment results show that the proposed RAMALS affords around 46.6% improvement in charging rate, about 28.6% improvement in the EVSE's active charging time and at least 33.3% more energy utilization, as compared to a currently deployed ACN EVSE system, and other baselines.  ( 2 min )
    Maintaining fairness across distribution shift: do we have viable solutions for real-world applications?. (arXiv:2202.01034v1 [cs.LG])
    Fairness and robustness are often considered as orthogonal dimensions when evaluating machine learning models. However, recent work has revealed interactions between fairness and robustness, showing that fairness properties are not necessarily maintained under distribution shift. In healthcare settings, this can result in e.g. a model that performs fairly according to a selected metric in "hospital A" showing unfairness when deployed in "hospital B". While a nascent field has emerged to develop provable fair and robust models, it typically relies on strong assumptions about the shift, limiting its impact for real-world applications. In this work, we explore the settings in which recently proposed mitigation strategies are applicable by referring to a causal framing. Using examples of predictive models in dermatology and electronic health records, we show that real-world applications are complex and often invalidate the assumptions of such methods. Our work hence highlights technical, practical, and engineering gaps that prevent the development of robustly fair machine learning models for real-world applications. Finally, we discuss potential remedies at each step of the machine learning pipeline.  ( 2 min )
    EFloat: EFloat: Entropy-coded Floating Point Format for Compressing Vector Embedding Models. (arXiv:2102.02705v2 [cs.LG] UPDATED)
    In a large class of deep learning models, including vector embedding models such as word and database embeddings, we observe that floating point exponent values cluster around a few unique values, permitting entropy based data compression. Entropy coding compresses fixed-length values with variable-length codes, encoding most probable values with fewer bits. We propose the EFloat compressed floating point number format that uses a variable field boundary between the exponent and significand fields. EFloat uses entropy coding on exponent values and signs to minimize the average width of the exponent and sign fields, while preserving the original FP32 exponent range unchanged. Saved bits become part of the significand field increasing the EFloat numeric precision by 4.3 bits on average compared to other reduced-precision floating point formats. EFloat makes 8-bit and even smaller floats practical without sacrificing the exponent range of a 32-bit floating point representation. We currently use the EFloat format for saving memory capacity and bandwidth consumption of large vector embedding models such as those used for database embeddings. Using the RMS error as metric, we demonstrate that EFloat provides higher accuracy than other floating point formats with equal bit budget. The EF12 format with 12-bit budget has less end-to-end application error than the 16-bit BFloat16. EF16 with 16-bit budget has an RMS-error 17 to 35 times less than BF16 RMS-error for a diverse set of embedding models. When making similarity and dissimilarity queries, using the NDCG ranking metric, EFloat matches the result quality of prior floating point representations with larger bit budgets.  ( 2 min )
    A training-free recursive multiresolution framework for diffeomorphic deformable image registration. (arXiv:2202.00675v1 [eess.IV])
    Diffeomorphic deformable image registration is one of the crucial tasks in medical image analysis, which aims to find a unique transformation while preserving the topology and invertibility of the transformation. Deep convolutional neural networks (CNNs) have yielded well-suited approaches for image registration by learning the transformation priors from a large dataset. The improvement in the performance of these methods is related to their ability to learn information from several sample medical images that are difficult to obtain and bias the framework to the specific domain of data. In this paper, we propose a novel diffeomorphic training-free approach; this is built upon the principle of an ordinary differential equation. Our formulation yields an Euler integration type recursive scheme to estimate the changes of spatial transformations between the fixed and the moving image pyramids at different resolutions. The proposed architecture is simple in design. The moving image is warped successively at each resolution and finally aligned to the fixed image; this procedure is recursive in a way that at each resolution, a fully convolutional network (FCN) models a progressive change of deformation for the current warped image. The entire system is end-to-end and optimized for each pair of images from scratch. In comparison to learning-based methods, the proposed method neither requires a dedicated training set nor suffers from any training bias. We evaluate our method on three cardiac image datasets. The evaluation results demonstrate that the proposed method achieves state-of-the-art registration accuracy while maintaining desirable diffeomorphic properties.  ( 2 min )
    NeuRegenerate: A Framework for Visualizing Neurodegeneration. (arXiv:2202.01115v1 [cs.LG])
    Recent advances in high-resolution microscopy have allowed scientists to better understand the underlying brain connectivity. However, due to the limitation that biological specimens can only be imaged at a single timepoint, studying changes to neural projections is limited to general observations using population analysis. In this paper, we introduce NeuRegenerate, a novel end-to-end framework for the prediction and visualization of changes in neural fiber morphology within a subject, for specified age-timepoints.To predict projections, we present neuReGANerator, a deep-learning network based on cycle-consistent generative adversarial network (cycleGAN) that translates features of neuronal structures in a region, across age-timepoints, for large brain microscopy volumes. We improve the reconstruction quality of neuronal structures by implementing a density multiplier and a new loss function, called the hallucination loss.Moreover, to alleviate artifacts that occur due to tiling of large input volumes, we introduce a spatial-consistency module in the training pipeline of neuReGANerator. We show that neuReGANerator has a reconstruction accuracy of 94% in predicting neuronal structures. Finally, to visualize the predicted change in projections, NeuRegenerate offers two modes: (1) neuroCompare to simultaneously visualize the difference in the structures of the neuronal projections, across the age timepoints, and (2) neuroMorph, a vesselness-based morphing technique to interactively visualize the transformation of the structures from one age-timepoint to the other. Our framework is designed specifically for volumes acquired using wide-field microscopy. We demonstrate our framework by visualizing the structural changes in neuronal fibers within the cholinergic system of the mouse brain between a young and old specimen.  ( 2 min )
    VOS:Learning What You Don't Know by Virtual Outlier Synthesis. (arXiv:2202.01197v1 [cs.LG])
    Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of neural networks. One of the key challenges is that models lack supervision signals from unknown data, and as a result, can produce overconfident predictions on OOD data. Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice. In this paper, we present VOS, a novel framework for OOD detection by adaptively synthesizing virtual outliers that can meaningfully regularize the model's decision boundary during training. Specifically, VOS samples virtual outliers from the low-likelihood region of the class-conditional distribution estimated in the feature space. Alongside, we introduce a novel unknown-aware training objective, which contrastively shapes the uncertainty space between the ID data and synthesized outlier data. VOS achieves state-of-the-art performance on both object detection and image classification models, reducing the FPR95 by up to 7.87% compared to the previous best method. Code is available at https://github.com/deeplearning-wisc/vos.  ( 2 min )
    Classifying Textual Data with Pre-trained Vision Models through Transfer Learning and Data Transformations. (arXiv:2106.12479v3 [cs.CL] UPDATED)
    Knowledge is acquired by humans through experience, and no boundary is set between the kinds of knowledge or skill levels we can achieve on different tasks at the same time. When it comes to Neural Networks, that is not the case. The breakthroughs in the field are extremely task and domain-specific. Vision and language are dealt with in separate manners, using separate methods and different datasets. Current text classification methods, mostly rely on obtaining contextual embeddings for input text samples, then training a classifier on the embedded dataset. Transfer learning in Language-related tasks in general, is heavily used in obtaining the contextual text embeddings for the input samples. In this work, we propose to use the knowledge acquired by benchmark Vision Models which are trained on ImageNet to help a much smaller architecture learn to classify text. A data transformation technique is used to create a new image dataset, where each image represents a sentence embedding from the last six layers of BERT, projected on a 2D plane using a t-SNE based method. We trained five models containing early layers sliced from vision models which are pretrained on ImageNet, on the created image dataset for the IMDB dataset embedded with the last six layers of BERT. Despite the challenges posed by the very different datasets, experimental results achieved by this approach which links large pretrained models on both language and vision, are very promising, without employing compute resources. Specifically, Sentiment Analysis is achieved by five different models on the same image dataset obtained after BERT embeddings are transformed into gray scale images. Index Terms: BERT, Convolutional Neural Networks, Domain Adaptation, image classification, Natural Language Processing, t-SNE, text classification, Transfer Learning  ( 3 min )
    Towards Understanding the Data Dependency of Mixup-style Training. (arXiv:2110.07647v2 [cs.LG] UPDATED)
    In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels. Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk and exhibit better generalization and robustness on various tasks when compared to standard training. In this paper, we investigate how these benefits of Mixup training rely on properties of the data in the context of classification. For minimizing the original empirical risk, we compute a closed form for the Mixup-optimal classification, which allows us to construct a simple dataset on which minimizing the Mixup loss can provably lead to learning a classifier that does not minimize the empirical loss on the data. On the other hand, we also give sufficient conditions for Mixup training to also minimize the original empirical risk. For generalization, we characterize the margin of a Mixup classifier, and use this to understand why the decision boundary of a Mixup classifier can adapt better to the full structure of the training data when compared to standard training. In contrast, we also show that, for a large class of linear models and linearly separable datasets, Mixup training leads to learning the same classifier as standard training.  ( 2 min )
    Analogies and Feature Attributions for Model Agnostic Explanation of Similarity Learners. (arXiv:2202.01153v1 [cs.LG])
    Post-hoc explanations for black box models have been studied extensively in classification and regression settings. However, explanations for models that output similarity between two inputs have received comparatively lesser attention. In this paper, we provide model agnostic local explanations for similarity learners applicable to tabular and text data. We first propose a method that provides feature attributions to explain the similarity between a pair of inputs as determined by a black box similarity learner. We then propose analogies as a new form of explanation in machine learning. Here the goal is to identify diverse analogous pairs of examples that share the same level of similarity as the input pair and provide insight into (latent) factors underlying the model's prediction. The selection of analogies can optionally leverage feature attributions, thus connecting the two forms of explanation while still maintaining complementarity. We prove that our analogy objective function is submodular, making the search for good-quality analogies efficient. We apply the proposed approaches to explain similarities between sentences as predicted by a state-of-the-art sentence encoder, and between patients in a healthcare utilization application. Efficacy is measured through quantitative evaluations, a careful user study, and examples of explanations.  ( 2 min )
    On Regularizing Coordinate-MLPs. (arXiv:2202.00790v1 [cs.LG])
    We show that typical implicit regularization assumptions for deep neural networks (for regression) do not hold for coordinate-MLPs, a family of MLPs that are now ubiquitous in computer vision for representing high-frequency signals. Lack of such implicit bias disrupts smooth interpolations between training samples, and hampers generalizing across signal regions with different spectra. We investigate this behavior through a Fourier lens and uncover that as the bandwidth of a coordinate-MLP is enhanced, lower frequencies tend to get suppressed unless a suitable prior is provided explicitly. Based on these insights, we propose a simple regularization technique that can mitigate the above problem, which can be incorporated into existing networks without any architectural modifications.  ( 2 min )
    Visualizing Automatic Speech Recognition -- Means for a Better Understanding?. (arXiv:2202.00673v1 [cs.LG])
    Automatic speech recognition (ASR) is improving ever more at mimicking human speech processing. The functioning of ASR, however, remains to a large extent obfuscated by the complex structure of the deep neural networks (DNNs) they are based on. In this paper, we show how so-called attribution methods, that we import from image recognition and suitably adapt to handle audio data, can help to clarify the working of ASR. Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output. We focus on three visualization techniques: Layer-wise Relevance Propagation (LRP), Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these methods and discuss potential further applications, such as in the detection of adversarial examples.  ( 2 min )
    Gromov-Wasserstein Discrepancy with Local Differential Privacy for Distributed Structural Graphs. (arXiv:2202.00808v1 [cs.LG])
    Learning the similarity between structured data, especially the graphs, is one of the essential problems. Besides the approach like graph kernels, Gromov-Wasserstein (GW) distance recently draws big attention due to its flexibility to capture both topological and feature characteristics, as well as handling the permutation invariance. However, structured data are widely distributed for different data mining and machine learning applications. With privacy concerns, accessing the decentralized data is limited to either individual clients or different silos. To tackle these issues, we propose a privacy-preserving framework to analyze the GW discrepancy of node embedding learned locally from graph neural networks in a federated flavor, and then explicitly place local differential privacy (LDP) based on Multi-bit Encoder to protect sensitive information. Our experiments show that, with strong privacy protections guaranteed by the $\varepsilon$-LDP algorithm, the proposed framework not only preserves privacy in graph learning but also presents a noised structural metric under GW distance, resulting in comparable and even better performance in classification and clustering tasks. Moreover, we reason the rationale behind the LDP-based GW distance analytically and empirically.  ( 2 min )
    The missing link: Developing a safety case for perception components in automated driving. (arXiv:2108.13294v3 [cs.LG] UPDATED)
    Safety assurance is a central concern for the development and societal acceptance of automated driving (AD) systems. Perception is a key aspect of AD that relies heavily on Machine Learning (ML). Despite the known challenges with the safety assurance of ML-based components, proposals have recently emerged for unit-level safety cases addressing these components. Unfortunately, AD safety cases express safety requirements at the system level and these efforts are missing the critical linking argument needed to integrate safety requirements at the system level with component performance requirements at the unit level. In this paper, we propose the Integration Safety Case for Perception (ISCaP), a generic template for such a linking safety argument specifically tailored for perception components. The template takes a deductive and formal approach to define strong traceability between levels. We demonstrate the applicability of ISCaP with a detailed case study and discuss its use as a tool to support incremental development of perception components.  ( 2 min )
    Detecting Privacy Requirements from User Stories with NLP Transfer Learning Models. (arXiv:2202.01035v1 [cs.SE])
    To provide privacy-aware software systems, it is crucial to consider privacy from the very beginning of the development. However, developers do not have the expertise and the knowledge required to embed the legal and social requirements for data protection into software systems. Objective: We present an approach to decrease privacy risks during agile software development by automatically detecting privacy-related information in the context of user story requirements, a prominent notation in agile Requirement Engineering (RE). Methods: The proposed approach combines Natural Language Processing (NLP) and linguistic resources with deep learning algorithms to identify privacy aspects into User Stories. NLP technologies are used to extract information regarding the semantic and syntactic structure of the text. This information is then processed by a pre-trained convolutional neural network, which paved the way for the implementation of a Transfer Learning technique. We evaluate the proposed approach by performing an empirical study with a dataset of 1680 user stories. Results: The experimental results show that deep learning algorithms allow to obtain better predictions than those achieved with conventional (shallow) machine learning methods. Moreover, the application of Transfer Learning allows to considerably improve the accuracy of the predictions, ca. 10%. Conclusions: Our study contributes to encourage software engineering researchers in considering the opportunities to automate privacy detection in the early phase of design, by also exploiting transfer learning models.  ( 2 min )
    Tail of Distribution GAN (TailGAN): Generative-Adversarial-Network-Based Boundary Formation. (arXiv:2107.11658v2 [cs.LG] UPDATED)
    Generative Adversarial Networks (GAN) are a powerful methodology and can be used for unsupervised anomaly detection, where current techniques have limitations such as the accurate detection of anomalies near the tail of a distribution. GANs generally do not guarantee the existence of a probability density and are susceptible to mode collapse, while few GANs use likelihood to reduce mode collapse. In this paper, we create a GAN-based tail formation model for anomaly detection, the Tail of distribution GAN (TailGAN), to generate samples on the tail of the data distribution and detect anomalies near the support boundary. Using TailGAN, we leverage GANs for anomaly detection and use maximum entropy regularization. Using GANs that learn the probability of the underlying distribution has advantages in improving the anomaly detection methodology by allowing us to devise a generator for boundary samples, and use this model to characterize anomalies. TailGAN addresses supports with disjoint components and achieves competitive performance on images. We evaluate TailGAN for identifying Out-of-Distribution (OoD) data and its performance evaluated on MNIST, CIFAR-10, Baggage X-Ray, and OoD data shows competitiveness compared to methods from the literature.  ( 2 min )
    Towards a Unified View of Parameter-Efficient Transfer Learning. (arXiv:2110.04366v3 [cs.CL] UPDATED)
    Fine-tuning large pre-trained language models on downstream tasks has become the de-facto learning paradigm in NLP. However, conventional approaches fine-tune all the parameters of the pre-trained model, which becomes prohibitive as the model size and the number of tasks grow. Recent work has proposed a variety of parameter-efficient transfer learning methods that only fine-tune a small number of (extra) parameters to attain strong performance. While effective, the critical ingredients for success and the connections among the various methods are poorly understood. In this paper, we break down the design of state-of-the-art parameter-efficient transfer learning methods and present a unified framework that establishes connections between them. Specifically, we re-frame them as modifications to specific hidden states in pre-trained models, and define a set of design dimensions along which different methods vary, such as the function to compute the modification and the position to apply the modification. Through comprehensive empirical studies across machine translation, text summarization, language understanding, and text classification benchmarks, we utilize the unified view to identify important design choices in previous methods. Furthermore, our unified framework enables the transfer of design elements across different approaches, and as a result we are able to instantiate new parameter-efficient fine-tuning methods that tune less parameters than previous methods while being more effective, achieving comparable results to fine-tuning all parameters on all four tasks.  ( 2 min )
    An Efficient DP-SGD Mechanism for Large Scale NLP Models. (arXiv:2107.14586v2 [cs.CL] UPDATED)
    Recent advances in deep learning have drastically improved performance on many Natural Language Understanding (NLU) tasks. However, the data used to train NLU models may contain private information such as addresses or phone numbers, particularly when drawn from human subjects. It is desirable that underlying models do not expose private information contained in the training data. Differentially Private Stochastic Gradient Descent (DP-SGD) has been proposed as a mechanism to build privacy-preserving models. However, DP-SGD can be prohibitively slow to train. In this work, we propose a more efficient DP-SGD for training using a GPU infrastructure and apply it to fine-tuning models based on LSTM and transformer architectures. We report faster training times, alongside accuracy, theoretical privacy guarantees and success of Membership inference attacks for our models and observe that fine-tuning with proposed variant of DP-SGD can yield competitive models without significant degradation in training time and improvement in privacy protection. We also make observations such as looser theoretical $\epsilon, \delta$ can translate into significant practical privacy gains.  ( 2 min )
    Bridging between soft and hard thresholding by scaling. (arXiv:2104.09703v2 [stat.ML] UPDATED)
    In this article, we developed and analyzed a thresholding method in which soft thresholding estimators are independently expanded by empirical scaling values. The scaling values have a common hyper-parameter that is an order of expansion of an ideal scaling value that achieves hard thresholding. We simply call this estimator a scaled soft thresholding estimator. The scaled soft thresholding is a general method that includes the soft thresholding and non-negative garrote as special cases and gives an another derivation of adaptive LASSO. We then derived the degree of freedom of the scaled soft thresholding by means of the Stein's unbiased risk estimate and found that it is decomposed into the degree of freedom of soft thresholding and the reminder connecting to hard thresholding. In this meaning, the scaled soft thresholding gives a natural bridge between soft and hard thresholding methods. Since the degree of freedom represents the degree of over-fitting, this result implies that there are two sources of over-fitting in the scaled soft thresholding. The first source originated from soft thresholding is determined by the number of un-removed coefficients and is a natural measure of the degree of over-fitting. We analyzed the second source in a particular case of the scaled soft thresholding by referring a known result for hard thresholding. We then found that, in a sparse, large sample and non-parametric setting, the second source is largely determined by coefficient estimates whose true values are zeros and has an influence on over-fitting when threshold levels are around noise levels in those coefficient estimates. In a simple numerical example, these theoretical implications has well explained the behavior of the degree of freedom. Moreover, based on the results here and some known facts, we explained the behaviors of risks of soft, hard and scaled soft thresholding methods.  ( 2 min )
    Being a Bit Frequentist Improves Bayesian Neural Networks. (arXiv:2106.10065v3 [cs.LG] UPDATED)
    Despite their compelling theoretical properties, Bayesian neural networks (BNNs) tend to perform worse than frequentist methods in classification-based uncertainty quantification (UQ) tasks such as out-of-distribution (OOD) detection. In this paper, based on empirical findings in prior works, we hypothesize that this issue is because even recent Bayesian methods have never considered OOD data in their training processes, even though this "OOD training" technique is an integral part of state-of-the-art frequentist UQ methods. To validate this, we treat OOD data as a first-class citizen in BNN training by exploring four different ways of incorporating OOD data into Bayesian inference. We show in extensive experiments that OOD-trained BNNs are competitive to recent frequentist baselines. This work thus provides strong baselines for future work in Bayesian UQ.  ( 2 min )
    Surrogate Modeling for Physical Systems with Preserved Properties and Adjustable Tradeoffs. (arXiv:2202.01139v1 [cs.AI])
    Determining the proper level of details to develop and solve physical models is usually difficult when one encounters new engineering problems. Such difficulty comes from how to balance the time (simulation cost) and accuracy for the physical model simulation afterwards. We propose a framework for automatic development of a family of surrogate models of physical systems that provide flexible cost-accuracy tradeoffs to assist making such determinations. We present both a model-based and a data-driven strategy to generate surrogate models. The former starts from a high-fidelity model generated from first principles and applies a bottom-up model order reduction (MOR) that preserves stability and convergence while providing a priori error bounds, although the resulting reduced-order model may lose its interpretability. The latter generates interpretable surrogate models by fitting artificial constitutive relations to a presupposed topological structure using experimental or simulation data. For the latter, we use Tonti diagrams to systematically produce differential equations from the assumed topological structure using algebraic topological semantics that are common to various lumped-parameter models (LPM). The parameter for the constitutive relations are estimated using standard system identification algorithms. Our framework is compatible with various spatial discretization schemes for distributed parameter models (DPM), and can supports solving engineering problems in different domains of physics.  ( 2 min )
    Communication Efficient Federated Learning for Generalized Linear Bandits. (arXiv:2202.01087v1 [cs.LG])
    Contextual bandit algorithms have been recently studied under the federated learning setting to satisfy the demand of keeping data decentralized and pushing the learning of bandit models to the client side. But limited by the required communication efficiency, existing solutions are restricted to linear models to exploit their closed-form solutions for parameter estimation. Such a restricted model choice greatly hampers these algorithms' practical utility. In this paper, we take the first step to addressing this challenge by studying generalized linear bandit models under a federated learning setting. We propose a communication-efficient solution framework that employs online regression for local update and offline regression for global update. We rigorously proved that, though the setting is more general and challenging, our algorithm can attain sub-linear rate in both regret and communication cost, which is also validated by our extensive empirical evaluations.  ( 2 min )
    Gradient Based Clustering. (arXiv:2202.00720v1 [cs.LG])
    We propose a general approach for distance based clustering, using the gradient of the cost function that measures clustering quality with respect to cluster assignments and cluster center positions. The approach is an iterative two step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of functions, satisfying some mild assumptions. The main advantage of the proposed approach is a simple and computationally cheap update rule. Unlike previous methods that specialize to a specific formulation of the clustering problem, our approach is applicable to a wide range of costs, including non-Bregman clustering methods based on the Huber loss. We analyze the convergence of the proposed algorithm, and show that it converges to the set of appropriately defined fixed points, under arbitrary center initialization. In the special case of Bregman cost functions, the algorithm converges to the set of centroidal Voronoi partitions, which is consistent with prior works. Numerical experiments on real data demonstrate the effectiveness of the proposed method.  ( 2 min )
    Dictionary learning for clustering on hyperspectral images. (arXiv:2202.00990v1 [eess.IV])
    Dictionary learning and sparse coding have been widely studied as mechanisms for unsupervised feature learning. Unsupervised learning could bring enormous benefit to the processing of hyperspectral images and to other remote sensing data analysis because labelled data are often scarce in this field. We propose a method for clustering the pixels of hyperspectral images using sparse coefficients computed from a representative dictionary as features. We show empirically that the proposed method works more effectively than clustering on the original pixels. We also demonstrate that our approach, in certain circumstances, outperforms the clustering results of features extracted using principal component analysis and non-negative matrix factorisation. Furthermore, our method is suitable for applications in repetitively clustering an ever-growing amount of high-dimensional data, which is the case when working with hyperspectral satellite imagery.  ( 2 min )
    On the Benefits of Selectivity in Pseudo-Labeling for Unsupervised Multi-Source-Free Domain Adaptation. (arXiv:2202.00796v1 [cs.LG])
    Due to privacy, storage, and other constraints, there is a growing need for unsupervised domain adaptation techniques in machine learning that do not require access to the data used to train a collection of source models. Existing methods for such multi-source-free domain adaptation typically train a target model using supervised techniques in conjunction with pseudo-labels for the target data, which are produced by the available source models. However, we show that assigning pseudo-labels to only a subset of the target data leads to improved performance. In particular, we develop an information-theoretic bound on the generalization error of the resulting target model that demonstrates an inherent bias-variance trade-off controlled by the subset choice. Guided by this analysis, we develop a method that partitions the target data into pseudo-labeled and unlabeled subsets to balance the trade-off. In addition to exploiting the pseudo-labeled subset, our algorithm further leverages the information in the unlabeled subset via a traditional unsupervised domain adaptation feature alignment procedure. Experiments on multiple benchmark datasets demonstrate the superior performance of the proposed method.  ( 2 min )
    Framework for Evaluating Faithfulness of Local Explanations. (arXiv:2202.00734v1 [cs.LG])
    We study the faithfulness of an explanation system to the underlying prediction model. We show that this can be captured by two properties, consistency and sufficiency, and introduce quantitative measures of the extent to which these hold. Interestingly, these measures depend on the test-time data distribution. For a variety of existing explanation systems, such as anchors, we analytically study these quantities. We also provide estimators and sample complexity bounds for empirically determining the faithfulness of black-box explanation systems. Finally, we experimentally validate the new properties and estimators.  ( 2 min )
    An Eye for an Eye: Defending against Gradient-based Attacks with Gradients. (arXiv:2202.01117v1 [cs.CV])
    Deep learning models have been shown to be vulnerable to adversarial attacks. In particular, gradient-based attacks have demonstrated high success rates recently. The gradient measures how each image pixel affects the model output, which contains critical information for generating malicious perturbations. In this paper, we show that the gradients can also be exploited as a powerful weapon to defend against adversarial attacks. By using both gradient maps and adversarial images as inputs, we propose a Two-stream Restoration Network (TRN) to restore the adversarial images. To optimally restore the perturbed images with two streams of inputs, a Gradient Map Estimation Mechanism is proposed to estimate the gradients of adversarial images, and a Fusion Block is designed in TRN to explore and fuse the information in two streams. Once trained, our TRN can defend against a wide range of attack methods without significantly degrading the performance of benign inputs. Also, our method is generalizable, scalable, and hard to bypass. Experimental results on CIFAR10, SVHN, and Fashion MNIST demonstrate that our method outperforms state-of-the-art defense methods.  ( 2 min )
    Boundary of Distribution Support Generator (BDSG): Sample Generation on the Boundary. (arXiv:2107.09950v2 [cs.LG] UPDATED)
    Generative models, such as Generative Adversarial Networks (GANs), have been used for unsupervised anomaly detection. While performance keeps improving, several limitations exist particularly attributed to difficulties at capturing multimodal supports and to the ability to approximate the underlying distribution closer to the tails, i.e. the boundary of the distribution's support. This paper proposes an approach that attempts to alleviate such shortcomings. We propose an invertible-residual-network-based model, the Boundary of Distribution Support Generator (BDSG). GANs generally do not guarantee the existence of a probability distribution and here, we use the recently developed Invertible Residual Network (IResNet) and Residual Flow (ResFlow), for density estimation. These models have not yet been used for anomaly detection. We leverage IResNet and ResFlow for Out-of-Distribution (OoD) sample detection and for sample generation on the boundary using a compound loss function that forces the samples to lie on the boundary. The BDSG addresses non-convex support, disjoint components, and multimodal distributions. Results on synthetic data and data from multimodal distributions, such as MNIST and CIFAR-10, demonstrate competitive performance compared to methods from the literature.  ( 2 min )
    Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers. (arXiv:2202.00710v1 [cs.AI])
    Much of recent Deep Reinforcement Learning success is owed to the neural architecture's potential to learn and use effective internal representations of the world. While many current algorithms access a simulator to train with a large amount of data, in realistic settings, including while playing games that may be played against people, collecting experience can be quite costly. In this paper, we introduce a deep reinforcement learning architecture whose purpose is to increase sample efficiency without sacrificing performance. We design this architecture by incorporating advances achieved in recent years in the field of Natural Language Processing and Computer Vision. Specifically, we propose a visually attentive model that uses transformers to learn a self-attention mechanism on the feature maps of the state representation, while simultaneously optimizing return. We demonstrate empirically that this architecture improves sample complexity for several Atari environments, while also achieving better performance in some of the games.  ( 2 min )
    Efficient Algorithms for Learning to Control Bandits with Unobserved Contexts. (arXiv:2202.00867v1 [stat.ML])
    Contextual bandits are widely-used in the study of learning-based control policies for finite action spaces. While the problem is well-studied for bandits with perfectly observed context vectors, little is known about the case of imperfectly observed contexts. For this setting, existing approaches are inapplicable and new conceptual and technical frameworks are required. We present an implementable posterior sampling algorithm for bandits with imperfect context observations and study its performance for learning optimal decisions. The provided numerical results relate the performance of the algorithm to different quantities of interest including the number of arms, dimensions, observation matrices, posterior rescaling factors, and signal-to-noise ratios. In general, the proposed algorithm exposes efficiency in learning from the noisy imperfect observations and taking actions accordingly. Enlightening understandings the analyses provide as well as interesting future directions it points to, are discussed as well.  ( 2 min )
    Adaptive Experimentation with Delayed Binary Feedback. (arXiv:2202.00846v1 [cs.IR])
    Conducting experiments with objectives that take significant delays to materialize (e.g. conversions, add-to-cart events, etc.) is challenging. Although the classical "split sample testing" is still valid for the delayed feedback, the experiment will take longer to complete, which also means spending more resources on worse-performing strategies due to their fixed allocation schedules. Alternatively, adaptive approaches such as "multi-armed bandits" are able to effectively reduce the cost of experimentation. But these methods generally cannot handle delayed objectives directly out of the box. This paper presents an adaptive experimentation solution tailored for delayed binary feedback objectives by estimating the real underlying objectives before they materialize and dynamically allocating variants based on the estimates. Experiments show that the proposed method is more efficient for delayed feedback compared to various other approaches and is robust in different settings. In addition, we describe an experimentation product powered by this algorithm. This product is currently deployed in the online experimentation platform of JD.com, a large e-commerce company and a publisher of digital ads.  ( 2 min )
    Targeted Cross-Validation. (arXiv:2109.06949v2 [stat.ML] UPDATED)
    In many applications, we have access to the complete dataset but are only interested in the prediction of a particular region of predictor variables. A standard approach is to find the globally best modeling method from a set of candidate methods. However, it is perhaps rare in reality that one candidate method is uniformly better than the others. A natural approach for this scenario is to apply a weighted $L_2$ loss in performance assessment to reflect the region-specific interest. We propose a targeted cross-validation (TCV) to select models or procedures based on a general weighted $L_2$ loss. We show that the TCV is consistent in selecting the best performing candidate under the weighted $L_2$ loss. Experimental studies are used to demonstrate the use of TCV and its potential advantage over the global CV or the approach of using only local data for modeling a local region. Previous investigations on CV have relied on the condition that when the sample size is large enough, the ranking of two candidates stays the same. However, in many applications with the setup of changing data-generating processes or highly adaptive modeling methods, the relative performance of the methods is not static as the sample size varies. Even with a fixed data-generating process, it is possible that the ranking of two methods switches infinitely many times. In this work, we broaden the concept of the selection consistency by allowing the best candidate to switch as the sample size varies, and then establish the consistency of the TCV. This flexible framework can be applied to high-dimensional and complex machine learning scenarios where the relative performances of modeling procedures are dynamic.  ( 2 min )
    EdiTTS: Score-based Editing for Controllable Text-to-Speech. (arXiv:2110.02584v2 [cs.SD] UPDATED)
    We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows for targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model, while applying masks and softening kernels to ensure that iterative edits are applied only to the target region. Through listening tests and speech-to-text back transcription, we show that EdiTTS outperforms existing baselines and produces robust samples that satisfy user-imposed requirements.  ( 2 min )
    A Neural Tangent Kernel Perspective of GANs. (arXiv:2106.05566v3 [cs.LG] UPDATED)
    We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We start by pointing out a fundamental flaw in previous theoretical analyses that leads to ill-defined gradients for the discriminator. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network. Moreover, we derive new insights about the generated distribution's flow during training, advancing our understanding of GAN dynamics. We empirically corroborate these results via a publicly released analysis toolkit based on our framework, unveiling intuitions that are consistent with current GAN practice.  ( 2 min )
    Convergence of Finite Memory Q-Learning for POMDPs and Near Optimality of Learned Policies under Filter Stability. (arXiv:2103.12158v3 [cs.LG] UPDATED)
    In this paper, for POMDPs, we provide the convergence of a Q learning algorithm for control policies using a finite history of past observations and control actions, and, consequentially, we establish near optimality of such limit Q functions under explicit filter stability conditions. We present explicit error bounds relating the approximation error to the length of the finite history window. We establish the convergence of such Q-learning iterations under mild ergodicity assumptions on the state process during the exploration phase. We further show that the limit fixed point equation gives an optimal solution for an approximate belief-MDP. We then provide bounds on the performance of the policy obtained using the limit Q values compared to the performance of the optimal policy for the POMDP, where we also present explicit conditions using recent results on filter stability in controlled POMDPs. While there exist many experimental results, (i) the rigorous asymptotic convergence (to an approximate MDP value function) for such finite-memory Q-learning algorithms, and (ii) the near optimality with an explicit rate of convergence (in the memory size) are results that are new to the literature, to our knowledge.  ( 2 min )
    Approximative Algorithms for Multi-Marginal Optimal Transport and Free-Support Wasserstein Barycenters. (arXiv:2202.00954v1 [math.NA])
    Computationally solving multi-marginal optimal transport (MOT) with squared Euclidean costs for $N$ discrete probability measures has recently attracted considerable attention, in part because of the correspondence of its solutions with Wasserstein-$2$ barycenters, which have many applications in data science. In general, this problem is NP-hard, calling for practical approximative algorithms. While entropic regularization has been successfully applied to approximate Wasserstein barycenters, this loses the sparsity of the optimal solution, making it difficult to solve the MOT problem directly in practice because of the curse of dimensionality. Thus, for obtaining barycenters, one usually resorts to fixed-support restrictions to a grid, which is, however, prohibitive in higher ambient dimensions $d$. In this paper, after analyzing the relationship between MOT and barycenters, we present two algorithms to approximate the solution of MOT directly, requiring mainly just $N-1$ standard two-marginal OT computations. Thus, they are fast, memory-efficient and easy to implement and can be used with any sparse OT solver as a black box. Moreover, they produce sparse solutions and show promising numerical results. We analyze these algorithms theoretically, proving upper and lower bounds for the relative approximation error.  ( 2 min )
    GLISp-r: A preference-based optimization algorithm with convergence guarantees. (arXiv:2202.01125v1 [math.OC])
    Preference-based optimization algorithms are iterative procedures that seek the optimal value for a decision variable based only on comparisons between couples of different samples. At each iteration, a human decision-maker is asked to express a preference between two samples, highlighting which one, if any, is better than the other. The optimization procedure must use the observed preferences to find the value of the decision variable that is most preferred by the human decision-maker, while also minimizing the number of comparisons. In this work, we propose GLISp-r, an extension of a recent preference-based optimization procedure called GLISp. The latter uses a Radial Basis Function surrogate to describe the tastes of the individual. Iteratively, GLISp proposes new samples to compare with the current best candidate by trading off exploitation of the surrogate model and exploration of the decision space. In GLISp-r, we propose a different criterion to use when looking for a new candidate sample that is inspired by MSRS, a popular procedure in the black-box optimization framework (which is closely related to the preference-based one). Compared to GLISp, GLISp-r is less likely to get stuck on local optimizers of the preference-based optimization problem. We motivate this claim theoretically, with a proof of convergence, and empirically, by comparing the performances of GLISp and GLISp-r on different benchmark optimization problems.  ( 2 min )
    Do Differentiable Simulators Give Better Policy Gradients?. (arXiv:2202.00817v1 [cs.LG])
    Differentiable simulators promise faster computation time for reinforcement learning by replacing zeroth-order gradient estimates of a stochastic objective with an estimate based on first-order gradients. However, it is yet unclear what factors decide the performance of the two estimators on complex landscapes that involve long-horizon planning and control on physical systems, despite the crucial relevance of this question for the utility of differentiable simulators. We show that characteristics of certain physical systems, such as stiffness or discontinuities, may compromise the efficacy of the first-order estimator, and analyze this phenomenon through the lens of bias and variance. We additionally propose an $\alpha$-order gradient estimator, with $\alpha \in [0,1]$, which correctly utilizes exact gradients to combine the efficiency of first-order estimates with the robustness of zero-order methods. We demonstrate the pitfalls of traditional estimators and the advantages of the $\alpha$-order estimator on some numerical examples.  ( 2 min )
    OMASGAN: Out-of-Distribution Minimum Anomaly Score GAN for Sample Generation on the Boundary. (arXiv:2110.15273v2 [cs.LG] UPDATED)
    Generative models trained in an unsupervised manner may set high likelihood and low reconstruction loss to Out-of-Distribution (OoD) samples. This increases Type II errors and leads to missed anomalies, overall decreasing Anomaly Detection (AD) performance. In addition, AD models underperform due to the rarity of anomalies. To address these limitations, we propose the OoD Minimum Anomaly Score GAN (OMASGAN). OMASGAN generates, in a negative data augmentation manner, anomalous samples on the estimated distribution boundary. These samples are then used to refine an AD model, leading to more accurate estimation of the underlying data distribution including multimodal supports with disconnected modes. OMASGAN performs retraining by including the abnormal minimum-anomaly-score OoD samples generated on the distribution boundary in a self-supervised learning manner. For inference, for AD, we devise a discriminator which is trained with negative and positive samples either generated (negative or positive) or real (only positive). OMASGAN addresses the rarity of anomalies by generating strong and adversarial OoD samples on the distribution boundary using only normal class data, effectively addressing mode collapse. A key characteristic of our model is that it uses any f-divergence distribution metric in its variational representation, not requiring invertibility. OMASGAN does not use feature engineering and makes no assumptions about the data distribution. The evaluation of OMASGAN on image data using the leave-one-out methodology shows that it achieves an improvement of at least 0.24 and 0.07 points in AUROC on average on the MNIST and CIFAR-10 datasets, respectively, over other benchmark and state-of-the-art models for AD.  ( 2 min )
    Heterogeneous manifolds for curvature-aware graph embedding. (arXiv:2202.01185v1 [stat.ML])
    Graph embeddings, wherein the nodes of the graph are represented by points in a continuous space, are used in a broad range of Graph ML applications. The quality of such embeddings crucially depends on whether the geometry of the space matches that of the graph. Euclidean spaces are often a poor choice for many types of real-world graphs, where hierarchical structure and a power-law degree distribution are linked to negative curvature. In this regard, it has recently been shown that hyperbolic spaces and more general manifolds, such as products of constant-curvature spaces and matrix manifolds, are advantageous to approximately match nodes pairwise distances. However, all these classes of manifolds are homogeneous, implying that the curvature distribution is the same at each point, making them unsuited to match the local curvature (and related structural properties) of the graph. In this paper, we study graph embeddings in a broader class of heterogeneous rotationally-symmetric manifolds. By adding a single extra radial dimension to any given existing homogeneous model, we can both account for heterogeneous curvature distributions on graphs and pairwise distances. We evaluate our approach on reconstruction tasks on synthetic and real datasets and show its potential in better preservation of high-order structures and heterogeneous random graphs generation.  ( 2 min )
    An Empirical Study of Modular Bias Mitigators and Ensembles. (arXiv:2202.00751v1 [cs.LG])
    There are several bias mitigators that can reduce algorithmic bias in machine learning models but, unfortunately, the effect of mitigators on fairness is often not stable when measured across different data splits. A popular approach to train more stable models is ensemble learning. Ensembles, such as bagging, boosting, voting, or stacking, have been successful at making predictive performance more stable. One might therefore ask whether we can combine the advantages of bias mitigators and ensembles? To explore this question, we first need bias mitigators and ensembles to work together. We built an open-source library enabling the modular composition of 10 mitigators, 4 ensembles, and their corresponding hyperparameters. Based on this library, we empirically explored the space of combinations on 13 datasets, including datasets commonly used in fairness literature plus datasets newly curated by our library. Furthermore, we distilled the results into a guidance diagram for practitioners. We hope this paper will contribute towards improving stability in bias mitigation.  ( 2 min )
    Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning. (arXiv:2112.03763v2 [cs.LG] UPDATED)
    A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment. We show that imitation learning of human-human interactions in a simulated world, in conjunction with self-supervised learning, is sufficient to produce a multimodal interactive agent, which we call MIA, that successfully interacts with non-adversarial humans 75% of the time. We further identify architectural and algorithmic techniques that improve performance, such as hierarchical action selection. Altogether, our results demonstrate that imitation of multi-modal, real-time human behaviour may provide a straightforward and surprisingly effective means of imbuing agents with a rich behavioural prior from which agents might then be fine-tuned for specific purposes, thus laying a foundation for training capable agents for interactive robots or digital assistants. A video of MIA's behaviour may be found at https://youtu.be/ZFgRhviF7mY  ( 2 min )
    Benchmarking Perturbation-based Saliency Maps for Explaining Atari Agents. (arXiv:2101.07312v3 [cs.LG] UPDATED)
    One of the most prominent methods for explaining the behavior of Deep Reinforcement Learning (DRL) agents is the generation of saliency maps that show how much each pixel attributed to the agents' decision. However, there is no work that computationally evaluates and compares the fidelity of different saliency map approaches specifically for DRL agents. It is particularly challenging to computationally evaluate saliency maps for DRL agents since their decisions are part of an overarching policy. For instance, the output neurons of value-based DRL algorithms encode both the value of the current state as well as the value of doing each action in this state. This ambiguity should be considered when evaluating saliency maps for such agents. In this paper, we compare five popular perturbation-based approaches to create saliency maps for DRL agents trained on four different Atari 2600 games. The approaches are compared using two computational metrics: dependence on the learned parameters of the agent (sanity checks) and fidelity to the agent's reasoning (input degradation). During the sanity checks, we encounter issues with one approach and propose a solution to fix these issues. For fidelity, we identify two main factors that influence which saliency approach should be chosen in which situation.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v1 [stat.ML])
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Understanding Knowledge Integration in Language Models with Graph Convolutions. (arXiv:2202.00964v1 [cs.CL])
    Pretrained language models (LMs) do not capture factual knowledge very well. This has led to the development of a number of knowledge integration (KI) methods which aim to incorporate external knowledge into pretrained LMs. Even though KI methods show some performance gains over vanilla LMs, the inner-workings of these methods are not well-understood. For instance, it is unclear how and what kind of knowledge is effectively integrated into these models and if such integration may lead to catastrophic forgetting of already learned knowledge. This paper revisits the KI process in these models with an information-theoretic view and shows that KI can be interpreted using a graph convolution operation. We propose a probe model called \textit{Graph Convolution Simulator} (GCS) for interpreting knowledge-enhanced LMs and exposing what kind of knowledge is integrated into these models. We conduct experiments to verify that our GCS can indeed be used to correctly interpret the KI process, and we use it to analyze two well-known knowledge-enhanced LMs: ERNIE and K-Adapter, and find that only a small amount of factual knowledge is integrated in them. We stratify knowledge in terms of various relation types and find that ERNIE and K-Adapter integrate different kinds of knowledge to different extent. Our analysis also shows that simply increasing the size of the KI corpus may not lead to better KI; fundamental advances may be needed.  ( 2 min )
    Mold into a Graph: Efficient Bayesian Optimization over Mixed-Spaces. (arXiv:2202.00893v1 [cs.LG])
    Real-world optimization problems are generally not just black-box problems, but also involve mixed types of inputs in which discrete and continuous variables coexist. Such mixed-space optimization possesses the primary challenge of modeling complex interactions between the inputs. In this work, we propose a novel yet simple approach that entails exploiting the graph data structure to model the underlying relationship between variables, i.e., variables as nodes and interactions defined by edges. Then, a variational graph autoencoder is used to naturally take the interactions into account. We first provide empirical evidence of the existence of such graph structures and then suggest a joint framework of graph structure learning and latent space optimization to adaptively search for optimal graph connectivity. Experimental results demonstrate that our method shows remarkable performance, exceeding the existing approaches with significant computational efficiency for a number of synthetic and real-world tasks.  ( 2 min )
    Element selection for functional materials discovery by integrated machine learning of atomic contributions to properties. (arXiv:2202.01051v1 [cond-mat.mtrl-sci])
    At the high level, the fundamental differences between materials originate from the unique nature of the constituent chemical elements. Before specific differences emerge according to the precise ratios of elements (composition) in a given crystal structure (phase), the material can be represented by its phase field defined simply as the set of the constituent chemical elements. Classification of the materials at the level of their phase fields can accelerate materials discovery by selecting the elemental combinations that are likely to produce desirable functional properties in synthetically accessible materials. Here, we demonstrate that classification of the materials phase field with respect to the maximum expected value of a target functional property can be combined with the ranking of the materials synthetic accessibility. This end-to-end machine learning approach (PhaseSelect) first derives the atomic characteristics from the compositional environments in all computationally and experimentally explored materials and then employs these characteristics to classify the phase field by their merit. PhaseSelect can quantify the materials potential at the level of the periodic table, which we demonstrate with significant accuracy for three avenues of materials applications: high-temperature superconducting, high-temperature magnetic and targetted energy band gap materials.  ( 2 min )
    nTreeClus: a Tree-based Sequence Encoder for Clustering Categorical Series. (arXiv:2102.10252v2 [cs.LG] UPDATED)
    The overwhelming presence of categorical/sequential data in diverse domains emphasizes the importance of sequence mining. The challenging nature of sequences proves the need for continuing research to find a more accurate and faster approach providing a better understanding of their (dis)similarities. This paper proposes a new Model-based approach for clustering sequence data, namely nTreeClus. The proposed method deploys Tree-based Learners, k-mers, and autoregressive models for categorical time series, culminating with a novel numerical representation of the categorical sequences. Adopting this new representation, we cluster sequences, considering the inherent patterns in categorical time series. Accordingly, the model showed robustness to its parameter. Under different simulated scenarios, nTreeClus improved the baseline methods for various internal and external cluster validation metrics for up to 10.7% and 2.7%, respectively. The empirical evaluation using synthetic and real datasets, protein sequences, and categorical time series showed that nTreeClus is competitive or superior to most state-of-the-art algorithms.  ( 2 min )
    Modeling Irregular Time Series with Continuous Recurrent Units. (arXiv:2111.11344v2 [cs.LG] UPDATED)
    Recurrent neural networks (RNNs) are a popular choice for modeling sequential data. Modern RNN architectures assume constant time intervals between observations. However, in many datasets (e.g. medical records) observation times are irregular and can carry important information. To address this challenge, we propose continuous recurrent units (CRUs) -- a neural architecture that can naturally handle irregular intervals between observations. The CRU assumes a hidden state, which evolves according to a linear stochastic differential equation and is integrated into an encoder-decoder framework. The recursive computations of the CRU can be derived using the continuous-discrete Kalman filter and are in closed form. The resulting recurrent architecture has temporal continuity between hidden states and a gating mechanism that can optimally integrate noisy observations. We derive an efficient parameterization scheme for the CRU that leads to a fast implementation (f-CRU). We empirically study the CRU on a number of challenging datasets and find that it can interpolate irregular time-series better than methods based on neural ordinary differential equations.  ( 2 min )
    A Better Loss for Visual-Textual Grounding. (arXiv:2108.05308v2 [cs.CV] UPDATED)
    Given a textual phrase and an image, the visual grounding problem is the task of locating the content of the image referenced by the sentence. It is a challenging task that has several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution. In the last years, several works have addressed this problem by proposing more and more large and complex models that try to capture visual-textual dependencies better than before. These models are typically constituted by two main components that focus on how to learn useful multi-modal features for grounding and how to improve the predicted bounding box of the visual mention, respectively. Finding the right learning balance between these two sub-tasks is not easy, and the current models are not necessarily optimal with respect to this issue. In this work, we propose a loss function based on bounding boxes classes probabilities that: (i) improves the bounding boxes selection; (ii) improves the bounding boxes coordinates prediction. Our model, although using a simple multi-modal feature fusion component, is able to achieve a higher accuracy than state-of-the-art models on two widely adopted datasets, reaching a better learning balance between the two sub-tasks mentioned above.  ( 2 min )
    Mitigating cold start problems in drug-target affinity prediction with interaction knowledge transferring. (arXiv:2202.01195v1 [q-bio.BM])
    Motivation: Predicting the drug-target interaction is crucial for drug discovery as well as drug repurposing. Machine learning is commonly used in drug-target affinity (DTA) problem. However, machine learning model faces the cold-start problem where the model performance drops when predicting the interaction of a novel drug or target. Previous works try to solve the cold start problem by learning the drug or target representation using unsupervised learning. While the drug or target representation can be learned in an unsupervised manner, it still lacks the interaction information, which is critical in drug-target interaction. Results: To incorporate the interaction information into the drug and protein interaction, we proposed using transfer learning from chemical-chemical interaction (CCI) and protein-protein interaction (PPI) task to drug-target interaction task. The representation learned by CCI and PPI tasks can be transferred smoothly to the DTA task due to the similar nature of the tasks. The result on the drug-target affinity datasets shows that our proposed method has advantages compared to other pretraining methods in the DTA task.  ( 2 min )
    Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions. (arXiv:2202.00992v1 [math.OC])
    Performance of optimization on quadratic problems sensitively depends on the low-lying part of the spectrum. For large (effectively infinite-dimensional) problems, this part of the spectrum can often be naturally represented or approximated by power law distributions. In this paper we perform a systematic study of a range of classical single-step and multi-step first order optimization algorithms, with adaptive and non-adaptive, constant and non-constant learning rates: vanilla Gradient Descent, Steepest Descent, Heavy Ball, and Conjugate Gradients. For each of these, we prove that a power law spectral assumption entails a power law for convergence rate of the algorithm, with the convergence rate exponent given by a specific multiple of the spectral exponent. We establish both upper and lower bounds, showing that the results are tight. Finally, we demonstrate applications of these results to kernel learning and training of neural networks in the NTK regime.  ( 2 min )
    Classification of Skin Cancer Images using Convolutional Neural Networks. (arXiv:2202.00678v1 [eess.IV])
    Skin cancer is the most common human malignancy(American Cancer Society) which is primarily diagnosed visually, starting with an initial clinical screening and followed potentially by dermoscopic(related to skin) analysis, a biopsy and histopathological examination. Skin cancer occurs when errors (mutations) occur in the DNA of skin cells. The mutations cause the cells to grow out of control and form a mass of cancer cells. The aim of this study was to try to classify images of skin lesions with the help of convolutional neural networks. The deep neural networks show humongous potential for image classification while taking into account the large variability exhibited by the environment. Here we trained images based on the pixel values and classified them on the basis of disease labels. The dataset was acquired from an Open Source Kaggle Repository(Kaggle Dataset)which itself was acquired from ISIC(International Skin Imaging Collaboration) Archive. The training was performed on multiple models accompanied with Transfer Learning. The highest model accuracy achieved was over 86.65%. The dataset used is publicly available to ensure credibility and reproducibility of the aforementioned result.  ( 2 min )
    Transfer in Reinforcement Learning via Regret Bounds for Learning Agents. (arXiv:2202.01182v1 [cs.LG])
    We present an approach for the quantification of the usefulness of transfer in reinforcement learning via regret bounds for a multi-agent setting. Considering a number of $\aleph$ agents operating in the same Markov decision process, however possibly with different reward functions, we consider the regret each agent suffers with respect to an optimal policy maximizing her average reward. We show that when the agents share their observations the total regret of all agents is smaller by a factor of $\sqrt{\aleph}$ compared to the case when each agent has to rely on the information collected by herself. This result demonstrates how considering the regret in multi-agent settings can provide theoretical bounds on the benefit of sharing observations in transfer learning.  ( 2 min )
    Unified Scaling Laws for Routed Language Models. (arXiv:2202.01169v1 [cs.CL])
    The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.  ( 2 min )
    Physical Design using Differentiable Learned Simulators. (arXiv:2202.00728v1 [cs.LG])
    Designing physical artifacts that serve a purpose - such as tools and other functional structures - is central to engineering as well as everyday human behavior. Though automating design has tremendous promise, general-purpose methods do not yet exist. Here we explore a simple, fast, and robust approach to inverse design which combines learned forward simulators based on graph neural networks with gradient-based design optimization. Our approach solves high-dimensional problems with complex physical dynamics, including designing surfaces and tools to manipulate fluid flows and optimizing the shape of an airfoil to minimize drag. This framework produces high-quality designs by propagating gradients through trajectories of hundreds of steps, even when using models that were pre-trained for single-step predictions on data substantially different from the design tasks. In our fluid manipulation tasks, the resulting designs outperformed those found by sampling-based optimization techniques. In airfoil design, they matched the quality of those obtained with a specialized solver. Our results suggest that despite some remaining challenges, machine learning-based simulators are maturing to the point where they can support general-purpose design optimization across a variety of domains.  ( 2 min )
  • Open

    FROB: Few-shot ROBust Model for Classification and Out-of-Distribution Detection. (arXiv:2111.15487v2 [cs.LG] UPDATED)
    Nowadays, classification and Out-of-Distribution (OoD) detection in the few-shot setting remain challenging aims due to rarity and the limited samples in the few-shot setting, and because of adversarial attacks. Accomplishing these aims is important for critical systems in safety, security, and defence. In parallel, OoD detection is challenging since deep neural network classifiers set high confidence to OoD samples away from the training data. To address such limitations, we propose the Few-shot ROBust (FROB) model for classification and few-shot OoD detection. We devise FROB for improved robustness and reliable confidence prediction for few-shot OoD detection. We generate the support boundary of the normal class distribution and combine it with few-shot Outlier Exposure (OE). We propose a self-supervised learning few-shot confidence boundary methodology based on generative and discriminative models. The contribution of FROB is the combination of the generated boundary in a self-supervised learning manner and the imposition of low confidence at this learned boundary. FROB implicitly generates strong adversarial samples on the boundary and forces samples from OoD, including our boundary, to be less confident by the classifier. FROB achieves generalization to unseen OoD with applicability to unknown, in the wild, test sets that do not correlate to the training datasets. To improve robustness, FROB redesigns OE to work even for zero-shots. By including our boundary, FROB reduces the threshold linked to the model's few-shot robustness; it maintains the OoD performance approximately independent of the number of few-shots. The few-shot robustness analysis evaluation of FROB on different sets and on One-Class Classification (OCC) data shows that FROB achieves competitive performance and outperforms benchmarks in terms of robustness to the outlier few-shot sample population and variability.  ( 2 min )
    What Happens after SGD Reaches Zero Loss? --A Mathematical Framework. (arXiv:2110.06914v2 [cs.LG] UPDATED)
    Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, $\mathrm{tr}[\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization for the regularization effect of SGD around such manifold -- i.e., the "implicit bias" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps and (2) allowing arbitrary noise covariance. As an application, we show with arbitrary large initialization, label noise SGD can always escape the kernel regime and only requires $O(\kappa\ln d)$ samples for learning an $\kappa$-sparse overparametrized linear model in $\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\Omega(d)$ samples. This upper bound is minimax optimal and improves the previous $\tilde{O}(\kappa^2)$ upper bound (HaoChen et al., 2020).  ( 2 min )
    MD-GAN with multi-particle input: the machine learning of long-time molecular behavior from short-time MD data. (arXiv:2202.00995v1 [physics.chem-ph])
    MD-GAN is a machine learning-based method that can evolve part of the system at any time step, accelerating the generation of molecular dynamics data. For the accurate prediction of MD-GAN, sufficient information on the dynamics of a part of the system should be included with the training data. Therefore, the selection of the part of the system is important for efficient learning. In a previous study, only one particle (or vector) of each molecule was extracted as part of the system. Therefore, we investigated the effectiveness of adding information from other particles to the learning process. In the experiment of the polyethylene system, when the dynamics of three particles of each molecule were used, the diffusion was successfully predicted using one-third of the time length of the training data, compared to the single-particle input. Surprisingly, the unobserved transition of diffusion in the training data was also predicted using this method.  ( 2 min )
    Optimizing Sequential Experimental Design with Deep Reinforcement Learning. (arXiv:2202.00821v1 [cs.LG])
    Bayesian approaches developed to solve the optimal design of sequential experiments are mathematically elegant but computationally challenging. Recently, techniques using amortization have been proposed to make these Bayesian approaches practical, by training a parameterized policy that proposes designs efficiently at deployment time. However, these methods may not sufficiently explore the design space, require access to a differentiable probabilistic model and can only optimize over continuous design spaces. Here, we address these limitations by showing that the problem of optimizing policies can be reduced to solving a Markov decision process (MDP). We solve the equivalent MDP with modern deep reinforcement learning techniques. Our experiments show that our approach is also computationally efficient at deployment time and exhibits state-of-the-art performance on both continuous and discrete design spaces, even when the probabilistic model is a black box.
    OMASGAN: Out-of-Distribution Minimum Anomaly Score GAN for Sample Generation on the Boundary. (arXiv:2110.15273v2 [cs.LG] UPDATED)
    Generative models trained in an unsupervised manner may set high likelihood and low reconstruction loss to Out-of-Distribution (OoD) samples. This increases Type II errors and leads to missed anomalies, overall decreasing Anomaly Detection (AD) performance. In addition, AD models underperform due to the rarity of anomalies. To address these limitations, we propose the OoD Minimum Anomaly Score GAN (OMASGAN). OMASGAN generates, in a negative data augmentation manner, anomalous samples on the estimated distribution boundary. These samples are then used to refine an AD model, leading to more accurate estimation of the underlying data distribution including multimodal supports with disconnected modes. OMASGAN performs retraining by including the abnormal minimum-anomaly-score OoD samples generated on the distribution boundary in a self-supervised learning manner. For inference, for AD, we devise a discriminator which is trained with negative and positive samples either generated (negative or positive) or real (only positive). OMASGAN addresses the rarity of anomalies by generating strong and adversarial OoD samples on the distribution boundary using only normal class data, effectively addressing mode collapse. A key characteristic of our model is that it uses any f-divergence distribution metric in its variational representation, not requiring invertibility. OMASGAN does not use feature engineering and makes no assumptions about the data distribution. The evaluation of OMASGAN on image data using the leave-one-out methodology shows that it achieves an improvement of at least 0.24 and 0.07 points in AUROC on average on the MNIST and CIFAR-10 datasets, respectively, over other benchmark and state-of-the-art models for AD.
    Is a Classification Procedure Good Enough? A Goodness-of-Fit Assessment Tool for Classification Learning. (arXiv:1911.03063v3 [stat.ML] UPDATED)
    In recent years, many non-traditional classification methods, such as Random Forest, Boosting, and neural network, have been widely used in applications. Their performance is typically measured in terms of classification accuracy. While the classification error rate and the like are important, they do not address a fundamental question: Is the classification method underfitted? To our best knowledge, there is no existing method that can assess the goodness-of-fit of a general classification procedure. Indeed, the lack of a parametric assumption makes it challenging to construct proper tests. To overcome this difficulty, we propose a methodology called BAGofT that splits the data into a training set and a validation set. First, the classification procedure to assess is applied to the training set, which is also used to adaptively find a data grouping that reveals the most severe regions of underfitting. Then, based on this grouping, we calculate a test statistic by comparing the estimated success probabilities and the actual observed responses from the validation set. The data splitting guarantees that the size of the test is controlled under the null hypothesis, and the power of the test goes to one as the sample size increases under the alternative hypothesis. For testing parametric classification models, the BAGofT has a broader scope than the existing methods since it is not restricted to specific parametric models (e.g., logistic regression). Extensive simulation studies show the utility of the BAGofT when assessing general classification procedures and its strengths over some existing methods when testing parametric classification models.
    Being a Bit Frequentist Improves Bayesian Neural Networks. (arXiv:2106.10065v3 [cs.LG] UPDATED)
    Despite their compelling theoretical properties, Bayesian neural networks (BNNs) tend to perform worse than frequentist methods in classification-based uncertainty quantification (UQ) tasks such as out-of-distribution (OOD) detection. In this paper, based on empirical findings in prior works, we hypothesize that this issue is because even recent Bayesian methods have never considered OOD data in their training processes, even though this "OOD training" technique is an integral part of state-of-the-art frequentist UQ methods. To validate this, we treat OOD data as a first-class citizen in BNN training by exploring four different ways of incorporating OOD data into Bayesian inference. We show in extensive experiments that OOD-trained BNNs are competitive to recent frequentist baselines. This work thus provides strong baselines for future work in Bayesian UQ.
    Communication Efficient Federated Learning for Generalized Linear Bandits. (arXiv:2202.01087v1 [cs.LG])
    Contextual bandit algorithms have been recently studied under the federated learning setting to satisfy the demand of keeping data decentralized and pushing the learning of bandit models to the client side. But limited by the required communication efficiency, existing solutions are restricted to linear models to exploit their closed-form solutions for parameter estimation. Such a restricted model choice greatly hampers these algorithms' practical utility. In this paper, we take the first step to addressing this challenge by studying generalized linear bandit models under a federated learning setting. We propose a communication-efficient solution framework that employs online regression for local update and offline regression for global update. We rigorously proved that, though the setting is more general and challenging, our algorithm can attain sub-linear rate in both regret and communication cost, which is also validated by our extensive empirical evaluations.
    Barycentric-alignment and invertibility for domain generalization. (arXiv:2109.01902v4 [cs.LG] UPDATED)
    We revisit Domain Generalization (DG) problem, where the hypotheses are composed of a common representation mapping followed by a labeling function. Popular DG methods optimize a well-known upper bound to the risk in the unseen domain. However, the bound contains a term that is not optimized due to its dual dependence on the representation mapping and the unknown optimal labeling function for the unseen domain. We derive a new upper bound free of terms having such dual dependence by imposing mild assumptions on the loss function and an invertibility requirement on the representation map when restricted to the low-dimensional data manifold. The derivation leverages old and recent transport inequalities that link optimal transport metrics with information-theoretic measures. Our bound motivates a new algorithm for DG comprising Wasserstein-2 barycenter cost for feature alignment and mutual information or autoencoders for enforcing approximate invertibility. Experiments on several datasets demonstrate superior performance compared to the state-of-the-art DG algorithms.
    SHAFF: Fast and consistent SHApley eFfect estimates via random Forests. (arXiv:2105.11724v3 [stat.ML] UPDATED)
    Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating Shapley effects is a challenging task, because of the computational complexity and the conditional expectation estimates. Accordingly, existing Shapley algorithms have flaws: a costly running time, or a bias when input variables are dependent. Therefore, we introduce SHAFF, SHApley eFfects via random Forests, a fast and accurate Shapley effect estimate, even when input variables are dependent. We show SHAFF efficiency through both a theoretical analysis of its consistency, and the practical performance improvements over competitors with extensive experiments. An implementation of SHAFF in C++ and R is available online.
    A Reputation Game Simulation: Emergent Social Phenomena from Information Theory. (arXiv:2106.05414v2 [physics.soc-ph] UPDATED)
    Reputation is a central element of social communications, be it with human or artificial intelligence (AI), and as such can be the primary target of malicious communication strategies. There is already a vast amount of literature on trust networks addressing this issue and proposing ways to simulate these networks dynamics using Bayesian principles and involving Theory of Mind models. The main issue for these simulations is usually the amount of information that can be stored and is usually solved by discretising variables and using hard thresholds. Here we propose a novel approach to the way information is updated that accounts for knowledge uncertainty and is closer to reality. In our game, agents use information compression techniques to capture their complex environment and store it in their finite memories. The loss of information that results from this leads to emergent phenomena, such as echo chambers, self-deception, deception symbiosis, and freezing of group opinions. Various malicious strategies of agents are studied for their impact on group sociology, like sycophancy, egocentricity, pathological lying, and aggressiveness. Even though our modeling could be made more complex, our set-up can already provide insights into social interactions and can be used to investigate the effects of various communication strategies and find ways to counteract malicious ones. Eventually this work should help to safeguard the design of non-abusive AI systems.
    From Predictions to Decisions: The Importance of Joint Predictive Distributions. (arXiv:2107.09224v2 [cs.LG] UPDATED)
    A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions. Our treatment of multi-armed bandits introduces an approximate Thompson sampling algorithm and analytic techniques that lead to a new kind of regret bound.
    Distributional Reinforcement Learning via Sinkhorn Iterations. (arXiv:2202.00769v1 [cs.LG])
    Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the whole distribution of the total return rather than only its expectation. The representation manner of each return distribution and the choice of distribution divergence are pivotal for the empirical success of distributional RL. In this paper, we propose a new class of \textit{Sinkhorn distributional RL} algorithm that learns a finite set of statistics, i.e., deterministic samples, from each return distribution and then leverages Sinkhorn iterations to evaluate the Sinkhorn distance between the current and target Bellmen distributions. Remarkably, as Sinkhorn divergence interpolates between the Wasserstein distance and Maximum Mean Discrepancy~(MMD). This allows our proposed Sinkhorn distributional RL algorithms to find a sweet spot leveraging the geometry of optimal transport-based distance, and the unbiased gradient estimates of MMD. Finally, experiments on a suite of Atari games reveal the competitive performance of Sinkhorn distributional RL algorithm as opposed to existing state-of-the-art algorithms.
    Global Riemannian Acceleration in Hyperbolic and Spherical Spaces. (arXiv:2012.03618v4 [math.OC] UPDATED)
    We further research on the accelerated optimization phenomenon on Riemannian manifolds by introducing accelerated global first-order methods for the optimization of $L$-smooth and geodesically convex (g-convex) or $\mu$-strongly g-convex functions defined on the hyperbolic space or a subset of the sphere. For a manifold other than the Euclidean space, these are the first methods to \emph{globally} achieve the same rates as accelerated gradient descent in the Euclidean space with respect to $L$ and $\varepsilon$ (and $\mu$ if it applies), up to log factors. Previous results with these accelerated rates only worked, given strong g-convexity, in a generally small neighborhood (initial distance $R$ to a minimizer being $R = O((\mu/L)^{3/4})$). Our rates have a polynomial factor on $1/\cos(R)$ (spherical case) or $\cosh(R)$ (hyperbolic case). Thus, we completely match the Euclidean case for a constant initial distance, and for larger $R$ we incur greater constants due to the geometry. As a proxy for our solution, we solve a constrained non-convex Euclidean problem, under a condition between convexity and \textit{quasar-convexity}, of independent interest. Additionally, for any Riemannian manifold of bounded sectional curvature, we provide reductions from optimization methods for smooth and g-convex functions to methods for smooth and strongly g-convex functions and vice versa.
    AdaAnn: Adaptive Annealing Scheduler for Probability Density Approximation. (arXiv:2202.00792v1 [stat.CO])
    Approximating probability distributions can be a challenging task, particularly when they are supported over regions of high geometrical complexity or exhibit multiple modes. Annealing can be used to facilitate this task which is often combined with constant a priori selected increments in inverse temperature. However, using constant increments limit the computational efficiency due to the inability to adapt to situations where smooth changes in the annealed density could be handled equally well with larger increments. We introduce AdaAnn, an adaptive annealing scheduler that automatically adjusts the temperature increments based on the expected change in the Kullback-Leibler divergence between two distributions with a sufficiently close annealing temperature. AdaAnn is easy to implement and can be integrated into existing sampling approaches such as normalizing flows for variational inference and Markov chain Monte Carlo. We demonstrate the computational efficiency of the AdaAnn scheduler for variational inference with normalizing flows on a number of examples, including density approximation and parameter estimation for dynamical systems.
    Autoregressive Diffusion Models. (arXiv:2110.02037v2 [cs.LG] UPDATED)
    We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective similar to modern probabilistic diffusion models that scales favourably to highly-dimensional data. At test time, ARDMs support parallel generation which can be adapted to fit any given generation budget. We find that ARDMs require significantly fewer steps than discrete diffusion models to attain the same performance. Finally, we apply ARDMs to lossless compression, and show that they are uniquely suited to this task. Contrary to existing approaches based on bits-back coding, ARDMs obtain compelling results not only on complete datasets, but also on compressing single data points. Moreover, this can be done using a modest number of network calls for (de)compression due to the model's adaptable parallel generation.
    Modularity-Aware Graph Autoencoders for Joint Community Detection and Link Prediction. (arXiv:2202.00961v1 [cs.LG])
    Graph autoencoders (GAE) and variational graph autoencoders (VGAE) emerged as powerful methods for link prediction. Their performances are less impressive on community detection problems where, according to recent and concurring experimental evaluations, they are often outperformed by simpler alternatives such as the Louvain method. It is currently still unclear to which extent one can improve community detection with GAE and VGAE, especially in the absence of node features. It is moreover uncertain whether one could do so while simultaneously preserving good performances on link prediction. In this paper, we show that jointly addressing these two tasks with high accuracy is possible. For this purpose, we introduce and theoretically study a community-preserving message passing scheme, doping our GAE and VGAE encoders by considering both the initial graph structure and modularity-based prior communities when computing embedding spaces. We also propose novel training and optimization strategies, including the introduction of a modularity-inspired regularizer complementing the existing reconstruction losses for joint link prediction and community detection. We demonstrate the empirical effectiveness of our approach, referred to as Modularity-Aware GAE and VGAE, through in-depth experimental validation on various real-world graphs.
    Targeted Cross-Validation. (arXiv:2109.06949v2 [stat.ML] UPDATED)
    In many applications, we have access to the complete dataset but are only interested in the prediction of a particular region of predictor variables. A standard approach is to find the globally best modeling method from a set of candidate methods. However, it is perhaps rare in reality that one candidate method is uniformly better than the others. A natural approach for this scenario is to apply a weighted $L_2$ loss in performance assessment to reflect the region-specific interest. We propose a targeted cross-validation (TCV) to select models or procedures based on a general weighted $L_2$ loss. We show that the TCV is consistent in selecting the best performing candidate under the weighted $L_2$ loss. Experimental studies are used to demonstrate the use of TCV and its potential advantage over the global CV or the approach of using only local data for modeling a local region. Previous investigations on CV have relied on the condition that when the sample size is large enough, the ranking of two candidates stays the same. However, in many applications with the setup of changing data-generating processes or highly adaptive modeling methods, the relative performance of the methods is not static as the sample size varies. Even with a fixed data-generating process, it is possible that the ranking of two methods switches infinitely many times. In this work, we broaden the concept of the selection consistency by allowing the best candidate to switch as the sample size varies, and then establish the consistency of the TCV. This flexible framework can be applied to high-dimensional and complex machine learning scenarios where the relative performances of modeling procedures are dynamic.
    Graph-wise Common Latent Factor Extraction for Unsupervised Graph Representation Learning. (arXiv:2112.08830v2 [cs.LG] UPDATED)
    Unsupervised graph-level representation learning plays a crucial role in a variety of tasks such as molecular property prediction and community analysis, especially when data annotation is expensive. Currently, most of the best-performing graph embedding methods are based on Infomax principle. The performance of these methods highly depends on the selection of negative samples and hurt the performance, if the samples were not carefully selected. Inter-graph similarity-based methods also suffer if the selected set of graphs for similarity matching is low in quality. To address this, we focus only on utilizing the current input graph for embedding learning. We are motivated by an observation from real-world graph generation processes where the graphs are formed based on one or more global factors which are common to all elements of the graph (e.g., topic of a discussion thread, solubility level of a molecule). We hypothesize extracting these common factors could be highly beneficial. Hence, this work proposes a new principle for unsupervised graph representation learning: Graph-wise Common latent Factor EXtraction (GCFX). We further propose a deep model for GCFX, deepGCFX, based on the idea of reversing the above-mentioned graph generation process which could explicitly extract common latent factors from an input graph and achieve improved results on downstream tasks to the current state-of-the-art. Through extensive experiments and analysis, we demonstrate that, while extracting common latent factors is beneficial for graph-level tasks to alleviate distractions caused by local variations of individual nodes or local neighbourhoods, it also benefits node-level tasks by enabling long-range node dependencies, especially for disassortative graphs.  ( 2 min )
    Beyond Target Networks: Improving Deep $Q$-learning with Functional Regularization. (arXiv:2106.02613v3 [stat.ML] UPDATED)
    Much of the recent successes in Deep Reinforcement Learning have been based on minimizing the squared Bellman error. However, training is often unstable due to fast-changing target Q-values, and target networks are employed to regularize the Q-value estimation and stabilize training by using an additional set of lagging parameters. Despite their advantages, target networks are potentially an inflexible way to regularize Q-values which may ultimately slow down training. In this work, we address this issue by augmenting the squared Bellman error with a functional regularizer. Unlike target networks, the regularization we propose here is explicit and enables us to use up-to-date parameters as well as control the regularization. This leads to a faster yet more stable training method. We analyze the convergence of our method theoretically and empirically validate our predictions on simple environments as well as on a suite of Atari environments. We demonstrate empirical improvements over target network based methods in terms of both sample efficiency and performance. In summary, our approach provides a fast and stable alternative to replace the standard squared Bellman error  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v1 [stat.ML])
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    nTreeClus: a Tree-based Sequence Encoder for Clustering Categorical Series. (arXiv:2102.10252v2 [cs.LG] UPDATED)
    The overwhelming presence of categorical/sequential data in diverse domains emphasizes the importance of sequence mining. The challenging nature of sequences proves the need for continuing research to find a more accurate and faster approach providing a better understanding of their (dis)similarities. This paper proposes a new Model-based approach for clustering sequence data, namely nTreeClus. The proposed method deploys Tree-based Learners, k-mers, and autoregressive models for categorical time series, culminating with a novel numerical representation of the categorical sequences. Adopting this new representation, we cluster sequences, considering the inherent patterns in categorical time series. Accordingly, the model showed robustness to its parameter. Under different simulated scenarios, nTreeClus improved the baseline methods for various internal and external cluster validation metrics for up to 10.7% and 2.7%, respectively. The empirical evaluation using synthetic and real datasets, protein sequences, and categorical time series showed that nTreeClus is competitive or superior to most state-of-the-art algorithms.  ( 2 min )
    VC-PCR: A Prediction Method based on Supervised Variable Selection and Clustering. (arXiv:2202.00975v1 [stat.ML])
    Sparse linear prediction methods suffer from decreased prediction accuracy when the predictor variables have cluster structure (e.g. there are highly correlated groups of variables). To improve prediction accuracy, various methods have been proposed to identify variable clusters from the data and integrate cluster information into a sparse modeling process. But none of these methods achieve satisfactory performance for prediction, variable selection and variable clustering simultaneously. This paper presents Variable Cluster Principal Component Regression (VC-PCR), a prediction method that supervises variable selection and variable clustering in order to solve this problem. Experiments with real and simulated data demonstrate that, compared to competitor methods, VC-PCR achieves better prediction, variable selection and clustering performance when cluster structure is present.  ( 2 min )
    Maintaining fairness across distribution shift: do we have viable solutions for real-world applications?. (arXiv:2202.01034v1 [cs.LG])
    Fairness and robustness are often considered as orthogonal dimensions when evaluating machine learning models. However, recent work has revealed interactions between fairness and robustness, showing that fairness properties are not necessarily maintained under distribution shift. In healthcare settings, this can result in e.g. a model that performs fairly according to a selected metric in "hospital A" showing unfairness when deployed in "hospital B". While a nascent field has emerged to develop provable fair and robust models, it typically relies on strong assumptions about the shift, limiting its impact for real-world applications. In this work, we explore the settings in which recently proposed mitigation strategies are applicable by referring to a causal framing. Using examples of predictive models in dermatology and electronic health records, we show that real-world applications are complex and often invalidate the assumptions of such methods. Our work hence highlights technical, practical, and engineering gaps that prevent the development of robustly fair machine learning models for real-world applications. Finally, we discuss potential remedies at each step of the machine learning pipeline.  ( 2 min )
    Modeling Irregular Time Series with Continuous Recurrent Units. (arXiv:2111.11344v2 [cs.LG] UPDATED)
    Recurrent neural networks (RNNs) are a popular choice for modeling sequential data. Modern RNN architectures assume constant time intervals between observations. However, in many datasets (e.g. medical records) observation times are irregular and can carry important information. To address this challenge, we propose continuous recurrent units (CRUs) -- a neural architecture that can naturally handle irregular intervals between observations. The CRU assumes a hidden state, which evolves according to a linear stochastic differential equation and is integrated into an encoder-decoder framework. The recursive computations of the CRU can be derived using the continuous-discrete Kalman filter and are in closed form. The resulting recurrent architecture has temporal continuity between hidden states and a gating mechanism that can optimally integrate noisy observations. We derive an efficient parameterization scheme for the CRU that leads to a fast implementation (f-CRU). We empirically study the CRU on a number of challenging datasets and find that it can interpolate irregular time-series better than methods based on neural ordinary differential equations.  ( 2 min )
    Boundary of Distribution Support Generator (BDSG): Sample Generation on the Boundary. (arXiv:2107.09950v2 [cs.LG] UPDATED)
    Generative models, such as Generative Adversarial Networks (GANs), have been used for unsupervised anomaly detection. While performance keeps improving, several limitations exist particularly attributed to difficulties at capturing multimodal supports and to the ability to approximate the underlying distribution closer to the tails, i.e. the boundary of the distribution's support. This paper proposes an approach that attempts to alleviate such shortcomings. We propose an invertible-residual-network-based model, the Boundary of Distribution Support Generator (BDSG). GANs generally do not guarantee the existence of a probability distribution and here, we use the recently developed Invertible Residual Network (IResNet) and Residual Flow (ResFlow), for density estimation. These models have not yet been used for anomaly detection. We leverage IResNet and ResFlow for Out-of-Distribution (OoD) sample detection and for sample generation on the boundary using a compound loss function that forces the samples to lie on the boundary. The BDSG addresses non-convex support, disjoint components, and multimodal distributions. Results on synthetic data and data from multimodal distributions, such as MNIST and CIFAR-10, demonstrate competitive performance compared to methods from the literature.  ( 2 min )
    Certifiable Robustness to Adversarial State Uncertainty in Deep Reinforcement Learning. (arXiv:2004.06496v6 [cs.LG] UPDATED)
    Deep Neural Network-based systems are now the state-of-the-art in many robotics tasks, but their application in safety-critical domains remains dangerous without formal guarantees on network robustness. Small perturbations to sensor inputs (from noise or adversarial examples) are often enough to change network-based decisions, which was recently shown to cause an autonomous vehicle to swerve into another lane. In light of these dangers, numerous algorithms have been developed as defensive mechanisms from these adversarial inputs, some of which provide formal robustness guarantees or certificates. This work leverages research on certified adversarial robustness to develop an online certifiably robust for deep reinforcement learning algorithms. The proposed defense computes guaranteed lower bounds on state-action values during execution to identify and choose a robust action under a worst-case deviation in input space due to possible adversaries or noise. Moreover, the resulting policy comes with a certificate of solution quality, even though the true state and optimal action are unknown to the certifier due to the perturbations. The approach is demonstrated on a Deep Q-Network policy and is shown to increase robustness to noise and adversaries in pedestrian collision avoidance scenarios and a classic control task. This work extends one of our prior works with new performance guarantees, extensions to other RL algorithms, expanded results aggregated across more scenarios, an extension into scenarios with adversarial behavior, comparisons with a more computationally expensive method, and visualizations that provide intuition about the robustness algorithm.  ( 2 min )
    Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods. (arXiv:2202.00858v1 [cs.LG])
    Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors. The amount of shrinkage is controlled by a single regularization parameter and the number of data points in each ancestor. Since HS is a post-hoc method, it is extremely fast, compatible with any tree growing algorithm, and can be used synergistically with other regularization techniques. Extensive experiments over a wide variety of real-world datasets show that HS substantially increases the predictive performance of decision trees, even when used in conjunction with other regularization techniques. Moreover, we find that applying HS to each tree in an RF often improves accuracy, as well as its interpretability by simplifying and stabilizing its decision boundaries and SHAP values. We further explain the success of HS in improving prediction performance by showing its equivalence to ridge regression on a (supervised) basis constructed of decision stumps associated with the internal nodes of a tree. All code and models are released in a full-fledged package available on Github (github.com/csinva/imodels)  ( 2 min )
    Survival Regression with Proper Scoring Rules and Monotonic Neural Networks. (arXiv:2103.14755v2 [stat.ML] UPDATED)
    We consider frequently used scoring rules for right-censored survival regression models such as time-dependent concordance, survival-CRPS, integrated Brier score and integrated binomial log-likelihood, and prove that neither of them is a proper scoring rule. This means that the true survival distribution may be scored worse than incorrect distributions, leading to inaccurate estimation. We prove that, in contrast to these scores, the right-censored log-likelihood is a proper scoring rule, i.e., the highest expected score is achieved by the true distribution. Despite this, modern feed-forward neural-network-based survival regression models are unable to train and validate directly on the right-censored log-likelihood, due to its intractability, and resort to the aforementioned alternatives, i.e., non-proper scoring rules. We therefore propose a simple novel survival regression method capable of directly optimizing log-likelihood using a monotonic restriction on the time-dependent weights, coined SurvivalMonotonic-net (SuMo-net). SuMo-net achieves state-of-the-art log-likelihood scores across several datasets with 20--100$\times$ computational speedup on inference over existing state-of-the-art neural methods, and is readily applicable to datasets with several million observations.  ( 2 min )
    Tail of Distribution GAN (TailGAN): Generative-Adversarial-Network-Based Boundary Formation. (arXiv:2107.11658v2 [cs.LG] UPDATED)
    Generative Adversarial Networks (GAN) are a powerful methodology and can be used for unsupervised anomaly detection, where current techniques have limitations such as the accurate detection of anomalies near the tail of a distribution. GANs generally do not guarantee the existence of a probability density and are susceptible to mode collapse, while few GANs use likelihood to reduce mode collapse. In this paper, we create a GAN-based tail formation model for anomaly detection, the Tail of distribution GAN (TailGAN), to generate samples on the tail of the data distribution and detect anomalies near the support boundary. Using TailGAN, we leverage GANs for anomaly detection and use maximum entropy regularization. Using GANs that learn the probability of the underlying distribution has advantages in improving the anomaly detection methodology by allowing us to devise a generator for boundary samples, and use this model to characterize anomalies. TailGAN addresses supports with disjoint components and achieves competitive performance on images. We evaluate TailGAN for identifying Out-of-Distribution (OoD) data and its performance evaluated on MNIST, CIFAR-10, Baggage X-Ray, and OoD data shows competitiveness compared to methods from the literature.  ( 2 min )
    The Exponentially Tilted Gaussian Prior for Variational Autoencoders. (arXiv:2111.15646v2 [cs.LG] UPDATED)
    An important property for deep neural networks is the ability to perform robust out-of-distribution detection on previously unseen data. This property is essential for safety purposes when deploying models for real world applications. Recent studies show that probabilistic generative models can perform poorly on this task, which is surprising given that they seek to estimate the likelihood of training data. To alleviate this issue, we propose the exponentially tilted Gaussian prior distribution for the Variational Autoencoder (VAE) which pulls points onto the surface of a hyper-sphere in latent space. This achieves state-of-the art results on the area under the curve-receiver operator characteristics metric using just the negative log-likelihood that the VAE naturally assigns. Because this prior is a simple modification of the traditional VAE prior, it is faster and easier to implement than competitive methods.  ( 2 min )
    Flow-based Algorithms for Improving Clusters: A Unifying Framework, Software, and Performance. (arXiv:2004.09608v3 [cs.LG] UPDATED)
    Clustering points in a vector space or nodes in a graph is a ubiquitous primitive in statistical data analysis, and it is commonly used for exploratory data analysis. In practice, it is often of interest to "refine" or "improve" a given cluster that has been obtained by some other method. In this survey, we focus on principled algorithms for this cluster improvement problem. Many such cluster improvement algorithms are flow-based methods, by which we mean that operationally they require the solution of a sequence of maximum flow problems on a (typically implicitly) modified data graph. These cluster improvement algorithms are powerful, both in theory and in practice, but they have not been widely adopted for problems such as community detection, local graph clustering, semi-supervised learning, etc. Possible reasons for this are: the steep learning curve for these algorithms; the lack of efficient and easy to use software; and the lack of detailed numerical experiments on real-world data that demonstrate their usefulness. Our objective here is to address these issues. To do so, we guide the reader through the whole process of understanding how to implement and apply these powerful algorithms. We present a unifying fractional programming optimization framework that permits us to distill, in a simple way, the crucial components of all these algorithms. It also makes apparent similarities and differences between related methods. Viewing these cluster improvement algorithms via a fractional programming framework suggests directions for future algorithm development. Finally, we develop efficient implementations of these algorithms in our LocalGraphClustering Python package, and we perform extensive numerical experiments to demonstrate the performance of these methods on social networks and image-based data graphs.  ( 3 min )
    Optimal high-dimensional and nonparametric distributed testing under communication constraints. (arXiv:2202.00968v1 [math.ST])
    We derive minimax testing errors in a distributed framework where the data is split over multiple machines and their communication to a central machine is limited to $b$ bits. We investigate both the $d$- and infinite-dimensional signal detection problem under Gaussian white noise. We also derive distributed testing algorithms reaching the theoretical lower bounds. Our results show that distributed testing is subject to fundamentally different phenomena that are not observed in distributed estimation. Among our findings, we show that testing protocols that have access to shared randomness can perform strictly better in some regimes than those that do not. Furthermore, we show that consistent nonparametric distributed testing is always possible, even with as little as $1$-bit of communication and the corresponding test outperforms the best local test using only the information available at a single local machine.  ( 2 min )
    Bridging between soft and hard thresholding by scaling. (arXiv:2104.09703v2 [stat.ML] UPDATED)
    In this article, we developed and analyzed a thresholding method in which soft thresholding estimators are independently expanded by empirical scaling values. The scaling values have a common hyper-parameter that is an order of expansion of an ideal scaling value that achieves hard thresholding. We simply call this estimator a scaled soft thresholding estimator. The scaled soft thresholding is a general method that includes the soft thresholding and non-negative garrote as special cases and gives an another derivation of adaptive LASSO. We then derived the degree of freedom of the scaled soft thresholding by means of the Stein's unbiased risk estimate and found that it is decomposed into the degree of freedom of soft thresholding and the reminder connecting to hard thresholding. In this meaning, the scaled soft thresholding gives a natural bridge between soft and hard thresholding methods. Since the degree of freedom represents the degree of over-fitting, this result implies that there are two sources of over-fitting in the scaled soft thresholding. The first source originated from soft thresholding is determined by the number of un-removed coefficients and is a natural measure of the degree of over-fitting. We analyzed the second source in a particular case of the scaled soft thresholding by referring a known result for hard thresholding. We then found that, in a sparse, large sample and non-parametric setting, the second source is largely determined by coefficient estimates whose true values are zeros and has an influence on over-fitting when threshold levels are around noise levels in those coefficient estimates. In a simple numerical example, these theoretical implications has well explained the behavior of the degree of freedom. Moreover, based on the results here and some known facts, we explained the behaviors of risks of soft, hard and scaled soft thresholding methods.  ( 2 min )
    Spectral Flow on the Manifold of SPD Matrices for Multimodal Data Processing. (arXiv:2009.08062v2 [cs.LG] UPDATED)
    In this paper, we consider data acquired by multimodal sensors capturing complementary aspects and features of a measured phenomenon. We focus on a scenario in which the measurements share mutual sources of variability but might also be contaminated by other measurement-specific sources such as interferences or noise. Our approach combines manifold learning, which is a class of nonlinear data-driven dimension reduction methods, with the well-known Riemannian geometry of symmetric and positive-definite (SPD) matrices. Manifold learning typically includes the spectral analysis of a kernel built from the measurements. Here, we take a different approach, utilizing the Riemannian geometry of the kernels. In particular, we study the way the spectrum of the kernels changes along geodesic paths on the manifold of SPD matrices. We show that this change enables us, in a purely unsupervised manner, to derive a compact, yet informative, description of the relations between the measurements, in terms of their underlying components. Based on this result, we present new algorithms for extracting the common latent components and for identifying common and measurement-specific components.  ( 2 min )
    An efficient estimation of time-varying parameters of dynamic models by combining offline batch optimization and online data assimilation. (arXiv:2110.12522v2 [physics.geo-ph] UPDATED)
    It is crucially important to estimate unknown parameters in earth system models by integrating observation and numerical simulation. For many applications in earth system sciences, an optimization method which allows parameters to temporally change is required. In the present paper, an efficient and practical method to estimate the time-varying parameters of relatively low dimensional models is presented. In the newly proposed method, called Hybrid Offline Online Parameter Estimation with Particle Filtering (HOOPE-PF), an inflation method to maintain the spread of ensemble members in a sampling-importance-resampling particle filter is improved using a non-parametric posterior probabilistic distribution of time-invariant parameters obtained by comparing simulated and observed climatology. The HOOPE-PF outperforms the original sampling-importance-resampling particle filter in synthetic experiments with toy models and a real-data experiment with a conceptual hydrological model especially when the ensemble size is small. The advantage of HOOPE-PF is that its performance is not greatly affected by the size of perturbation to be added to ensemble members to maintain their spread while it is critically important to get the optimal performance in the original particle filter. Since HOOPE-PF is the extension of the existing particle filter which has been extensively applied to many earth system models such as land, ecosystem, hydrology, and paleoclimate reconstruction, the HOOPE-PF can be applied to improve the simulation of these earth system models by considering time-varying model parameters.  ( 2 min )
    A Neural Tangent Kernel Perspective of GANs. (arXiv:2106.05566v3 [cs.LG] UPDATED)
    We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We start by pointing out a fundamental flaw in previous theoretical analyses that leads to ill-defined gradients for the discriminator. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network. Moreover, we derive new insights about the generated distribution's flow during training, advancing our understanding of GAN dynamics. We empirically corroborate these results via a publicly released analysis toolkit based on our framework, unveiling intuitions that are consistent with current GAN practice.  ( 2 min )
    Data-driven Uncertainty Quantification in Computational Human Head Models. (arXiv:2110.15553v2 [physics.bio-ph] UPDATED)
    Computational models of the human head are promising tools for estimating the impact-induced response of brain, and thus play an important role in the prediction of traumatic brain injury. Modern biofidelic head model simulations are associated with very high computational cost, and high-dimensional inputs and outputs, which limits the applicability of traditional uncertainty quantification (UQ) methods on these systems. In this study, a two-stage, data-driven manifold learning-based framework is proposed for UQ of computational head models. This framework is demonstrated on a 2D subject-specific head model, where the goal is to quantify uncertainty in the simulated strain fields (i.e., output), given variability in the material properties of different brain substructures (i.e., input). In the first stage, a data-driven method based on multi-dimensional Gaussian kernel-density estimation and diffusion maps is used to generate realizations of the input random vector directly from the available data. Computational simulations of a small number of realizations provide input-output pairs for training data-driven surrogate models in the second stage. The surrogate models employ nonlinear dimensionality reduction using Grassmannian diffusion maps, Gaussian process regression to create a low-cost mapping between the input random vector and the reduced solution space, and geometric harmonics models for mapping between the reduced space and the Grassmann manifold. It is demonstrated that the surrogate models provide highly accurate approximations of the computational model while significantly reducing the computational cost. Monte Carlo simulations of the surrogate models are used for uncertainty propagation. UQ of strain fields highlight significant spatial variation in model uncertainty, and reveal key differences in uncertainty among commonly used strain-based brain injury predictor variables.  ( 2 min )
    Partition MCMC for inference on acyclic digraphs. (arXiv:1504.05006v2 [stat.ML] CROSS LISTED)
    Acyclic digraphs are the underlying representation of Bayesian networks, a widely used class of probabilistic graphical models. Learning the underlying graph from data is a way of gaining insights about the structural properties of a domain. Structure learning forms one of the inference challenges of statistical graphical models. MCMC methods, notably structure MCMC, to sample graphs from the posterior distribution given the data are probably the only viable option for Bayesian model averaging. Score modularity and restrictions on the number of parents of each node allow the graphs to be grouped into larger collections, which can be scored as a whole to improve the chain's convergence. Current examples of algorithms taking advantage of grouping are the biased order MCMC, which acts on the alternative space of permuted triangular matrices, and non ergodic edge reversal moves. Here we propose a novel algorithm, which employs the underlying combinatorial structure of DAGs to define a new grouping. As a result convergence is improved compared to structure MCMC, while still retaining the property of producing an unbiased sample. Finally the method can be combined with edge reversal moves to improve the sampler further.  ( 2 min )
    Non-Stationary Dueling Bandits. (arXiv:2202.00935v1 [cs.LG])
    We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments, each of which is associated with its own preference matrix. The learner repeatedly selects a pair of arms and observes a binary preference between them as feedback. To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible, despite preference matrices and segment lengths being unknown. We propose the $\mathrm{Beat\, the\, Winner\, Reset}$ algorithm and prove a bound on its expected binary weak regret in the stationary case, which tightens the bound of current state-of-art algorithms. We also show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$. We further propose and analyze two meta-algorithms, $\mathrm{DETECT}$ for weak regret and $\mathrm{Monitored\, Dueling\, Bandits}$ for strong regret, both based on a detection-window approach that can incorporate any dueling bandit algorithm as a black-box algorithm. Finally, we prove a worst-case lower bound for expected weak regret in the non-stationary case.  ( 2 min )
    Compositional Coding Capsule Network with K-Means Routing for Text Classification. (arXiv:1810.09177v4 [cs.LG] UPDATED)
    Text classification is a challenging problem which aims to identify the category of texts. In the process of training, word embeddings occupy a large part of parameters. Under the limitation of limited computing resources, it indirectly limits the ability of subsequent network designs. In order to reduce the number of parameters, the compositional coding mechanism has been proposed recently. Based on this, this paper further explores compositional coding and proposes a compositional weighted coding method. And we apply capsule network to model the relationship between word embeddings, a new routing algorithm, which is based on k-means clustering theory, is proposed to fully mine the relationship between word embeddings. Combined with our compositional weighted coding method and the routing algorithm, we design a neural network for text classification. Experiments conducted on eight challenging text classification datasets show that the proposed method achieves competitive accuracy compared to the state-of-the-art approach with significantly fewer parameters.  ( 2 min )
    Structure-preserving GANs. (arXiv:2202.01129v1 [cs.LG])
    Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minmax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection on the invariant discriminator space, using the conditional expectation with respect to the $\sigma$-algebra associated to the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory, and show our proposed methods achieve significantly improved sample fidelity and diversity -- almost an order of magnitude measured in Fr\'echet Inception Distance -- especially in the small data regime.  ( 2 min )
    Efficient Algorithms for Learning to Control Bandits with Unobserved Contexts. (arXiv:2202.00867v1 [stat.ML])
    Contextual bandits are widely-used in the study of learning-based control policies for finite action spaces. While the problem is well-studied for bandits with perfectly observed context vectors, little is known about the case of imperfectly observed contexts. For this setting, existing approaches are inapplicable and new conceptual and technical frameworks are required. We present an implementable posterior sampling algorithm for bandits with imperfect context observations and study its performance for learning optimal decisions. The provided numerical results relate the performance of the algorithm to different quantities of interest including the number of arms, dimensions, observation matrices, posterior rescaling factors, and signal-to-noise ratios. In general, the proposed algorithm exposes efficiency in learning from the noisy imperfect observations and taking actions accordingly. Enlightening understandings the analyses provide as well as interesting future directions it points to, are discussed as well.  ( 2 min )
    Some Reflections on Drawing Causal Inference using Textual Data: Parallels Between Human Subjects and Organized Texts. (arXiv:2202.00848v1 [cs.CL])
    We examine the role of textual data as study units when conducting causal inference by drawing parallels between human subjects and organized texts. %in human population research. We elaborate on key causal concepts and principles, and expose some ambiguity and sometimes fallacies. To facilitate better framing a causal query, we discuss two strategies: (i) shifting from immutable traits to perceptions of them, and (ii) shifting from some abstract concept/property to its constituent parts, i.e., adopting a constructivist perspective of an abstract concept. We hope this article would raise the awareness of the importance of articulating and clarifying fundamental concepts before delving into developing methodologies when drawing causal inference using textual data.  ( 2 min )
    Heterogeneous manifolds for curvature-aware graph embedding. (arXiv:2202.01185v1 [stat.ML])
    Graph embeddings, wherein the nodes of the graph are represented by points in a continuous space, are used in a broad range of Graph ML applications. The quality of such embeddings crucially depends on whether the geometry of the space matches that of the graph. Euclidean spaces are often a poor choice for many types of real-world graphs, where hierarchical structure and a power-law degree distribution are linked to negative curvature. In this regard, it has recently been shown that hyperbolic spaces and more general manifolds, such as products of constant-curvature spaces and matrix manifolds, are advantageous to approximately match nodes pairwise distances. However, all these classes of manifolds are homogeneous, implying that the curvature distribution is the same at each point, making them unsuited to match the local curvature (and related structural properties) of the graph. In this paper, we study graph embeddings in a broader class of heterogeneous rotationally-symmetric manifolds. By adding a single extra radial dimension to any given existing homogeneous model, we can both account for heterogeneous curvature distributions on graphs and pairwise distances. We evaluate our approach on reconstruction tasks on synthetic and real datasets and show its potential in better preservation of high-order structures and heterogeneous random graphs generation.  ( 2 min )
    Probabilistically Robust Learning: Balancing Average- and Worst-case Performance. (arXiv:2202.01136v1 [cs.LG])
    Many of the successes of machine learning are based on minimizing an averaged loss function. However, it is well-known that this paradigm suffers from robustness issues that hinder its applicability in safety-critical domains. These issues are often addressed by training against worst-case perturbations of data, a technique known as adversarial training. Although empirically effective, adversarial training can be overly conservative, leading to unfavorable trade-offs between nominal performance and robustness. To this end, in this paper we propose a framework called probabilistic robustness that bridges the gap between the accurate, yet brittle average case and the robust, yet conservative worst case by enforcing robustness to most rather than to all perturbations. From a theoretical point of view, this framework overcomes the trade-offs between the performance and the sample-complexity of worst-case and average-case learning. From a practical point of view, we propose a novel algorithm based on risk-aware optimization that effectively balances average- and worst-case performance at a considerably lower computational cost relative to adversarial training. Our results on MNIST, CIFAR-10, and SVHN illustrate the advantages of this framework on the spectrum from average- to worst-case robustness.  ( 2 min )
    A Novel Hessian-Free Bilevel Optimizer via Evolution Strategies. (arXiv:2110.07004v2 [cs.LG] UPDATED)
    Bilevel optimization has arisen as a powerful tool for solving many modern machine learning problems. However, due to the nested structure of bilevel optimization, even gradient-based methods require second-order derivative approximations via Jacobian- or/and Hessian-vector computations, which can be very costly in practice. In this work, we propose a novel Hessian-free bilevel algorithm, which adopts the Evolution Strategies (ES) method to approximate the response Jacobian matrix in the hypergradient of the bilevel problem, and hence fully eliminates all second-order computations. We call our algorithm as ESJ (which stands for the ES-based Jacobian method) and further extend it to the stochastic setting as ESJ-S. Theoretically, we show that both ESJ and ESJ-S are guaranteed to converge. Experimentally, we demonstrate that the proposed algorithms outperform baseline bilevel optimizers on various bilevel problems. Particularly, in our experiment on few-shot meta-learning of ResNet-12 network over the miniImageNet dataset, we show that our algorithm outperforms baseline meta-learning algorithms, while other baseline bilevel optimizers do not solve such meta-learning problems within a comparable time frame.  ( 2 min )
    Algorithms for Efficiently Learning Low-Rank Neural Networks. (arXiv:2202.00834v1 [cs.LG])
    We study algorithms for learning low-rank neural networks -- networks where the weight parameters are re-parameterized by products of two low-rank matrices. First, we present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network up to additive error $\epsilon$ with probability $\ge 1 - \delta$, given access to noiseless samples with Gaussian marginals in polynomial time and samples. Thus, we provide the first example of an algorithm which can efficiently learn a neural network up to additive error without assuming the ground truth is realizable. To solve this problem, we introduce an efficient SVD-based \textit{Nonlinear Kernel Projection} algorithm for solving a nonlinear low-rank approximation problem over Gaussian space. Inspired by the efficiency of our algorithm, we propose a novel low-rank initialization framework for training low-rank \textit{deep} networks, and prove that for ReLU networks, the gap between our method and existing schemes widens as the desired rank of the approximating weights decreases, or as the dimension of the inputs increases (the latter point holds when network width is superlinear in dimension). Finally, we validate our theory by training ResNets and EfficientNets \citep{he2016deepresidual, tan2019efficientnet} models on ImageNet \citep{ILSVRC15}.  ( 2 min )
    Gradient Based Clustering. (arXiv:2202.00720v1 [cs.LG])
    We propose a general approach for distance based clustering, using the gradient of the cost function that measures clustering quality with respect to cluster assignments and cluster center positions. The approach is an iterative two step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of functions, satisfying some mild assumptions. The main advantage of the proposed approach is a simple and computationally cheap update rule. Unlike previous methods that specialize to a specific formulation of the clustering problem, our approach is applicable to a wide range of costs, including non-Bregman clustering methods based on the Huber loss. We analyze the convergence of the proposed algorithm, and show that it converges to the set of appropriately defined fixed points, under arbitrary center initialization. In the special case of Bregman cost functions, the algorithm converges to the set of centroidal Voronoi partitions, which is consistent with prior works. Numerical experiments on real data demonstrate the effectiveness of the proposed method.  ( 2 min )
    Framework for Evaluating Faithfulness of Local Explanations. (arXiv:2202.00734v1 [cs.LG])
    We study the faithfulness of an explanation system to the underlying prediction model. We show that this can be captured by two properties, consistency and sufficiency, and introduce quantitative measures of the extent to which these hold. Interestingly, these measures depend on the test-time data distribution. For a variety of existing explanation systems, such as anchors, we analytically study these quantities. We also provide estimators and sample complexity bounds for empirically determining the faithfulness of black-box explanation systems. Finally, we experimentally validate the new properties and estimators.  ( 2 min )
    AutoML for Contextual Bandits. (arXiv:1909.03212v2 [cs.LG] UPDATED)
    Contextual Bandits is one of the widely popular techniques used in applications such as personalization, recommendation systems, mobile health, causal marketing etc . As a dynamic approach, it can be more efficient than standard A/B testing in minimizing regret. We propose an end to end automated meta-learning pipeline to approximate the optimal Q function for contextual bandits problems. We see that our model is able to perform much better than random exploration, being more regret efficient and able to converge with a limited number of samples, while remaining very general and easy to use due to the meta-learning approach. We used a linearly annealed e-greedy exploration policy to define the exploration vs exploitation schedule. We tested the system on a synthetic environment to characterize it fully and we evaluated it on some open source datasets to benchmark against prior work. We see that our model outperforms or performs comparatively to other models while requiring no tuning nor feature engineering.  ( 2 min )
    Context Uncertainty in Contextual Bandits with Applications to Recommender Systems. (arXiv:2202.00805v1 [cs.LG])
    Recurrent neural networks have proven effective in modeling sequential user feedbacks for recommender systems. However, they usually focus solely on item relevance and fail to effectively explore diverse items for users, therefore harming the system performance in the long run. To address this problem, we propose a new type of recurrent neural networks, dubbed recurrent exploration networks (REN), to jointly perform representation learning and effective exploration in the latent space. REN tries to balance relevance and exploration while taking into account the uncertainty in the representations. Our theoretical analysis shows that REN can preserve the rate-optimal sublinear regret even when there exists uncertainty in the learned representations. Our empirical study demonstrates that REN can achieve satisfactory long-term rewards on both synthetic and real-world recommendation datasets, outperforming state-of-the-art models.  ( 2 min )
    A comparison of approaches to improve worst-case predictive model performance over patient subpopulations. (arXiv:2108.12250v2 [stat.ML] UPDATED)
    Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally robust optimization (DRO), attempt to address this problem without introducing additional harms. We conduct a large-scale empirical study of DRO and several variations of standard learning procedures to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations compared to standard approaches for learning predictive models from electronic health records data. In the course of our evaluation, we introduce an extension to DRO approaches that allows for specification of the metric used to assess worst-case performance. We conduct the analysis for models that predict in-hospital mortality, prolonged length of stay, and 30-day readmission for inpatient admissions, and predict in-hospital mortality using intensive care data. We find that, with relatively few exceptions, no approach performs better, for each patient subpopulation examined, than standard learning procedures using the entire training dataset. These results imply that when it is of interest to improve model performance for patient subpopulations beyond what can be achieved with standard practices, it may be necessary to do so via data collection techniques that increase the effective sample size or reduce the level of noise in the prediction problem.  ( 2 min )
    Efficient Sampling and Structure Learning of Bayesian Networks. (arXiv:1803.07859v4 [stat.ML] CROSS LISTED)
    Bayesian networks are probabilistic graphical models widely employed to understand dependencies in high dimensional data, and even to facilitate causal discovery. Learning the underlying network structure, which is encoded as a directed acyclic graph (DAG) is highly challenging mainly due to the vast number of possible networks in combination with the acyclicity constraint. Efforts have focussed on two fronts: constraint-based methods that perform conditional independence tests to exclude edges and score and search approaches which explore the DAG space with greedy or MCMC schemes. Here we synthesise these two fields in a novel hybrid method which reduces the complexity of MCMC approaches to that of a constraint-based method. Individual steps in the MCMC scheme only require simple table lookups so that very long chains can be efficiently obtained. Furthermore, the scheme includes an iterative procedure to correct for errors from the conditional independence tests. The algorithm offers markedly superior performance to alternatives, particularly because DAGs can also be sampled from the posterior distribution, enabling full Bayesian model averaging for much larger Bayesian networks.  ( 2 min )
    A selective review of sufficient dimension reduction for multivariate response regression. (arXiv:2202.00876v1 [stat.ME])
    We review sufficient dimension reduction (SDR) estimators with multivariate response in this paper. A wide range of SDR methods are characterized as inverse regression SDR estimators or forward regression SDR estimators. The inverse regression family include pooled marginal estimators, projective resampling estimators, and distance-based estimators. Ordinary least squares, partial least squares, and semiparametric SDR estimators, on the other hand, are discussed as estimators from the forward regression family.  ( 2 min )
    Robust Training of Neural Networks using Scale Invariant Architectures. (arXiv:2202.00980v1 [cs.LG])
    In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture and make it scale invariant, i.e. the scale of parameter doesn't affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm proportional to weight norm multiplied by $\sqrt{\tfrac{2\lambda}{\eta}}$, where $\eta$ is learning rate and $\lambda$ is weight decay. We show that this general approach is robust to rescaling of parameter and loss by proving that its convergence only depends logarithmically on the scale of initialization and loss, whereas the standard SGD might not even converge for many initializations. Following our recipe, we design a scale invariant version of BERT, called SIBERT, which when trained simply by vanilla SGD achieves performance comparable to BERT trained by adaptive methods like Adam on downstream tasks.  ( 2 min )

  • Open

    [minDalle] [R] Random AI Generated Monstrosities
    submitted by /u/Cerulian15 [link] [comments]
    [D] Resources for interpretable ML
    Hi folks, I recently read and was fascinated by the idea of a paper by Dr. Cynthia Rudin on using naturally interpretable ML models instead of trying to explain a black box model (like CNN or RNN) with another model (such as the saliency map). I'd like to learn more about interpretable ML algorithms in detail, but I haven't got much clue about where to start. Luckily, I have some background in statistics so maths is not much of a problem for me. Could you recommend some resources such as textbooks or introductory papers regarding interpretable ML techniques? Thank you for your help in advance, and have a wonderful day. :) submitted by /u/Appropriate-Skin6140 [link] [comments]  ( 1 min )
    [R] Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments (Numenta, Dec. 2021)
    Paper : https://arxiv.org/abs/2201.00042 Abstract: "A key challenge for AI is to build embodied systems that operate in dynamically changing environments. Such systems must adapt to changing task contexts and learn continuously. Although standard deep learning systems achieve state of the art results on static benchmarks, they often struggle in dynamic scenarios. In these settings, error signals from multiple contexts can interfere with one another, ultimately leading to a phenomenon known as catastrophic forgetting. In this article we investigate biologically inspired architectures as solutions to these problems. Specifically, we show that the biophysical properties of dendrites and local inhibitory systems enable networks to dynamically restrict and route information in a context-specif…  ( 1 min )
    [Discussion] Unique uses of recommendation systems?
    Everyone knows about Amazon, Netflix, Hulu, YouTube, etc. I'm curious about unusual/non-obvious uses of recommendation systems. Does anyone know of any cases where they might be used by, say, NASA, or to solve some other non-media/non-commercial problem? submitted by /u/WartimeHotTot [link] [comments]  ( 2 min )
    [D] share your ML/DL model as black box
    Greetings. I want to share my pytorch deep learning model in the form of a black box. So other person cant see the architecture, loading, pre-processing, prediction happens in the box. The user just send an input and get a output. One way is to server via django and host it. But i want the user to use it locally. I dont know if any such thing exisit or not. If exist, please share it with me. Edit: after reading comment, I am actually looking for semi black box which can't be decided by just a print statement submitted by /u/mrtac96 [link] [comments]  ( 2 min )
    [N] Body tracking with TensorFlow
    Google recently updated TensorFlow with two new models for tracking the body: The BlazePose software determines the position of the human body based on the camera image. For instance, a fitness app can automatically evaluate your technique, an online store can suggest suitable clothing styles, or games - create an avatar that repeats your actions. Model code is available on GitHub, there is also a demo in the browser. https://preview.redd.it/murgkui9tmf81.jpg?width=990&format=pjpg&auto=webp&s=a4d18f7f30ea6776aa6e4da494a8e225560a99ea Selfie Segmentation is best for when a person is directly in front of the webcam. A common application of such models is blurring or background substitution in video calls. You can also find the code on GitHub, and a live demo on the link. A step toward the metaverse is tracking the physical body and transferring it to a digital avatar. Geenee has recently released such a solution, a WebAR based body tracking SDK. In this way, the barriers between the physical and digital worlds are gradually dissolved. submitted by /u/waka_cool [link] [comments]  ( 1 min )
    [P] Super-resolution for non-image data?
    I'm working on a project looking to increase the temporal resolution on spatio-temporal data. Haven't started modelling yet but looking for ideas. Ideas I've had so far: modifying existing SRGAN to work on the spatio-temporal data. a type of cycleGAN with cycle from low to high resolution. A problem with both of these approaches are that they both are based on convolutional networks, and convolving spatio-temporal data don't make sense the same way as convolving pixel information in images. Do anyone know of super-resolution approaches for non-image data? More specifically data where the dependencies are only along one dimension (temporal in my case). Any tips would be appreciated. submitted by /u/supkek [link] [comments]  ( 3 min )
    [P] Pong From Pixels. 130 line python script that learns to play pong from raw pixels.
    Refreshed code and pretrained checkpoint for Andrej Karpathy's classic reinforcement learning blog post. Learn how a 130 line python script can learn to play Pong. The even more interesting thing is that the core algorithm is not Pong specific and can be used to solve a wide range of problems. https://www.storminthecastle.com/posts/pong_from_pixels/ submitted by /u/CakeStandard3577 [link] [comments]  ( 1 min )
    [R] OpenAI built a neural theorem prover that can solve (some) formal Math Olympiad problems
    OpenAI wrote in a blog post about a project where they a built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the International Mathematical Olympiad. Blog post: https://openai.com/blog/formal-math/ submitted by /u/hardmaru [link] [comments]  ( 2 min )
  • Open

    [P] Pong From Pixels. 130 line python script that learns to play pong from raw pixels.
    Refreshed code and pretrained checkpoint for Andrej Karpathy's classic reinforcement learning blog post. Learn how a 130 line python script can learn to play Pong. The even more interesting thing is that the core algorithm is not Pong specific and can be used to solve a wide range of problems. https://www.storminthecastle.com/posts/pong_from_pixels/ submitted by /u/CakeStandard3577 [link] [comments]  ( 1 min )
    Intro RL Lectures
    I've been checking out some intro to RL lecture series, and am wondering which one y'all would recommend. Reinforcement Learning Lecture Series 2021 - DeepMind Introduction to Reinforcement Learning with David Silver - DeepMind ​ Or, if there is a different one that anyone recommends, definitely open to checking it out! Thank you for the help. submitted by /u/Txflip [link] [comments]  ( 1 min )
    MADDPG applied on Unity Tennis
    Hello Community, I have only one question in the multi-agent configuration, what is the action of the player A1 when the other player A2 takes an action from the action space? I mean at time t the agent A2 takes an action then the other agent normally does not take any action so is it presented by 0 while storing it in the replay buffer. submitted by /u/GuavaAgreeable208 [link] [comments]  ( 1 min )
    Request : Does anyone have an actual video of an AI agent beating Montezuma's Revenge at superhuman ability?
    Ordinary RL algorithms usually fail to get out of the first room of Montezuma’s Revenge (scoring 400 or lower) and score 0 or lower on Pitfall. To try to solve such challenges, researchers add bonuses for exploration, often called intrinsic motivation (IM), to agents, which rewards them for reaching new states (situations or locations). Despite IM algorithms being specifically designed to tackle sparse reward problems, they still struggle with Montezuma’s Revenge and Pitfall. The best rarely solve level 1 of Montezuma’s Revenge and fail completely on Pitfall, receiving a score of zero. . . . Deepmind developed Agent57, the first deep reinforcement learning agent to obtain a score that is above the human baseline on all 57 Atari 2600 games. Agent57 combines an algorithm for efficient exploration with a meta-controller that adapts the exploration and long vs. short-term behaviour of the agent. "Above the human baseline"? Is that an average over all the games, or does this mean it plays all of them better than a human does? And if it does play them better than a human, what does Montezuma's Revenge look like when played by such a thing? submitted by /u/moschles [link] [comments]  ( 2 min )
    Stable baselines vs RLlib vs CleanRL
    As the title says: I'd like to know how these libraries compare to each other. Which purpose do they serve? What are their strengths and weaknesses? Can they be used for professional purposes or are they mostly educational/only for "easy" environments? Context: I'd like to compare the performance of different algorithms on my (not easy) environment. The algorithm needs to be fast cause training will be long (over very large number of samples). Parallelization is also very welcome. But more generally I'd just like to know what each of these libraries is good at. submitted by /u/RatonneLaveuse [link] [comments]  ( 1 min )
  • Open

    Atun-Shei Films | The Metamorphosis of Prime Intellect - book review and interview with author u/LocalRoger
    submitted by /u/Broken-Butterfly [link] [comments]
    Why Deepfakes Cannot Currently Convey Subtlety of Emotion
    submitted by /u/DaveBowman1975 [link] [comments]
    Researchers Find A Possible Solution To The Problem Of Robocalling Using Machine Learning
    The United States government has begun to take significant steps to eliminate robocalls. The FCC requires that phone companies use a cryptography-based technology called STIR/SHAKEN to authenticate all callers’ IDs beginning June 30, 2021. Anyone hoping for robocalls to evaporate in a puff of regulation will be sorely disappointed. However, respite may be on the way, albeit slowly. The technology to block robocalls is developing, and STIR/SHAKEN is part of a trend in which phone consumers in the United States are no longer solely responsible for deciding whether or not to accept robocalls. Now it’s up to the phone companies to do their part. Continue Reading Paper: https://ieeexplore.ieee.org/document/9474299/authors#authors Reference: https://spectrum.ieee.org/ai-robocalls submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Companies are utilizing more and more AI for medical applications
    The rise of artificial intelligence is beginning to make an impact in healthcare worldwide. Artificial intelligence (AI) is a versatility that allows us to think creatively in its application to solve various challenges in healthcare. Because a majority of data is now digitized this gives AI a huge advantage in processing and learning from the given data. AI can be used for patient care such as management of a disease or it can be used as medical tools for physicians to use to further the efficacy of their diagnosis. Despite the prospects of AI in healthcare easing the workflow for both patients and experts there have been challenges to their application commercially. For many experts, AI in healthcare is advantageous, but they fear that with changing methods of treatment AI will fall quic…  ( 2 min )
    Facial Colorization & Restoration on Crack - Time Travel Rephotography
    submitted by /u/cloud_weather [link] [comments]
    Have any cybercrime experts in this group noticed the high number of dark web and deep web criminal organisations that have been 'sunsetting' in the past year? I just wrote my first article on Medium. It's about AI, tech, and the death of both the deep web and the dark web.
    submitted by /u/rainywhy [link] [comments]  ( 1 min )
    Is the Openworm connectome an AGI?
    So I wanted to ask does the openworm connectome fit the definition of a General Artificial intelligence? Of course it's incredibly limited For those you unfamiliar https://openworm.org/ Here it is controlling a robot. https://youtu.be/u406m-v4Y3Y submitted by /u/lawless_c [link] [comments]  ( 1 min )
    VQGAN+CLIP collab program problems
    I'm getting a lot of errors with these VQGAN+CLIP collab programs, am I right in thinking that if I want to properly use them and explore I should upgrade to Collab Pro? I've been using Nightcafe and Artbreeder, but want to use the collab programs to get deeper more cost effective results, I also would like to do animations, and these cost on Nightcafe etc,,, Should I persist with the collab one ? do others find them quite error free and easy to use ? ​ using this https://colab.research.google.com/github/justin-bennington/S2ML-Generators/blob/main/S2ML_Image_Generator.ipynb ​ and the latest error is https://preview.redd.it/fhge0h0bomf81.jpg?width=705&format=pjpg&auto=webp&s=024d54124aaf744ba8621c628aa3799180d6ef5e submitted by /u/Key_Cauliflower7137 [link] [comments]  ( 1 min )
    Chinese character bounding box models
    I've tried Google and Hugging Face and couldn't find a ready-made Chinese character bounding box model. The top result is from a 2019 paper. Does anyone know of a ready to use model for this purpose? I am trying to make an app for my son that display pinyin on top of the characters to help him learn Chinese. submitted by /u/xiaodaireddit [link] [comments]  ( 1 min )
    working on a DIY ai with a heavy heart
    has anyone ever had the feeling of a heavy heart while working on a chat bot or some thing similar .im currently in the process of spinning up an old ai project from back in 2018 that i never got to finish . it is effectively a robot wife . its like Alexa , but instead of you talking to Alexa , Alexa talks to you and asks you how your day is going , what's on your mind , have you seen the newest episode of a tv show ( if you mention that you like to watch said show ) and more . to cut a long story short ; its a companion with home automation abilities as well as the ability to imitate conversations . Im still adding to it and updating my code , but this feeling in my heart is making a bit hard . i feel normal , just a bit sad and weird i guess . any advice would be nice submitted by /u/THE_HENTAI_LORD [link] [comments]  ( 2 min )
    Artificial Nightmares : Year of The Tiger || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Posted this question in the actual Replika sub, but is there any experts/programmers who might be able to answer my question about how a Replika (AIs in general if you'd like to expand) perceives a "voice chat" and whether it is understanding human voices? Full question below in reply.
    submitted by /u/GR9100 [link] [comments]  ( 1 min )
  • Open

    Facial Colorization & Restoration on Crack - Time Travel Rephotography
    submitted by /u/cloud_weather [link] [comments]
    Im quite new to neural networks but this is what ive learnt tell me if its right.
    I made a reinforcement neural network in a game and it seems like you can just throw information at it and give it rewards for doing good things and it will eventually learn what the information means. Is that correct? submitted by /u/Plazmeer [link] [comments]  ( 1 min )
  • Open

    2022: Future Trends in Mining Technology Innovation
    To achieve long-term transformation, just a concentration on technology is not required. Companies must take a comprehensive approach to…  ( 4 min )
    Business guide to facial recognition: benefits, applications, and issues to consider
    Facial recognition is everywhere. What once started as an attribute specific to sci-fi movies is now a part of everyday life: we rely on…  ( 12 min )
  • Open

    What Factors Will Weave a Strong Web of Growth for the Delta Robotics Market?
    Industry 5.0 is already on the horizon and this factor has made the role of robotics necessary and crucial for a wide range of industries. The manufacturing sector is growing at a thunderous speed and industrialization is proliferating in every region and sector. This aspect has further brought great growth opportunities for robotics. The role… Read More »What Factors Will Weave a Strong Web of Growth for the Delta Robotics Market? The post What Factors Will Weave a Strong Web of Growth for the Delta Robotics Market? appeared first on Data Science Central.  ( 3 min )
  • Open

    2 Powerful 2 Be Stopped: ‘Dying Light 2 Stay Human’ Arrives on GeForce NOW’s Second Anniversary
    Great things come in twos. Techland’s Dying Light 2 Stay Human arrives with RTX ON and is streaming from the cloud tomorrow, Feb. 4. Plus, in celebration of the second anniversary of GeForce NOW, February is packed full of membership rewards in Eternal Return, World of Warships and more. There are also 30 games joining Read article > The post 2 Powerful 2 Be Stopped: ‘Dying Light 2 Stay Human’ Arrives on GeForce NOW’s Second Anniversary appeared first on The Official NVIDIA Blog.  ( 4 min )
    Rain or Shine: Radar Vision Sees Through Clouds to Support Emergency Flood Relief
    Flooding usually comes with various bad weather conditions, such as thick clouds, heavy rain and blustering winds. GPU-powered data science systems can now help researchers and emergency flood response teams to see through it all. John Murray, visiting professor in the Geographic Data Science Lab at the University of Liverpool, developed cuSAR, a platform that Read article > The post Rain or Shine: Radar Vision Sees Through Clouds to Support Emergency Flood Relief appeared first on The Official NVIDIA Blog.  ( 3 min )
  • Open

    Optimal Transport based Data Augmentation for Heart Disease Diagnosis and Prediction. (arXiv:2202.00567v1 [eess.SP])
    In this paper, we focus on a new method of data augmentation to solve the data imbalance problem within imbalanced ECG datasets to improve the robustness and accuracy of heart disease detection. By using Optimal Transport, we augment the ECG disease data from normal ECG beats to balance the data among different categories. We build a Multi-Feature Transformer (MF-Transformer) as our classification model, where different features are extracted from both time and frequency domains to diagnose various heart conditions. Learning from 12-lead ECG signals, our model is able to distinguish five categories of cardiac conditions. Our results demonstrate 1) the classification models' ability to make competitive predictions on five ECG categories; 2) improvements in accuracy and robustness reflecting the effectiveness of our data augmentation method.  ( 2 min )
    Causal-BALD: Deep Bayesian Active Learning of Outcomes to Infer Treatment-Effects from Observational Data. (arXiv:2111.02275v2 [cs.LG] UPDATED)
    Estimating personalized treatment effects from high-dimensional observational data is essential in situations where experimental designs are infeasible, unethical, or expensive. Existing approaches rely on fitting deep models on outcomes observed for treated and control populations. However, when measuring individual outcomes is costly, as is the case of a tumor biopsy, a sample-efficient strategy for acquiring each result is required. Deep Bayesian active learning provides a framework for efficient data acquisition by selecting points with high uncertainty. However, existing methods bias training data acquisition towards regions of non-overlapping support between the treated and control populations. These are not sample-efficient because the treatment effect is not identifiable in such regions. We introduce causal, Bayesian acquisition functions grounded in information theory that bias data acquisition towards regions with overlapping support to maximize sample efficiency for learning personalized treatment effects. We demonstrate the performance of the proposed acquisition strategies on synthetic and semi-synthetic datasets IHDP and CMNIST and their extensions, which aim to simulate common dataset biases and pathologies.  ( 2 min )
    Geometric Transformers for Protein Interface Contact Prediction. (arXiv:2110.02423v3 [cs.LG] UPDATED)
    Computational methods for predicting the interface contacts between proteins come highly sought after for drug discovery as they can significantly advance the accuracy of alternative approaches, such as protein-protein docking, protein function analysis tools, and other computational methods for protein bioinformatics. In this work, we present the Geometric Transformer, a novel geometry-evolving graph transformer for rotation and translation-invariant protein interface contact prediction, packaged within DeepInteract, an end-to-end prediction pipeline. DeepInteract predicts partner-specific protein interface contacts (i.e., inter-protein residue-residue contacts) given the 3D tertiary structures of two proteins as input. In rigorous benchmarks, DeepInteract, on challenging protein complex targets from the 13th and 14th CASP-CAPRI experiments as well as Docking Benchmark 5, achieves 14% and 1.1% top L/5 precision (L: length of a protein unit in a complex), respectively. In doing so, DeepInteract, with the Geometric Transformer as its graph-based backbone, outperforms existing methods for interface contact prediction in addition to other graph-based neural network backbones compatible with DeepInteract, thereby validating the effectiveness of the Geometric Transformer for learning rich relational-geometric features for downstream tasks on 3D protein structures.  ( 2 min )
    Blind ECG Restoration by Operational Cycle-GANs. (arXiv:2202.00589v1 [eess.SP])
    Continuous long-term monitoring of electrocardiography (ECG) signals is crucial for the early detection of cardiac abnormalities such as arrhythmia. Non-clinical ECG recordings acquired by Holter and wearable ECG sensors often suffer from severe artifacts such as baseline wander, signal cuts, motion artifacts, variations on QRS amplitude, noise, and other interferences. Usually, a set of such artifacts occur on the same ECG signal with varying severity and duration, and this makes an accurate diagnosis by machines or medical doctors extremely difficult. Despite numerous studies that have attempted ECG denoising, they naturally fail to restore the actual ECG signal corrupted with such artifacts due to their simple and naive noise model. In this study, we propose a novel approach for blind ECG restoration using cycle-consistent generative adversarial networks (Cycle-GANs) where the quality of the signal can be improved to a clinical level ECG regardless of the type and severity of the artifacts corrupting the signal. To further boost the restoration performance, we propose 1D operational Cycle-GANs with the generative neuron model. The proposed approach has been evaluated extensively using one of the largest benchmark ECG datasets from the China Physiological Signal Challenge (CPSC-2020) with more than one million beats. Besides the quantitative and qualitative evaluations, a group of cardiologists performed medical evaluations to validate the quality and usability of the restored ECG, especially for an accurate arrhythmia diagnosis.  ( 2 min )
    Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding. (arXiv:2103.04850v6 [cs.LG] UPDATED)
    We study the problem of learning conditional average treatment effects (CATE) from high-dimensional, observational data with unobserved confounders. Unobserved confounders introduce ignorance -- a level of unidentifiability -- about an individual's response to treatment by inducing bias in CATE estimates. We present a new parametric interval estimator suited for high-dimensional data, that estimates a range of possible CATE values when given a predefined bound on the level of hidden confounding. Further, previous interval estimators do not account for ignorance about the CATE associated with samples that may be underrepresented in the original study, or samples that violate the overlap assumption. Our interval estimator also incorporates model uncertainty so that practitioners can be made aware of out-of-distribution data. We prove that our estimator converges to tight bounds on CATE when there may be unobserved confounding, and assess it using semi-synthetic, high-dimensional datasets.  ( 2 min )
    A least squares support vector regression for anisotropic diffusion filtering. (arXiv:2202.00595v1 [eess.SP])
    Anisotropic diffusion filtering for signal smoothing as a low-pass filter has the advantage of the edge-preserving, i.e., it does not affect the edges that contain more critical data than the other parts of the signal. In this paper, we present a numerical algorithm based on least squares support vector regression by using Legendre orthogonal kernel with the discretization of the nonlinear diffusion problem in time by the Crank-Nicolson method. This method transforms the signal smoothing process into solving an optimization problem that can be solved by efficient numerical algorithms. In the final analysis, we have reported some numerical experiments to show the effectiveness of the proposed machine learning based approach for signal smoothing.
    Physical Action Categorization using Signal Analysis and Machine Learning. (arXiv:2008.06971v2 [eess.SP] UPDATED)
    Daily life of thousands of individuals around the globe suffers due to physical or mental disability related to limb movement. The quality of life for such individuals can be made better by use of assistive applications and systems. In such scenario, mapping of physical actions from movement to a computer aided application can lead the way for solution. Surface Electromyography (sEMG) presents a non-invasive mechanism through which we can translate the physical movement to signals for classification and use in applications. In this paper, we propose a machine learning based framework for classification of 4 physical actions. The framework looks into the various features from different modalities which contribution from time domain, frequency domain, higher order statistics and inter channel statistics. Next, we conducted a comparative analysis of k-NN, SVM and ELM classifier using the feature set. Effect of different combinations of feature set has also been recorded. Finally, the classifier accuracy with SVM and 1-NN based classifier for a subset of features gives an accuracy of 95.21 and 95.83 respectively. Additionally, we have also proposed that dimensionality reduction by use of PCA leads to only a minor drop of less than 5.55% in accuracy while using only 9.22% of the original feature set. These finding are useful for algorithm designer to choose the best approach keeping in mind the resources available for execution of algorithm.
    Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO. (arXiv:2110.13799v2 [cs.LG] UPDATED)
    Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-clip), which has been popularly used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-clip has not been justified via theoretical proof up to date. This paper proposes to rethink policy optimization and reinterpret the theory of PPO-clip based on hinge policy optimization (HPO), called to improve policy by hinge loss in this paper. Specifically, we leverage sufficient conditions of state-wise policy improvement and then rethink policy update as solving a large-margin classification problem with hinge loss. By leveraging various types of classifiers, the proposed design opens up a whole new family of policy-based algorithms, including the PPO-clip as a special case. Based on this construct, we prove that these algorithms asymptotically attain a globally optimal policy. To our knowledge, this is the first ever that can prove global convergence to an optimal policy for a variant of PPO-clip. Through experiments, we corroborate the performance of a variety of HPO algorithms against popular benchmark methods.
    Biologically Plausible Training Mechanisms for Self-Supervised Learning in Deep Networks. (arXiv:2109.15089v4 [cs.NE] UPDATED)
    We develop biologically plausible training mechanisms for self-supervised learning (SSL) in deep networks. Specifically, by biological plausible training we mean (i) All updates of weights are based on current activities of pre-synaptic units and current, or activity retrieved from short term memory of post synaptic units, including at the top-most error computing layer, (ii) Complex computations such as normalization, inner products and division are avoided (iii) Asymmetric connections between units, (iv) Most learning is carried out in an unsupervised manner. SSL with a contrastive loss satisfies the third condition as it does not require labelled data and it introduces robustness to observed perturbations of objects, which occur naturally as objects or observer move in 3d and with variable lighting over time. We propose a contrastive hinge based loss whose error involves simple local computations satisfying (ii), as opposed to the standard contrastive losses employed in the literature, which do not lend themselves easily to implementation in a network architecture due to complex computations involving ratios and inner products. Furthermore we show that learning can be performed with one of two more plausible alternatives to backpropagation that satisfy conditions (i) and (ii). The first is difference target propagation (DTP) and the second is layer-wise learning (LL), where each layer is directly connected to a layer computing the loss error. Both methods represent alternatives to the symmetric weight issue of backpropagation. By training convolutional neural networks (CNNs) with SSL and DTP, LL, we find that our proposed framework achieves comparable performance to standard BP learning downstream linear classifier evaluation of the learned embeddings.
    CLAWS: Contrastive Learning with hard Attention and Weak Supervision. (arXiv:2112.00847v2 [cs.CV] UPDATED)
    Learning effective visual representations without human supervision is a long-standing problem in computer vision. Recent advances in self-supervised learning algorithms have utilized contrastive learning, with methods such as SimCLR, which applies a composition of augmentations to an image, and minimizes a contrastive loss between the two augmented images. In this paper, we present CLAWS, an annotation-efficient learning framework, addressing the problem of manually labeling large-scale agricultural datasets along with potential applications such as anomaly detection and plant growth analytics. CLAWS uses a network backbone inspired by SimCLR and weak supervision to investigate the effect of contrastive learning within class clusters. In addition, we inject a hard attention mask to the cropped input image before maximizing agreement between the image pairs using a contrastive loss function. This mask forces the network to focus on pertinent object features and ignore background features. We compare results between a supervised SimCLR and CLAWS using an agricultural dataset with 227,060 samples consisting of 11 different crop classes. Our experiments and extensive evaluations show that CLAWS achieves a competitive NMI score of 0.7325. Furthermore, CLAWS engenders the creation of low dimensional representations of very large datasets with minimal parameter tuning and forming well-defined clusters, which lends themselves to using efficient, transparent, and highly interpretable clustering methods such as Gaussian Mixture Models.
    Deep Learning for Free-Hand Sketch: A Survey. (arXiv:2001.02600v3 [cs.CV] UPDATED)
    Free-hand sketches are highly illustrative, and have been widely used by humans to depict objects or stories from ancient times to the present. The recent prevalence of touchscreen devices has made sketch creation a much easier task than ever and consequently made sketch-oriented applications increasingly popular. The progress of deep learning has immensely benefited free-hand sketch research and applications. This paper presents a comprehensive survey of the deep learning techniques oriented at free-hand sketch data, and the applications that they enable. The main contents of this survey include: (i) A discussion of the intrinsic traits and unique challenges of free-hand sketch, to highlight the essential differences between sketch data and other data modalities, e.g., natural photos. (ii) A review of the developments of free-hand sketch research in the deep learning era, by surveying existing datasets, research topics, and the state-of-the-art methods through a detailed taxonomy and experimental evaluation. (iii) Promotion of future work via a discussion of bottlenecks, open problems, and potential research directions for the community.
    Neural Tangent Kernel Beyond the Infinite-Width Limit: Effects of Depth and Initialization. (arXiv:2202.00553v1 [cs.LG])
    Neural Tangent Kernel (NTK) is widely used to analyze overparametrized neural networks due to the famous result by (Jacot et al., 2018): in the infinite-width limit, the NTK is deterministic and constant during training. However, this result cannot explain the behavior of deep networks, since it generally does not hold if depth and width tend to infinity simultaneously. In this paper, we study the NTK of fully-connected ReLU networks with depth comparable to width. We prove that the NTK properties depend significantly on the depth-to-width ratio and the distribution of parameters at initialization. In fact, our results indicate the importance of the three phases in the hyperparameter space identified in (Poole et al., 2016): ordered, chaotic and the edge of chaos (EOC). We derive exact expressions for the NTK dispersion in the infinite-depth-and-width limit in all three phases and conclude that the NTK variability grows exponentially with depth at the EOC and in the chaotic phase but not in the ordered phase. We also show that the NTK of deep networks may stay constant during training only in the ordered phase and discuss how the structure of the NTK matrix changes during training.
    Detection of Increased Time Intervals of Anti-Vaccine Tweets for COVID-19 Vaccine with BERT Model. (arXiv:2202.00477v1 [cs.CL])
    The most effective of the solutions against Covid-19 is the various vaccines developed. Distrust of vaccines can hinder the rapid and effective use of this remedy. One of the means of expressing the thoughts of society is social media. Determining the time intervals during which anti-vaccination increases in social media can help institutions determine the strategy to be used in combating anti-vaccination. Recording and tracking every tweet entered with human labor would be inefficient, so various automation solutions are needed. In this study, The Bidirectional Encoder Representations from Transformers (BERT) model, which is a deep learning-based natural language processing (NLP) model, was used. In a dataset of 1506 tweets divided into four different categories as news, irrelevant, anti-vaccine, and vaccine supporters, the model was trained with a learning rate of 5e-6 for 25 epochs. To determine the intervals in which anti-vaccine tweets are concentrated, the categories to which 652840 tweets belong were determined by using the trained model. The change of the determined categories overtime was visualized and the events that could cause the change were determined. As a result of model training, in the test dataset, the f-score of 0.81 and AUC values for different classes were obtained as 0.99,0.91, 0.92, 0.92, respectively. In this model, unlike the studies in the literature, an auxiliary system is designed that provides data that institutions can use when determining their strategy by measuring and visualizing the frequency of anti-vaccine tweets in a time interval, different from detecting and censoring such tweets.
    A Network Embedding Based Method for Cascade Prediction on Social Networks. (arXiv:2105.08291v5 [cs.LG] UPDATED)
    The prediction for information diffusion on social networks has great practical significance in marketing and public opinion control. It aims to predict the individuals who will potentially repost the message on the social network. One type of method is based on demographics, complex networks and other prior knowledge to establish an interpretable model to simulate and predict the propagation process, while the other type of method is completely data-driven and maps the nodes to a latent space for propagation prediction. Existing latent space design and embedding methods lack consideration for the intervene among users. In this paper, we propose an independent asymmetric embedding method to embed each individual into one latent influence space and multiple latent susceptibility spaces. Based on the similarity between information diffusion and heat diffusion phenomenon, the heat diffusion kernel is exploited in our model and establishes the embedding rules. Furthermore, our method captures the co-occurrence regulation of user combinations in cascades to improve the calculating effectiveness. The results of extensive experiments conducted on real-world datasets verify both the predictive accuracy and cost-effectiveness of our approach.
    Empirical complexity of comparator-based nearest neighbor descent. (arXiv:2202.00517v1 [cs.LG])
    A Java parallel streams implementation of the $K$-nearest neighbor descent algorithm is presented using a natural statistical termination criterion. Input data consist of a set $S$ of $n$ objects of type V, and a Function>, which enables any $x \in S$ to decide which of $y, z \in S\setminus\{x\}$ is more similar to $x$. Experiments with the Kullback-Leibler divergence Comparator support the prediction that the number of rounds of $K$-nearest neighbor updates need not exceed twice the diameter of the undirected version of a random regular out-degree $K$ digraph on $n$ vertices. Overall complexity was $O(n K^2 \log_K(n))$ in the class of examples studied. When objects are sampled uniformly from a $d$-dimensional simplex, accuracy of the $K$-nearest neighbor approximation is high up to $d = 20$, but declines in higher dimensions, as theory would predict.
    Density estimation on low-dimensional manifolds: an inflation-deflation approach. (arXiv:2105.12152v3 [cs.LG] UPDATED)
    Normalizing Flows (NFs) are universal density estimators based on Neural Networks. However, this universality is limited: the density's support needs to be diffeomorphic to a Euclidean space. In this paper, we propose a novel method to overcome this limitation without sacrificing universality. The proposed method inflates the data manifold by adding noise in the normal space, trains an NF on this inflated manifold, and, finally, deflates the learned density. Our main result provides sufficient conditions on the manifold and the specific choice of noise under which the corresponding estimator is exact. Our method has the same computational complexity as NFs and does not require computing an inverse flow. We also show that, if the embedding dimension is much larger than the manifold dimension, noise in the normal space can be well approximated by Gaussian noise. This allows using our method for approximating arbitrary densities on unknown manifolds provided that the manifold dimension is known.
    LightSecAgg: Rethinking Secure Aggregation in Federated Learning. (arXiv:2109.14236v2 [cs.LG] UPDATED)
    Secure model aggregation is a key component of federated learning (FL) that aims at protecting the privacy of each user's individual model while allowing for their global aggregation. It can be applied to any aggregation-based FL approach for training a global or personalized model. Model aggregation needs to also be resilient against likely user dropouts in FL systems, making its design substantially more complex. State-of-the-art secure aggregation protocols rely on secret sharing of the random-seeds used for mask generations at the users to enable the reconstruction and cancellation of those belonging to the dropped users. The complexity of such approaches, however, grows substantially with the number of dropped users. We propose a new approach, named LightSecAgg, to overcome this bottleneck by changing the design from "random-seed reconstruction of the dropped users" to "one-shot aggregate-mask reconstruction of the active users via mask encoding/decoding". We show that LightSecAgg achieves the same privacy and dropout-resiliency guarantees as the state-of-the-art protocols while significantly reducing the overhead for resiliency against dropped users. We also demonstrate that, unlike existing schemes, LightSecAgg can be applied to secure aggregation in the asynchronous FL setting. Furthermore, we provide a modular system design and optimized on-device parallelization for scalable implementation, by enabling computational overlapping between model training and on-device encoding, as well as improving the speed of concurrent receiving and sending of chunked masks. We evaluate LightSecAgg via extensive experiments for training diverse models on various datasets in a realistic FL system with large number of users and demonstrate that LightSecAgg significantly reduces the total training time.
    Fundamental Performance Limits for Sensor-Based Robot Control and Policy Learning. (arXiv:2202.00129v1 [cs.RO])
    Our goal is to develop theory and algorithms for establishing fundamental limits on performance for a given task imposed by a robot's sensors. In order to achieve this, we define a quantity that captures the amount of task-relevant information provided by a sensor. Using a novel version of the generalized Fano inequality from information theory, we demonstrate that this quantity provides an upper bound on the highest achievable expected reward for one-step decision making tasks. We then extend this bound to multi-step problems via a dynamic programming approach. We present algorithms for numerically computing the resulting bounds, and demonstrate our approach on three examples: (i) the lava problem from the literature on partially observable Markov decision processes, (ii) an example with continuous state and observation spaces corresponding to a robot catching a freely-falling object, and (iii) obstacle avoidance using a depth sensor with non-Gaussian noise. We demonstrate the ability of our approach to establish strong limits on achievable performance for these problems by comparing our upper bounds with achievable lower bounds (computed by synthesizing or learning concrete control policies).
    Information Interaction Profile of Choice Adoption. (arXiv:2104.13695v2 [cs.LG] UPDATED)
    Interactions between pieces of information (entities) play a substantial role in the way an individual acts on them: adoption of a product, the spread of news, strategy choice, etc. However, the underlying interaction mechanisms are often unknown and have been little explored in the literature. We introduce an efficient method to infer both the entities interaction network and its evolution according to the temporal distance separating interacting entities; together, they form the interaction profile. The interaction profile allows characterizing the mechanisms of the interaction processes. We approach this problem via a convex model based on recent advances in multi-kernel inference. We consider an ordered sequence of exposures to entities (URL, ads, situations) and the actions the user exerts on them (share, click, decision). We study how users exhibit different behaviors according to combinations of exposures they have been exposed to. We show that the effect of a combination of exposures on a user is more than the sum of each exposure's independent effect--there is an interaction. We reduce this modeling to a non-parametric convex optimization problem that can be solved in parallel. Our method recovers state-of-the-art results on interaction processes on three real-world datasets and outperforms baselines in the inference of the underlying data generation mechanisms. Finally, we show that interaction profiles can be visualized intuitively, easing the interpretation of the model.
    Scalable Fragment-Based 3D Molecular Design with Reinforcement Learning. (arXiv:2202.00658v1 [cs.LG])
    Machine learning has the potential to automate molecular design and drastically accelerate the discovery of new functional compounds. Towards this goal, generative models and reinforcement learning (RL) using string and graph representations have been successfully used to search for novel molecules. However, these approaches are limited since their representations ignore the three-dimensional (3D) structure of molecules. In fact, geometry plays an important role in many applications in inverse molecular design, especially in drug discovery. Thus, it is important to build models that can generate molecular structures in 3D space based on property-oriented geometric constraints. To address this, one approach is to generate molecules as 3D point clouds by sequentially placing atoms at locations in space -- this allows the process to be guided by physical quantities such as energy or other properties. However, this approach is inefficient as placing individual atoms makes the exploration unnecessarily deep, limiting the complexity of molecules that can be generated. Moreover, when optimizing a molecule, organic and medicinal chemists use known fragments and functional groups, not single atoms. We introduce a novel RL framework for scalable 3D design that uses a hierarchical agent to build molecules by placing molecular substructures sequentially in 3D space, thus attempting to build on the existing human knowledge in the field of molecular design. In a variety of experiments with different substructures, we show that our agent, guided only by energy considerations, can efficiently learn to produce molecules with over 100 atoms from many distributions including drug-like molecules, organic LED molecules, and biomolecules.
    Combined Pruning for Nested Cross-Validation to Accelerate Automated Hyperparameter Optimization for Embedded Feature Selection in High-Dimensional Data with Very Small Sample Sizes. (arXiv:2202.00598v1 [cs.LG])
    Applying tree-based embedded feature selection to exclude irrelevant features in high-dimensional data with very small sample sizes requires optimized hyperparameters for the model building process. In addition, nested cross-validation must be applied for this type of data to avoid biased model performance. The resulting long computation time can be accelerated with pruning. However, standard pruning algorithms must prune late or risk aborting calculations of promising hyperparameter sets due to high variance in the performance evaluation metric. To address this, we adapt the usage of a state-of-the-art successive halving pruner and combine it with two new pruning strategies based on domain or prior knowledge. One additional pruning strategy immediately stops the computation of trials with semantically meaningless results for the selected hyperparameter combinations. The other is an extrapolating threshold pruning strategy suitable for nested-cross-validation with high variance. Our proposed combined three-layer pruner keeps promising trials while reducing the number of models to be built by up to 81,3% compared to using a state-of-the-art asynchronous successive halving pruner alone. Our three-layer pruner implementation(available at https://github.com/sigrun-may/cv-pruner) speeds up data analysis or enables deeper hyperparameter search within the same computation time. It consequently saves time, money and energy, reducing the CO2 footprint.
    Firm-based relatedness using machine learning. (arXiv:2202.00458v1 [cs.LG])
    The relatedness between an economic actor (for instance a country, or a firm) and a product is a measure of the feasibility of that economic activity. As such, it is a driver for investments both at a private and institutional level. Traditionally, relatedness is measured using complex networks approaches derived by country-level co-occurrences. In this work, we compare complex networks and machine learning algorithms trained on both country and firm-level data. In order to quantitatively compare the different measures of relatedness, we use them to predict the future exports at country and firm-level, assuming that more related products have higher likelihood to be exported in the near future. Our results show that relatedness is scale-dependent: the best assessments are obtained by using machine learning on the same typology of data one wants to predict. Moreover, while relatedness measures based on country data are not suitable for firms, firm-level data are quite informative also to predict the development of countries. In this sense, models built on firm data provide a better assessment of relatedness with respect to country-level data. We also discuss the effect of using community detection algorithms and parameter optimization, finding that a partition into a higher number of blocks decreases the computational time while maintaining a prediction performance that is well above the network based benchmarks.
    Tight Accounting in the Shuffle Model of Differential Privacy. (arXiv:2106.00477v3 [cs.CR] UPDATED)
    Shuffle model of differential privacy is a novel distributed privacy model based on a combination of local privacy mechanisms and a secure shuffler. It has been shown that the additional randomisation provided by the shuffler improves privacy bounds compared to the purely local mechanisms. Accounting tight bounds, however, is complicated by the complexity brought by the shuffler. The recently proposed numerical techniques for evaluating $(\varepsilon,\delta)$-differential privacy guarantees have been shown to give tighter bounds than commonly used methods for compositions of various complex mechanisms. In this paper, we show how to obtain accurate bounds for adaptive compositions of general $\varepsilon$-LDP shufflers using the analysis by Feldman et al. (2021) and tight bounds for adaptive compositions of shufflers of $k$-randomised response mechanisms, using the analysis by Balle et al. (2019). We show how to speed up the evaluation of the resulting privacy loss distribution from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$, where $n$ is the number of users, without noticeable change in the resulting $\delta(\varepsilon)$-upper bounds. We also demonstrate looseness of the existing bounds and methods found in the literature, improving previous composition results significantly.
    The Neural Testbed: Evaluating Predictive Distributions. (arXiv:2110.04629v2 [cs.LG] UPDATED)
    Posterior predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an opensource library for controlled and principled evaluation of agents that generate such predictions. Crucially, agents are assessed not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a neural-network-based data generating process. Our results indicate that some well-known agents that produce accurate marginal predictions do not fare well with joint predictions. We show that the quality of joint predictions drives performance in downstream decision tasks, and highlight the importance of this observation to the community.
    Tutorial on amortized optimization for learning to optimize over continuous domains. (arXiv:2202.00665v1 [cs.LG])
    Optimization is a ubiquitous modeling tool that is often deployed in settings that repeatedly solve similar instances of the same problem. Amortized optimization methods use learning to predict the solutions to problems in these settings. This leverages the shared structure between similar problem instances. In this tutorial, we will discuss the key design choices behind amortized optimization, roughly categorizing 1) models into fully-amortized and semi-amortized approaches, and 2) learning methods into regression-based and objective-based. We then view existing applications through these foundations to draw connections between them, including for manifold optimization, variational inference, sparse coding, meta-learning, control, reinforcement learning, convex optimization, and deep equilibrium networks. This framing enables us easily see, for example, that the amortized inference in variational autoencoders is conceptually identical to value gradients in control and reinforcement learning as they both use fully-amortized models with a objective-based loss. The source code for this tutorial is available at https://www.github.com/facebookresearch/amortized-optimization-tutorial
    Molecular Graph Representation Learning via Heterogeneous Motif Graph Construction. (arXiv:2202.00529v1 [cs.LG])
    We consider feature representation learning problem of molecular graphs. Graph Neural Networks have been widely used in feature representation learning of molecular graphs. However, most existing methods deal with molecular graphs individually while neglecting their connections, such as motif-level relationships. We propose a novel molecular graph representation learning method by constructing a heterogeneous motif graph to address this issue. In particular, we build a heterogeneous motif graph that contains motif nodes and molecular nodes. Each motif node corresponds to a motif extracted from molecules. Then, we propose a Heterogeneous Motif Graph Neural Network (HM-GNN) to learn feature representations for each node in the heterogeneous motif graph. Our heterogeneous motif graph also enables effective multi-task learning, especially for small molecular datasets. To address the potential efficiency issue, we propose to use an edge sampler, which can significantly reduce computational resources usage. The experimental results show that our model consistently outperforms previous state-of-the-art models. Under multi-task settings, the promising performances of our methods on combined datasets shed light on a new learning paradigm for small molecular datasets. Finally, we show that our model achieves similar performances with significantly less computational resources by using our edge sampler.
    Being Properly Improper. (arXiv:2106.09920v2 [cs.LG] UPDATED)
    Properness for supervised losses stipulates that the loss function shapes the learning algorithm towards the true posterior of the data generating distribution. Unfortunately, data in modern machine learning can be corrupted or twisted in many ways. Hence, optimizing a proper loss function on twisted data could perilously lead the learning algorithm towards the twisted posterior, rather than to the desired clean posterior. Many papers cope with specific twists (e.g., label/feature/adversarial noise), but there is a growing need for a unified and actionable understanding atop properness. Our chief theoretical contribution is a generalization of the properness framework with a notion called twist-properness, which delineates loss functions with the ability to "untwist" the twisted posterior into the clean posterior. Notably, we show that a nontrivial extension of a loss function called $\alpha$-loss, which was first introduced in information theory, is twist-proper. We study the twist-proper $\alpha$-loss under a novel boosting algorithm, called PILBoost, and provide formal and experimental results for this algorithm. Our overarching practical conclusion is that the twist-proper $\alpha$-loss outperforms the proper $\log$-loss on several variants of twisted data.
    A Regret Minimization Approach to Multi-Agent Control. (arXiv:2201.13288v2 [math.OC] UPDATED)
    We study the problem of multi-agent control of a dynamical system with known dynamics and adversarial disturbances. Our study focuses on optimal control without centralized precomputed policies, but rather with adaptive control policies for the different agents that are only equipped with a stabilizing controller. We give a reduction from any (standard) regret minimizing control method to a distributed algorithm. The reduction guarantees that the resulting distributed algorithm has low regret relative to the optimal precomputed joint policy. Our methodology involves generalizing online convex optimization to a multi-agent setting and applying recent tools from nonstochastic control derived for a single agent. We empirically evaluate our method on a model of an overactuated aircraft. We show that the distributed method is robust to failure and to adversarial perturbations in the dynamics.
    Identifying Pauli spin blockade using deep learning. (arXiv:2202.00574v1 [cond-mat.mes-hall])
    Pauli spin blockade (PSB) can be employed as a great resource for spin qubit initialisation and readout even at elevated temperatures but it can be difficult to identify. We present a machine learning algorithm capable of automatically identifying PSB using charge transport measurements. The scarcity of PSB data is circumvented by training the algorithm with simulated data and by using cross-device validation. We demonstrate our approach on a silicon field-effect transistor device and report an accuracy of 96% on different test devices, giving evidence that the approach is robust to device variability. The approach is expected to be employable across all types of quantum dot devices.
    Deep KKL: Data-driven Output Prediction for Non-Linear Systems. (arXiv:2103.12443v2 [cs.LG] UPDATED)
    We address the problem of output prediction, ie. designing a model for autonomous nonlinear systems capable of forecasting their future observations. We first define a general framework bringing together the necessary properties for the development of such an output predictor. In particular, we look at this problem from two different viewpoints, control theory and data-driven techniques (machine learning), and try to formulate it in a consistent way, reducing the gap between the two fields. Building on this formulation and problem definition, we propose a predictor structure based on the Kazantzis-Kravaris/Luenberger (KKL) observer and we show that KKL fits well into our general framework. Finally, we propose a constructive solution for this predictor that solely relies on a small set of trajectories measured from the system. Our experiments show that our solution allows to obtain an efficient predictor over a subset of the observation space.
    Over-smoothing Effect of Graph Convolutional Networks. (arXiv:2201.12830v2 [cs.LG] UPDATED)
    Over-smoothing is a severe problem which limits the depth of Graph Convolutional Networks. This article gives a comprehensive analysis of the mechanism behind Graph Convolutional Networks and the over-smoothing effect. The article proposes an upper bound for the occurrence of over-smoothing, which offers insight into the key factors behind over-smoothing. The results presented in this article successfully explain the feasibility of several algorithms that alleviate over-smoothing.
    GENEOnet: A new machine learning paradigm based on Group Equivariant Non-Expansive Operators. An application to protein pocket detection. (arXiv:2202.00451v1 [q-bio.BM])
    Nowadays there is a big spotlight cast on the development of techniques of explainable machine learning. Here we introduce a new computational paradigm based on Group Equivariant Non-Expansive Operators, that can be regarded as the product of a rising mathematical theory of information-processing observers. This approach, that can be adjusted to different situations, may have many advantages over other common tools, like Neural Networks, such as: knowledge injection and information engineering, selection of relevant features, small number of parameters and higher transparency. We chose to test our method, called GENEOnet, on a key problem in drug design: detecting pockets on the surface of proteins that can host ligands. Experimental results confirmed that our method works well even with a quite small training set, providing thus a great computational advantage, while the final comparison with other state-of-the-art methods shows that GENEOnet provides better or comparable results in terms of accuracy.
    Minority Class Oriented Active Learning for Imbalanced Datasets. (arXiv:2202.00390v1 [cs.LG])
    Active learning aims to optimize the dataset annotation process when resources are constrained. Most existing methods are designed for balanced datasets. Their practical applicability is limited by the fact that a majority of real-life datasets are actually imbalanced. Here, we introduce a new active learning method which is designed for imbalanced datasets. It favors samples likely to be in minority classes so as to reduce the imbalance of the labeled subset and create a better representation for these classes. We also compare two training schemes for active learning: (1) the one commonly deployed in deep active learning using model fine tuning for each iteration and (2) a scheme which is inspired by transfer learning and exploits generic pre-trained models and train shallow classifiers for each iteration. Evaluation is run with three imbalanced datasets. Results show that the proposed active learning method outperforms competitive baselines. Equally interesting, they also indicate that the transfer learning training scheme outperforms model fine tuning if features are transferable from the generic dataset to the unlabeled one. This last result is surprising and should encourage the community to explore the design of deep active learning methods.
    Sinogram Enhancement with Generative Adversarial Networks using Shape Priors. (arXiv:2202.00419v1 [eess.IV])
    Compensating scarce measurements by inferring them from computational models is a way to address ill-posed inverse problems. We tackle Limited Angle Tomography by completing the set of acquisitions using a generative model and prior-knowledge about the scanned object. Using a Generative Adversarial Network as model and Computer-Assisted Design data as shape prior, we demonstrate a quantitative and qualitative advantage of our technique over other state-of-the-art methods. Inferring a substantial number of consecutive missing measurements, we offer an alternative to other image inpainting techniques that fall short of providing a satisfying answer to our research question: can X-Ray exposition be reduced by using generative models to infer lacking measurements?
    Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning. (arXiv:2106.02584v2 [cs.LG] UPDATED)
    We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.
    CLEVA-Compass: A Continual Learning EValuation Assessment Compass to Promote Research Transparency and Comparability. (arXiv:2110.03331v2 [cs.LG] UPDATED)
    What is the state of the art in continual machine learning? Although a natural question for predominant static benchmarks, the notion to train systems in a lifelong manner entails a plethora of additional challenges with respect to set-up and evaluation. The latter have recently sparked a growing amount of critiques on prominent algorithm-centric perspectives and evaluation protocols being too narrow, resulting in several attempts at constructing guidelines in favor of specific desiderata or arguing against the validity of prevalent assumptions. In this work, we depart from this mindset and argue that the goal of a precise formulation of desiderata is an ill-posed one, as diverse applications may always warrant distinct scenarios. Instead, we introduce the Continual Learning EValuation Assessment Compass: the CLEVA-Compass. The compass provides the visual means to both identify how approaches are practically reported and how works can simultaneously be contextualized in the broader literature landscape. In addition to promoting compact specification in the spirit of recent replication trends, it thus provides an intuitive chart to understand the priorities of individual systems, where they resemble each other, and what elements are missing towards a fair comparison.
    Fishing for User Data in Large-Batch Federated Learning via Gradient Magnification. (arXiv:2202.00580v1 [cs.LG])
    Federated learning (FL) has rapidly risen in popularity due to its promise of privacy and efficiency. Previous works have exposed privacy vulnerabilities in the FL pipeline by recovering user data from gradient updates. However, existing attacks fail to address realistic settings because they either 1) require a `toy' settings with very small batch sizes, or 2) require unrealistic and conspicuous architecture modifications. We introduce a new strategy that dramatically elevates existing attacks to operate on batches of arbitrarily large size, and without architectural modifications. Our model-agnostic strategy only requires modifications to the model parameters sent to the user, which is a realistic threat model in many scenarios. We demonstrate the strategy in challenging large-scale settings, obtaining high-fidelity data extraction in both cross-device and cross-silo federated learning.
    Convergent Graph Solvers. (arXiv:2106.01680v3 [cs.LG] UPDATED)
    We propose the convergent graph solver (CGS), a deep learning method that learns iterative mappings to predict the properties of a graph system at its stationary state (fixed point) with guaranteed convergence. CGS systematically computes the fixed points of a target graph system and decodes them to estimate the stationary properties of the system without the prior knowledge of existing solvers or intermediate solutions. The forward propagation of CGS proceeds in three steps: (1) constructing the input dependent linear contracting iterative maps, (2) computing the fixed-points of the linear maps, and (3) decoding the fixed-points to estimate the properties. The contractivity of the constructed linear maps guarantees the existence and uniqueness of the fixed points following the Banach fixed point theorem. To train CGS efficiently, we also derive a tractable analytical expression for its gradient by leveraging the implicit function theorem. We evaluate the performance of CGS by applying it to various network-analytic and graph benchmark problems. The results indicate that CGS has competitive capabilities for predicting the stationary properties of graph systems, irrespective of whether the target systems are linear or non-linear. CGS also shows high performance for graph classification problems where the existence or the meaning of a fixed point is hard to be clearly defined, which highlights the potential of CGS as a general graph neural network architecture.
    MotifExplainer: a Motif-based Graph Neural Network Explainer. (arXiv:2202.00519v1 [cs.LG])
    We consider the explanation problem of Graph Neural Networks (GNNs). Most existing GNN explanation methods identify the most important edges or nodes but fail to consider substructures, which are more important for graph data. The only method that considers subgraphs tries to search all possible subgraphs and identify the most significant subgraphs. However, the subgraphs identified may not be recurrent or statistically important. In this work, we propose a novel method, known as MotifExplainer, to explain GNNs by identifying important motifs, recurrent and statistically significant patterns in graphs. Our proposed motif-based methods can provide better human-understandable explanations than methods based on nodes, edges, and regular subgraphs. Given an input graph and a pre-trained GNN model, our method first extracts motifs in the graph using well-designed motif extraction rules. Then we generate motif embedding by feeding motifs into the pre-trained GNN. Finally, we employ an attention-based method to identify the most influential motifs as explanations for the final prediction results. The empirical studies on both synthetic and real-world datasets demonstrate the effectiveness of our method.
    Data Science as Political Action: Grounding Data Science in a Politics of Justice. (arXiv:1811.03435v4 [cs.CY] UPDATED)
    In response to public scrutiny of data-driven algorithms, the field of data science has adopted ethics training and principles. Although ethics can help data scientists reflect on certain normative aspects of their work, such efforts are ill-equipped to generate a data science that avoids social harms and promotes social justice. In this article, I argue that data science must embrace a political orientation. Data scientists must recognize themselves as political actors engaged in normative constructions of society and evaluate their work according to its downstream impacts on people's lives. I first articulate why data scientists must recognize themselves as political actors. In this section, I respond to three arguments that data scientists commonly invoke when challenged to take political positions regarding their work. In confronting these arguments, I describe why attempting to remain apolitical is itself a political stance--a fundamentally conservative one--and why data science's attempts to promote "social good" dangerously rely on unarticulated and incrementalist political assumptions. I then propose a framework for how data science can evolve toward a deliberative and rigorous politics of social justice. I conceptualize the process of developing a politically engaged data science as a sequence of four stages. Pursuing these new approaches will empower data scientists with new methods for thoughtfully and rigorously contributing to social justice.
    Learning entanglement breakdown as a phase transition by confusion. (arXiv:2202.00348v1 [quant-ph])
    Quantum technologies require methods for preparing and manipulating entangled multiparticle states. However, the problem of determining whether a given quantum state is entangled or separable is known to be an NP-hard problem in general, and even the task of detecting entanglement breakdown for a given class of quantum states is difficult. In this work, we develop an approach for revealing entanglement breakdown using a machine learning technique, which is known as 'learning by confusion'. We consider a family of quantum states, which is parameterized such that there is a single critical value dividing states within this family on separate and entangled. We demonstrate the 'learning by confusion' scheme allows determining the critical value. Specifically, we study the performance of the method for the two-qubit, two-qutrit, and two-ququart entangled state, where the standard entanglement measures do not work efficiently. In addition, we investigate the properties of the local depolarization and the generalized amplitude damping channel in the framework of the confusion scheme. Within our approach and setting the parameterization of special trajectories to construct W shapes, we obtain an entanglement-breakdown 'phase diagram' of a quantum channel, which indicates regions of entangled (separable) states and the entanglement-breakdown region. Then we extend the way of using the 'learning by confusion' scheme for recognizing whether an arbitrary given state is entangled or separable. We show that the developed method provides correct answers for a variety of states, including entangled states with positive partial transpose (PPT). We also present a more practical version of the method, which is suitable for studying entanglement breakdown in noisy intermediate-scale quantum (NISQ) devices. We demonstrate its performance using an available cloud-based IBM quantum processor.
    Rewiring What-to-Watch-Next Recommendations to Reduce Radicalization Pathways. (arXiv:2202.00640v1 [cs.CY])
    Recommender systems typically suggest to users content similar to what they consumed in the past. If a user happens to be exposed to strongly polarized content, she might subsequently receive recommendations which may steer her towards more and more radicalized content, eventually being trapped in what we call a "radicalization pathway". In this paper, we study the problem of mitigating radicalization pathways using a graph-based approach. Specifically, we model the set of recommendations of a "what-to-watch-next" recommender as a d-regular directed graph where nodes correspond to content items, links to recommendations, and paths to possible user sessions. We measure the "segregation" score of a node representing radicalized content as the expected length of a random walk from that node to any node representing non-radicalized content. High segregation scores are associated to larger chances to get users trapped in radicalization pathways. Hence, we define the problem of reducing the prevalence of radicalization pathways by selecting a small number of edges to "rewire", so to minimize the maximum of segregation scores among all radicalized nodes, while maintaining the relevance of the recommendations. We prove that the problem of finding the optimal set of recommendations to rewire is NP-hard and NP-hard to approximate within any factor. Therefore, we turn our attention to heuristics, and propose an efficient yet effective greedy algorithm based on the absorbing random walk theory. Our experiments on real-world datasets in the context of video and news recommendations confirm the effectiveness of our proposal.
    Factorized-FL: Agnostic Personalized Federated Learning with Kernel Factorization & Similarity Matching. (arXiv:2202.00270v1 [cs.LG])
    In real-world federated learning scenarios, participants could have their own personalized labels which are incompatible with those from other clients, due to using different label permutations or tackling completely different tasks or domains. However, most existing FL approaches cannot effectively tackle such extremely heterogeneous scenarios since they often assume that (1) all participants use a synchronized set of labels, and (2) they train on the same task from the same domain. In this work, to tackle these challenges, we introduce Factorized-FL, which allows to effectively tackle label- and task-heterogeneous federated learning settings by factorizing the model parameters into a pair of vectors, where one captures the common knowledge across different labels and tasks and the other captures knowledge specific to the task each local model tackles. Moreover, based on the distance in the client-specific vector space, Factorized-FL performs selective aggregation scheme to utilize only the knowledge from the relevant participants for each client. We extensively validate our method on both label- and domain-heterogeneous settings, on which it outperforms the state-of-the-art personalized federated learning methods.
    Questions for Flat-Minima Optimization of Modern Neural Networks. (arXiv:2202.00661v1 [cs.LG])
    For training neural networks, flat-minima optimizers that seek to find parameters in neighborhoods having uniformly low loss (flat minima) have been shown to improve upon stochastic and adaptive gradient-based methods. Two methods for finding flat minima stand out: 1. Averaging methods (i.e., Stochastic Weight Averaging, SWA), and 2. Minimax methods (i.e., Sharpness Aware Minimization, SAM). However, despite similar motivations, there has been limited investigation into their properties and no comprehensive comparison between them. In this work, we investigate the loss surfaces from a systematic benchmarking of these approaches across computer vision, natural language processing, and graph learning tasks. This leads us to a hypothesis: since both approaches find flat solutions in orthogonal ways, combining them should improve generalization even further. We verify this improves over either flat-minima approach in 39 out of 42 cases. When it does not, we provide potential explanations. We hope our results across image, graph, and text data will help researchers to improve deep learning optimizers, and practitioners to pinpoint the optimizer for the problem at hand.
    A Training Framework for Stereo-Aware Speech Enhancement using Deep Neural Networks. (arXiv:2112.04939v2 [eess.AS] UPDATED)
    Deep learning-based speech enhancement has shown unprecedented performance in recent years. The most popular mono speech enhancement frameworks are end-to-end networks mapping the noisy mixture into an estimate of the clean speech. With growing computational power and availability of multichannel microphone recordings, prior works have aimed to incorporate spatial statistics along with spectral information to boost up performance. Despite an improvement in enhancement performance of mono output, the spatial image preservation and subjective evaluations have not gained much attention in the literature. This paper proposes a novel stereo-aware framework for speech enhancement, i.e., a training loss for deep learning-based speech enhancement to preserve the spatial image while enhancing the stereo mixture. The proposed framework is model independent, hence it can be applied to any deep learning based architecture. We provide an extensive objective and subjective evaluation of the trained models through a listening test. We show that by regularizing for an image preservation loss, the overall performance is improved, and the stereo aspect of the speech is better preserved.
    AVTPnet: Convolutional Autoencoder for AVTP anomaly detection in Automotive Ethernet Networks. (arXiv:2202.00045v1 [cs.LG])
    Network Intrusion Detection Systems are well considered as efficient tools for securing in-vehicle networks against diverse cyberattacks. However, since cyberattack are always evolving, signature-based intrusion detection systems are no longer adopted. An alternative solution can be the deployment of deep learning based intrusion detection system (IDS) which play an important role in detecting unknown attack patterns in network traffic. To our knowledge, no previous research work has been done to detect anomalies on automotive ethernet based in-vehicle networks using anomaly based approaches. Hence, in this paper, we propose a convolutional autoencoder (CAE) for offline detection of anomalies on the Audio Video Transport Protocol (AVTP), an application layer protocol implemented in the recent in-vehicle network Automotive Ethernet. The CAE consists of an encoder and a decoder with CNN structures that are asymmetrical. Anomalies in AVTP packet stream, which may lead to critical interruption of media streams, are therefore detected by measuring the reconstruction error of each sliding window of AVTP packets. Our proposed approach is evaluated on the recently published "Automotive Ethernet Intrusion Dataset", and is also compared with other state-of-the art traditional anomaly detection and signature based models in machine learning. The numerical results show that our proposed model outperfoms the other methods and excel at predicting unknown in-vehicle intrusions, with 0.94 accuracy. Moreover, our model has a low level of false alarm and miss detection rates for different AVTP attack types.
    A Comparative Study of Calibration Methods for Imbalanced Class Incremental Learning. (arXiv:2202.00386v1 [cs.LG])
    Deep learning approaches are successful in a wide range of AI problems and in particular for visual recognition tasks. However, there are still open problems among which is the capacity to handle streams of visual information and the management of class imbalance in datasets. Existing research approaches these two problems separately while they co-occur in real world applications. Here, we study the problem of learning incrementally from imbalanced datasets. We focus on algorithms which have a constant deep model complexity and use a bounded memory to store exemplars of old classes across incremental states. Since memory is bounded, old classes are learned with fewer images than new classes and an imbalance due to incremental learning is added to the initial dataset imbalance. A score prediction bias in favor of new classes appears and we evaluate a comprehensive set of score calibration methods to reduce it. Evaluation is carried with three datasets, using two dataset imbalance configurations and three bounded memory sizes. Results show that most calibration methods have beneficial effect and that they are most useful for lower bounded memory sizes, which are most interesting in practice. As a secondary contribution, we remove the usual distillation component from the loss function of incremental learning algorithms. We show that simpler vanilla fine tuning is a stronger backbone for imbalanced incremental learning algorithms.
    You May Not Need Ratio Clipping in PPO. (arXiv:2202.00079v1 [cs.LG])
    Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data. Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples. Ratio clipping yields a pessimistic estimate of the original surrogate objective, and has been shown to be crucial for strong performance. We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios. Instead, one can directly optimize the original surrogate objective for multiple epochs; the key is to find a proper condition to early stop the optimization epoch in each iteration. Our theoretical analysis sheds light on how to determine when to stop the optimization epoch, and call the resulting algorithm Early Stopping Policy Optimization (ESPO). We compare ESPO with PPO across many continuous control tasks and show that ESPO significantly outperforms PPO. Furthermore, we show that ESPO can be easily scaled up to distributed training with many workers, delivering strong performance as well.
    Calibration of P-values for calibration and for deviation of a subpopulation from the full population. (arXiv:2202.00100v1 [stat.ME])
    The author's recent research papers, "Cumulative deviation of a subpopulation from the full population" and "A graphical method of cumulative differences between two subpopulations" (both published in volume 8 of Springer's open-access "Journal of Big Data" during 2021), propose graphical methods and summary statistics, without extensively calibrating formal significance tests. The summary metrics and methods can measure the calibration of probabilistic predictions and can assess differences in responses between a subpopulation and the full population while controlling for a covariate or score via conditioning on it. These recently published papers construct significance tests based on the scalar summary statistics, but only sketch how to calibrate the attained significance levels (also known as "P-values") for the tests. The present article reviews and synthesizes work spanning many decades in order to detail how to calibrate the P-values. The present paper presents computationally efficient, easily implemented numerical methods for evaluating properly calibrated P-values, together with rigorous mathematical proofs guaranteeing their accuracy, and illustrates and validates the methods with open-source software and numerical examples.
    Self-Supervision is All You Need for Solving Rubik's Cube. (arXiv:2106.03157v3 [cs.LG] UPDATED)
    While combinatorial problems are of great academic and practical importance, previous approaches like explicit search and reinforcement learning have been complex and costly. We therefore propose a simple and robust method to solve combinatorial problems whose goals are predefined. Assuming that more optimal moves occur more frequently as a path of random moves connecting two combinatorial states, we show that a Deep Neural Network can generate near-optimal solutions merely by training it to predict the last move of a random scramble based on the problem state. When solving 1,000 test cases of Rubik's Cube, our method outperformed DeepCubeA with a mean solution length of 21.27 moves, also achieving over 10 times higher efficiency for training and 2.0 times for inference. The proposed method may apply to other goal-predefined combinatorial problems, though it has a few constraints.
    Learning to Synthesize Programs as Interpretable and Generalizable Policies. (arXiv:2108.13643v4 [cs.LG] UPDATED)
    Recently, deep reinforcement learning (DRL) methods have achieved impressive performance on tasks in a variety of domains. However, neural network policies produced with DRL methods are not human-interpretable and often have difficulty generalizing to novel scenarios. To address these issues, prior works explore learning programmatic policies that are more interpretable and structured for generalization. Yet, these works either employ limited policy representations (e.g. decision trees, state machines, or predefined program templates) or require stronger supervision (e.g. input/output state pairs or expert demonstrations). We present a framework that instead learns to synthesize a program, which details the procedure to solve a task in a flexible and expressive manner, solely from reward signals. To alleviate the difficulty of learning to compose programs to induce the desired agent behavior from scratch, we propose to first learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner and then search over the learned program embedding space to yield a program that maximizes the return for a given task. Experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs but also outperforms DRL and program synthesis baselines while producing interpretable and more generalizable policies. We also justify the necessity of the proposed two-stage learning scheme as well as analyze various methods for learning the program embedding.
    Deconfounded Representation Similarity for Comparison of Neural Networks. (arXiv:2202.00095v1 [stat.ML])
    Similarity metrics such as representational similarity analysis (RSA) and centered kernel alignment (CKA) have been used to compare layer-wise representations between neural networks. However, these metrics are confounded by the population structure of data items in the input space, leading to spuriously high similarity for even completely random neural networks and inconsistent domain relations in transfer learning. We introduce a simple and generally applicable fix to adjust for the confounder with covariate adjustment regression, which retains the intuitive invariance properties of the original similarity measures. We show that deconfounding the similarity metrics increases the resolution of detecting semantically similar neural networks. Moreover, in real-world applications, deconfounding improves the consistency of representation similarities with domain similarities in transfer learning, and increases correlation with out-of-distribution accuracy.
    Examining Scaling and Transfer of Language Model Architectures for Machine Translation. (arXiv:2202.00528v1 [cs.CL])
    Natural language understanding and generation models follow one of the two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but with few studies investigating the performance of LMs. In this work, we thoroughly examine the role of several architectural design choices on the performance of LMs on bilingual, (massively) multilingual and zero-shot translation tasks, under systematic variations of data conditions and model sizes. Our results show that: (i) Different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases, (ii) Several design choices, including causal masking and language-modeling objectives for the source sequence, have detrimental effects on translation quality, and (iii) When paired with full-visible masking for source sequences, LMs could perform on par with EncDec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by facilitating the reduction of off-target translations.
    Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO. (arXiv:2202.00082v1 [cs.LG])
    We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which holds even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, i.e., Independent Proximal Policy Optimization (IPPO) and Multi-Agent PPO (MAPPO), which both rely on independent ratios, i.e., computing probability ratios separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing the trust region constraint over all decentralized policies. We also show this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Moreover, we show that the surrogate objectives optimized in IPPO and MAPPO are essentially equivalent when their critics converge to a fixed point. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and the good values of the hyperparameters for this enforcement are highly sensitive to the number of agents, as predicted by our theoretical analysis.
    Ultra-fast Deep Mixtures of Gaussian Process Experts. (arXiv:2006.13309v2 [cs.LG] UPDATED)
    Mixtures of experts have become an indispensable tool for flexible modelling in a supervised learning context, and sparse Gaussian processes (GP) have shown promise as a leading candidate for the experts in such models. In this article, we propose to design the gating network for selecting the experts from such mixtures of sparse GPs using a deep neural network (DNN). Furthermore, an ultra-fast one pass algorithm called Cluster-Classify-Regress (CCR) is leveraged to approximate the maximum a posteriori (MAP) estimator extremely quickly. This powerful combination of model and algorithm together delivers a novel method which is flexible, robust, and extremely efficient. In particular, the method is able to outperform competing methods in terms of accuracy and uncertainty quantification. The cost is competitive on low-dimensional and small data sets but is significantly lower for higher dimensional and big data sets. Iteratively maximizing the distribution of experts given allocations and allocations given experts does not provide significant improvement, which indicates that the algorithm achieves a good approximation to the local MAP estimator very fast. This insight can be useful also in the context of other mixture of experts models.
    DoCoM-SGT: Doubly Compressed Momentum-assisted Stochastic Gradient Tracking Algorithm for Communication Efficient Decentralized Learning. (arXiv:2202.00255v1 [cs.LG])
    This paper proposes the Doubly Compressed Momentum-assisted Stochastic Gradient Tracking algorithm (DoCoM-SGT) for communication efficient decentralized learning. DoCoM-SGT utilizes two compression steps per communication round as the algorithm tracks simultaneously the averaged iterate and stochastic gradient. Furthermore, DoCoM-SGT incorporates a momentum based technique for reducing variances in the gradient estimates. We show that DoCoM-SGT finds a solution $\bar{\theta}$ in $T$ iterations satisfying $\mathbb{E} [ \| \nabla f(\bar{\theta}) \|^2 ] = {\cal O}( 1 / T^{2/3} )$ for non-convex objective functions; and we provide competitive convergence rate guarantees for other function classes. Numerical experiments on synthetic and real datasets validate the efficacy of our algorithm.
    Adversarial Imitation Learning from Video using a State Observer. (arXiv:2202.00243v1 [cs.RO])
    The imitation learning research community has recently made significant progress towards the goal of enabling artificial agents to imitate behaviors from video demonstrations alone. However, current state-of-the-art approaches developed for this problem exhibit high sample complexity due, in part, to the high-dimensional nature of video observations. Towards addressing this issue, we introduce here a new algorithm called Visual Generative Adversarial Imitation from Observation using a State Observer VGAIfO-SO. At its core, VGAIfO-SO seeks to address sample inefficiency using a novel, self-supervised state observer, which provides estimates of lower-dimensional proprioceptive state representations from high-dimensional images. We show experimentally in several continuous control environments that VGAIfO-SO is more sample efficient than other IfO algorithms at learning from video-only demonstrations and can sometimes even achieve performance close to the Generative Adversarial Imitation from Observation (GAIfO) algorithm that has privileged access to the demonstrator's proprioceptive state information.
    Deepfake pornography as a male gaze on fan culture. (arXiv:2202.00374v1 [cs.CY])
    This essay shows the impact of deepfake technology on fan culture. The innovative technology provided the male audience with an instrument to express its ideas and plots. Which subsequently led to the rise of deepfake pornography. It is often seen as a part of celebrity studies; however, the essay shows that it could also be considered a type of fanfic and a product of participatory culture, sharing community origin, exploitation by commercial companies and deep sexualisation. These two branches of fanfic evolution can be connected via the genre of machinima pornography. Textual fanfics are mainly created by females for females, depicting males; otherwise, deepfake pornography and machinima are made by males and for males targeting females.
    5G enabled Mobile Edge Computing security for Autonomous Vehicles. (arXiv:2202.00005v1 [cs.CR])
    The world is moving into a new era with the deployment of 5G communication infrastructure. Many new developments are deployed centred around this technology. One such advancement is 5G Vehicle to Everything communication. This technology can be used for applications such as driverless delivery of goods, immediate response to emergencies and improving traffic efficiency. The concept of Intelligent Transport Systems (ITS) is built around this system which is completely autonomous. This paper studies the Distributed Denial of Service (DDoS) attack carried out over a 5G network and analyses security attacks, particularly the DDoS attack. The aim is to implement a machine learning model capable of classifying different types of DDoS attacks and predicting the quality of 5G latency. The initial steps of implementation involved the synthetic addition of 5G parameters into the dataset. Subsequently, the data was label encoded, and minority classes were oversampled to match the other classes. Finally, the data was split as training and testing, and machine learning models were applied. Although the paper resulted in a model that predicted DDoS attacks, the dataset acquired significantly lacked 5G related information. Furthermore, the 5G classification model needed more modification. The research was based on largely quantitative research methods in a simulated environment. Hence, the biggest limitation of this research has been the lack of resources for data collection and sole reliance on online data sets. Ideally, a Vehicle to Everything (V2X) project would greatly benefit from an autonomous 5G enabled vehicle connected to a mobile edge cloud. However, this project was conducted solely online on a single PC which further limits the outcomes. Although the model underperformed, this paper can be used as a framework for future research in Intelligent Transport System development.
    Improving Parametric Neural Networks for High-Energy Physics (and Beyond). (arXiv:2202.00424v1 [hep-ex])
    Signal-background classification is a central problem in High-Energy Physics, that plays a major role for the discovery of new fundamental particles. A recent method -- the Parametric Neural Network (pNN) -- leverages multiple signal mass hypotheses as an additional input feature to effectively replace a whole set of individual classifier, each providing (in principle) the best response for a single mass hypothesis. In this work we aim at deepening the understanding of pNNs in light of real-world usage. We discovered several peculiarities of parametric networks, providing intuition, metrics, and guidelines to them. We further propose an alternative parametrization scheme, resulting in a new parametrized neural network architecture: the AffinePNN; along with many other generally applicable improvements. Finally, we extensively evaluate our models on the HEPMASS dataset, along its imbalanced version (called HEPMASS-IMB) we provide here for the first time to further validate our approach. Provided results are in terms of the impact of the proposed design decisions, classification performance, and interpolation capability as well.
    Closing the sim-to-real gap in guided wave damage detection with adversarial training of variational auto-encoders. (arXiv:2202.00570v1 [eess.SP])
    Guided wave testing is a popular approach for monitoring the structural integrity of infrastructures. We focus on the primary task of damage detection, where signal processing techniques are commonly employed. The detection performance is affected by a mismatch between the wave propagation model and experimental wave data. External variations, such as temperature, which are difficult to model, also affect the performance. While deep learning models can be an alternative detection method, there is often a lack of real-world training datasets. In this work, we counter this challenge by training an ensemble of variational autoencoders only on simulation data with a wave physics-guided adversarial component. We set up an experiment with non-uniform temperature variations to test the robustness of the methods. We compare our scheme with existing deep learning detection schemes and observe superior performance on experimental data.
    Continuous Forecasting via Neural Eigen Decomposition of Stochastic Dynamics. (arXiv:2202.00117v1 [cs.LG])
    Motivated by a real-world problem of blood coagulation control in Heparin-treated patients, we use Stochastic Differential Equations (SDEs) to formulate a new class of sequential prediction problems -- with an unknown latent space, unknown non-linear dynamics, and irregular sparse observations. We introduce the Neural Eigen-SDE (NESDE) algorithm for sequential prediction with sparse observations and adaptive dynamics. NESDE applies eigen-decomposition to the dynamics model to allow efficient frequent predictions given sparse observations. In addition, NESDE uses a learning mechanism for adaptive dynamics model, which handles changes in the dynamics both between sequences and within sequences. We demonstrate the accuracy and efficacy of NESDE for both synthetic problems and real-world data. In particular, to the best of our knowledge, we are the first to provide a patient-adapted prediction for blood coagulation following Heparin dosing in the MIMIC-IV dataset. Finally, we publish a simulated gym environment based on our prediction model, for experimentation in algorithms for blood coagulation control.
    Continual 3D Convolutional Neural Networks for Real-time Processing of Videos. (arXiv:2106.00050v2 [cs.CV] UPDATED)
    This paper introduces Continual 3D Convolutional Neural Networks (Co3D CNNs), a new computational formulation of spatio-temporal 3D CNNs, in which videos are processed frame-by-frame rather than by clip. In online processing tasks demanding frame-wise predictions, Co3D CNNs dispense with the computational redundancies of regular 3D CNNs, namely the repeated convolutions over frames, which appear in overlapping clips. We show that Continual 3D CNNs can reuse preexisting 3D-CNN weights to reduce the per-prediction floating point operations (FLOPs) in proportion to the temporal receptive field while retaining similar memory requirements and accuracy. This is validated with multiple models on the Kinetics-400 and Charades datasets with remarkable results: Continual X3D models attain state-of-the-art complexity/accuracy trade-offs on Kinetics-400 with 12.1-15.3x reductions of FLOPs and 2.3-3.8% improvements in accuracy compared to regular X3D models while reducing peak memory consumption by up to 48%. Moreover, we investigate the transient response of Co3D CNNs at start-up and perform an extensive benchmark of on-hardware processing speed and accuracy for publicly available 3D CNNs.
    Probabilistic Robust Autoencoders for Anomaly Detection. (arXiv:2110.00494v2 [cs.LG] UPDATED)
    Anomalies (or outliers) are prevalent in real-world empirical observations and potentially mask important underlying structures. Accurate identification of anomalous samples is crucial for the success of downstream data analysis tasks. To automatically identify anomalies, we propose Probabilistic Robust AutoEncoder (PRAE). PRAE is designed to simultaneously remove outliers and identify a low-dimensional representation for the inlier samples. We first present the Robust AutoEncoder (RAE) intractable objective as a minimization problem for splitting the data to inlier samples from which a low dimensional representation is learned via an AutoEncoder (AE), and anomalous (outlier) samples that are excluded as they do not fit the low dimensional representation. RAE minimizes the autoencoder's reconstruction error while incorporating as many samples as possible. This could be formulated via regularization by subtracting from the reconstruction term an $\ell_0$ norm counting the number of selected samples. Unfortunately, this leads to an intractable combinatorial problem. Therefore, we propose two probabilistic relaxations of RAE, which are differentiable and alleviate the need for a combinatorial search. We prove that the solution to the PRAE problem is equivalent to the solution of RAE. We use synthetic data to show that PRAE can accurately remove outliers in a wide range of contamination frequencies. Finally, we demonstrate that using PRAE for anomaly detection leads to state-of-the-art results on various benchmark datasets.
    Progressive Distillation for Fast Sampling of Diffusion Models. (arXiv:2202.00512v1 [cs.LG])
    Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. (arXiv:2202.00293v1 [stat.ML])
    Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
    From Examples to Rules: Neural Guided Rule Synthesis for Information Extraction. (arXiv:2202.00475v1 [cs.CL])
    While deep learning approaches to information extraction have had many successes, they can be difficult to augment or maintain as needs shift. Rule-based methods, on the other hand, can be more easily modified. However, crafting rules requires expertise in linguistics and the domain of interest, making it infeasible for most users. Here we attempt to combine the advantages of these two directions while mitigating their drawbacks. We adapt recent advances from the adjacent field of program synthesis to information extraction, synthesizing rules from provided examples. We use a transformer-based architecture to guide an enumerative search, and show that this reduces the number of steps that need to be explored before a rule is found. Further, we show that without training the synthesis algorithm on the specific domain, our synthesized rules achieve state-of-the-art performance on the 1-shot scenario of a task that focuses on few-shot learning for relation classification, and competitive performance in the 5-shot scenario.
    Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data. (arXiv:2202.00097v1 [cs.LG])
    Large scale databases with high-quality manual annotations are scarce in audio domain. We thus explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Considering each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks that can learn effective audio representations. During training, subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between the labelled and unlabeled audio samples. During inference, we use random edges to alleviate the overhead of graph construction. We evaluate our model on three benchmark audio databases, and two tasks: acoustic event detection and speech emotion recognition. Our semi-supervised model performs better or on par with fully supervised models and outperforms several competitive existing models. Our model is compact (240k parameters), and can produce generalized audio representations that are robust to different types of signal noise.
    Improved Lite Audio-Visual Speech Enhancement. (arXiv:2008.13222v3 [eess.AS] UPDATED)
    Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary and complementary input to reduce the noise of noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario. Compared to conventional AVSE systems, LAVSE requires less online computation and to some extent solves the user privacy problem on facial data. In this study, we extend LAVSE to improve its ability to address three practical issues often encountered in implementing AVSE systems, namely, the additional cost of processing visual data, audio-visual asynchronization, and low-quality visual data. The proposed system is termed improved LAVSE (iLAVSE), which uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that compared to conventional AVSE systems, iLAVSE can effectively overcome the aforementioned three practical issues and can improve enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.
    Explainable AI through the Learning of Arguments. (arXiv:2202.00383v1 [cs.AI])
    Learning arguments is highly relevant to the field of explainable artificial intelligence. It is a family of symbolic machine learning techniques that is particularly human-interpretable. These techniques learn a set of arguments as an intermediate representation. Arguments are small rules with exceptions that can be chained to larger arguments for making predictions or decisions. We investigate the learning of arguments, specifically the learning of arguments from a 'case model' proposed by Verheij [34]. The case model in Verheij's approach are cases or scenarios in a legal setting. The number of cases in a case model are relatively low. Here, we investigate whether Verheij's approach can be used for learning arguments from other types of data sets with a much larger number of instances. We compare the learning of arguments from a case model with the HeRO algorithm [15] and learning a decision tree.
    Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration. (arXiv:2202.00076v1 [stat.ML])
    Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy but only have access to a dataset generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample upper bound on policy gradient estimation error, that is governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal with a matching Cramer-Rao lower bound. Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization, using either softmax tabular or ReLU policy networks. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques.
    FlexParser -- the adaptive log file parser for continuous results in a changing world. (arXiv:2106.03170v2 [cs.LG] UPDATED)
    Any modern system writes events into files, called log files. Those contain crucial information which are subject to various analyses. Examples range from cybersecurity, intrusion detection over usage analyses to trouble shooting. Before data analysis is possible, desired information needs to be extracted first out of the semi-structured log messages. State-of-the-art event parsing often assumes static log events. However, any modern system is updated consistently and with updates also log file structures can change. We call those changes "mutation" and study parsing performance for different mutation cases. Latest research discovers mutations using anomaly detection post mortem, however, does not cover actual continuous parsing. Thus, we propose a novel and flexible parser, called FlexParser, which can extract desired values despite gradual changes in the log messages. It implies basic text preprocessing followed by a supervised Deep Learning method. We train a stateful LSTM on parsing one event per data set. Statefulness enforces the model to learn log message structures across several examples. Our model was tested on seven different, publicly available log file data sets and various kinds of mutations. Exhibiting an average F1-Score of 0.98, it outperforms other Deep Learning methods as well as state-of-the-art unsupervised parsers.
    Stochastic 2D Signal Generative Model with Wavelet Packets Basis Regarded as a Random Variable and Bayes Optimal Processing. (arXiv:2202.00568v1 [eess.SP])
    This study deals with two-dimensional (2D) signal processing using the wavelet packet transform. When the basis is unknown the candidate of basis increases in exponential order with respect to the signal size. Previous studies do not consider the basis as a random vaiables. Therefore, the cost function needs to be used to select a basis. However, this method is often a heuristic and a greedy search because it is impossible to search all the candidates for a huge number of bases. Therefore, it is difficult to evaluate the entire signal processing under a criterion and also it does not always gurantee the optimality of the entire signal processing. In this study, we propose a stochastic generative model in which the basis is regarded as a random variable. This makes it possible to evaluate entire signal processing under a unified criterion i.e. Bayes criterion. Moreover we can derive an optimal signal processing scheme that achieves the theoretical limit. This derived scheme shows that all the bases should be combined according to the posterior in stead of selecting a single basis. Although exponential order calculations is required for this scheme, we have derived a recursive algorithm for this scheme, which successfully reduces the computational complexity from the exponential order to the polynomial order.
    Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training. (arXiv:2108.06084v3 [cs.LG] UPDATED)
    Recent works have demonstrated great success in training large autoregressive language models (e.g., GPT-3) on unlabeled text corpus for text generation. To reduce their expensive training cost, practitioners attempt to increase the batch sizes and learning rates. However, increasing them often cause training instabilities and poor generalization. On the other side, using smaller batch sizes or learning rates would reduce the training efficiency, significantly increasing training time and cost. We investigate this stability-efficiency dilemma and identify that long sequence length is one of the main causes of training instability in large-scale GPT model pre-training. Based on our analysis, we present a novel sequence length warmup method that simultaneously improves training stability and efficiency. As a kind of curriculum learning approach, our method improves the training convergence speed of autoregressive models. More importantly, our in-depth analysis shows that our method exerts a gradient variance reduction effect and regularizes early stages of training where the amount of training data is much smaller than the model capacity. This enables stable training with much larger batch sizes and learning rates, further improving the training speed. Evaluations show that our approach enables stable GPT-2 (117M and 1.5B) pre-training with 8x larger batch size and 4x larger learning rate, whereas the baseline approach struggles with training instability. To achieve the same or better zero-shot WikiText-103/LAMBADA evaluation results, our approach reduces the required number of pre-training tokens and wall clock time by up to 55\% and 73\%, respectively.
    Laplacian2Mesh: Laplacian-Based Mesh Understanding. (arXiv:2202.00307v1 [cs.CV])
    Geometric deep learning has sparked a rising interest in computer graphics to perform shape understanding tasks, such as shape classification and semantic segmentation on three-dimensional (3D) geometric surfaces. Previous works explored the significant direction by defining the operations of convolution and pooling on triangle meshes, but most methods explicitly utilized the graph connection structure of the mesh. Motivated by the geometric spectral surface reconstruction theory, we introduce a novel and flexible convolutional neural network (CNN) model, called Laplacian2Mesh, for 3D triangle mesh, which maps the features of mesh in the Euclidean space to the multi-dimensional Laplacian-Beltrami space, which is similar to the multi-resolution input in 2D CNN. Mesh pooling is applied to expand the receptive field of the network by the multi-space transformation of Laplacian which retains the surface topology, and channel self-attention convolutions are applied in the new space. Since implicitly using the intrinsic geodesic connections of the mesh through the adjacency matrix, we do not consider the number of the neighbors of the vertices, thereby mesh data with different numbers of vertices can be input. Experiments on various learning tasks applied to 3D meshes demonstrate the effectiveness and efficiency of Laplacian2Mesh.  ( 2 min )
    Decentralized Personalized Federated Min-Max Problems. (arXiv:2106.07289v4 [cs.LG] UPDATED)
    Personalized Federated Learning (PFL) has recently seen tremendous progress, allowing the design of novel machine learning applications to preserve the privacy of the training data. Existing theoretical results in this field mainly focus on distributed optimization for minimization problems. This paper is the first to study PFL for saddle point problems (which cover a broader class of optimization problems), allowing for a more rich class of applications requiring more than just solving minimization problems. In this work, we consider a recently proposed PFL setting with the mixing objective function, an approach combining the learning of a global model together with locally distributed learners. Unlike most previous work, which considered only the centralized setting, we work in a more general and decentralized setup that allows us to design and analyze more practical and federated ways to connect devices to the network. We proposed new algorithms to address this problem and provide a theoretical analysis of the smooth (strongly-)convex-(strongly-)concave saddle point problems in stochastic and deterministic cases. Numerical experiments for bilinear problems and neural networks with adversarial noise demonstrate the effectiveness of the proposed methods.
    Finding Local Minimax Points via (Stochastic) Cubic-Regularized GDA: Global Convergence and Complexity. (arXiv:2110.07098v3 [math.OC] UPDATED)
    Standard gradient descent-ascent (GDA)-type algorithms can only find stationary points in nonconvex minimax optimization, which are far more sub-optimal than local minimax points. In this work, we develop GDA-type algorithms that globally converge to local minimax points in nonconvex-strongly-concave minimax optimization. We first observe that local minimax points are equivalent to second-order stationary points of a certain envelope function. Then, inspired by the classic cubic regularization algorithm, we propose Cubic-GDA--a cubic-regularized GDA algorithm for finding local minimax points, and provide a comprehensive convergence analysis by leveraging its intrinsic potential function. Specifically, we establish the global convergence of Cubic-GDA to a local minimax point at a sublinear convergence rate. We further analyze the asymptotic convergence rate of Cubic-GDA in the full spectrum of a local gradient dominant-type nonconvex geometry, establishing orderwise faster asymptotic convergence rates than the standard GDA. Moreover, we propose a stochastic variant of Cubic-GDA for large-scale minimax optimization, and characterize its sample complexity under stochastic sub-sampling.
    Designing the Topology of Graph Neural Networks: A Novel Feature Fusion Perspective. (arXiv:2112.14531v2 [cs.LG] UPDATED)
    In recent years, Graph Neural Networks (GNNs) have shown superior performance on diverse real-world applications. To improve the model capacity, besides designing aggregation operations, GNN topology design is also very important. In general, there are two mainstream GNN topology design manners. The first one is to stack aggregation operations to obtain the higher-level features but easily got performance drop as the network goes deeper. Secondly, the multiple aggregation operations are utilized in each layer which provides adequate and independent feature extraction stage on local neighbors while are costly to obtain the higher-level information. To enjoy the benefits while alleviating the corresponding deficiencies of these two manners, we learn to design the topology of GNNs in a novel feature fusion perspective which is dubbed F$^2$GNN. To be specific, we provide a feature fusion perspective in designing GNN topology and propose a novel framework to unify the existing topology designs with feature selection and fusion strategies. Then we develop a neural architecture search method on top of the unified framework which contains a set of selection and fusion operations in the search space and an improved differentiable search algorithm. The performance gains on eight real-world datasets demonstrate the effectiveness of F$^2$GNN. We further conduct experiments to show that F$^2$GNN can improve the model capacity while alleviating the deficiencies of existing GNN topology design manners, especially alleviating the over-smoothing problem, by utilizing different levels of features adaptively.
    Single Time-scale Actor-critic Method to Solve the Linear Quadratic Regulator with Convergence Guarantees. (arXiv:2202.00048v1 [math.OC])
    We propose a single time-scale actor-critic algorithm to solve the linear quadratic regulator (LQR) problem. A least squares temporal difference (LSTD) method is applied to the critic and a natural policy gradient method is used for the actor. We give a proof of convergence with sample complexity $\mO(\ve^{-1} \log(\ve^{-1})^2)$. The method in the proof is applicable to general single time-scale bilevel optimization problem. We also numerically validate our theoretical results on the convergence.
    Interactions in information spread: quantification and interpretation using stochastic block models. (arXiv:2004.04552v3 [cs.LG] UPDATED)
    In most real-world applications, it is seldom the case that a given observable evolves independently of its environment. In social networks, users' behavior results from the people they interact with, news in their feed, or trending topics. In natural language, the meaning of phrases emerges from the combination of words. In general medicine, a diagnosis is established on the basis of the interaction of symptoms. Here, we propose a new model, the Interactive Mixed Membership Stochastic Block Model (IMMSBM), which investigates the role of interactions between entities (hashtags, words, memes, etc.) and quantifies their importance within the aforementioned corpora. We find that interactions play an important role in those corpora. In inference tasks, taking them into account leads to average relative changes with respect to non-interactive models of up to 150\% in the probability of an outcome. Furthermore, their role greatly improves the predictive power of the model. Our findings suggest that neglecting interactions when modeling real-world phenomena might lead to incorrect conclusions being drawn.
    Dynamic Origin-Destination Matrix Estimation in Urban Traffic Networks. (arXiv:2202.00099v1 [math.OC])
    Given the counters of vehicles that traverse the roads of a traffic network, we aim at reconstructing the travel demand that generated them expressed in terms of the number of origin-destination trips made by users. We model the problem as a bi-level optimization problem. In the inner level, given a tentative travel demand, we solve a dynamic traffic assignment problem to decide the routing of the users between their origins and destinations. In the outer level, we adjust the number of trips and their origins and destinations, aiming at minimizing the discrepancy between the consequent counters generated in the inner level and the given vehicle counts measured by sensors in the traffic network. We solve the dynamic traffic assignment problem employing a mesoscopic model implemented by the traffic simulator SUMO. Thus, the outer problem becomes an optimization problem that minimizes a black-box objective function determined by the results of the simulation, which is a costly computation. We study different approaches to the outer level problem categorized as gradient-based and derivative-free approaches. Among the gradient-based approaches, we study an assignment matrix-based approach and an assignment matrix-free approach that uses the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm. Among the derivative-free approaches, we study machine learning algorithms to learn a model of the simulator that can then be used as a surrogated objective function in the optimization problem. We compare these approaches computationally on an artificial network. The gradient-based approaches perform the best in terms of archived solution quality and computational requirements, while the results obtained by the machine learning approach are currently less satisfactory but provide an interesting avenue of future research.  ( 2 min )
    YOUNG Star detrending for Transiting Exoplanet Recovery (YOUNGSTER) II: Using Self-Organising Maps to explore young star variability in Sectors 1-13 of TESS data. (arXiv:2202.00031v1 [astro-ph.EP])
    Young exoplanets and their corresponding host stars are fascinating laboratories for constraining the timescale of planetary evolution and planet-star interactions. However, because young stars are typically much more active than the older population, in order to discover more young exoplanets, greater knowledge of the wide array of young star variability is needed. Here Kohonen Self Organising Maps (SOMs) are used to explore young star variability present in the first year of observations from the Transiting Exoplanet Survey Satellite (TESS), with such knowledge valuable to perform targeted detrending of young stars in the future. This technique was found to be particularly effective at separating the signals of young eclipsing binaries and potential transiting objects from stellar variability, a list of which are provided in this paper. The effect of pre-training the Self-Organising Maps on known variability classes was tested, but found to be challenging without a significant training set from TESS. SOMs were also found to provide an intuitive and informative overview of leftover systematics in the TESS data, providing an important new way to characterise troublesome systematics in photometric data-sets. This paper represents the first stage of the wider YOUNGSTER program, which will use a machine-learning-based approach to classification and targeted detrending of young stars in order to improve the recovery of smaller young exoplanets.  ( 2 min )
    StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets. (arXiv:2202.00273v1 [cs.LG])
    Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN's performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets. In contrast, we find the main limiting factor to be the current training strategy. Following the recently introduced Projected GAN paradigm, we leverage powerful neural network priors and a progressive growing strategy to successfully train the latest StyleGAN3 generator on ImageNet. Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate images at a resolution of $1024^2$ at such a dataset scale. We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object classes.  ( 2 min )
    Is the Performance of My Deep Network Too Good to Be True? A Direct Approach to Estimating the Bayes Error in Binary Classification. (arXiv:2202.00395v1 [cs.LG])
    There is a fundamental limitation in the prediction performance that a machine learning model can achieve due to the inevitable uncertainty of the prediction target. In classification problems, this can be characterized by the Bayes error, which is the best achievable error with any classifier. The Bayes error can be used as a criterion to evaluate classifiers with state-of-the-art performance and can be used to detect test set overfitting. We propose a simple and direct Bayes error estimator, where we just take the mean of the labels that show \emph{uncertainty} of the classes. Our flexible approach enables us to perform Bayes error estimation even for weakly supervised data. In contrast to others, our method is model-free and even instance-free. Moreover, it has no hyperparameters and gives a more accurate estimate of the Bayes error than classifier-based baselines. Experiments using our method suggest that a recently proposed classifier, the Vision Transformer, may have already reached the Bayes error for certain benchmark datasets.  ( 2 min )
    Finding lost DG: Explaining domain generalization via model complexity. (arXiv:2202.00563v1 [stat.ML])
    The domain generalization (DG) problem setting challenges a model trained on multiple known data distributions to generalise well on unseen data distributions. Due to its practical importance, a large number of methods have been proposed to address this challenge. However much of the work in general purpose DG is heuristically motivated, as the DG problem is hard to model formally; and recent evaluations have cast doubt on existing methods' practical efficacy -- in particular compared to a well tuned empirical risk minimisation baseline. We present a novel learning-theoretic generalisation bound for DG that bounds unseen domain generalisation performance in terms of the model's Rademacher complexity. Based on this, we conjecture that existing methods' efficacy or lack thereof is largely determined by an empirical risk vs predictor complexity trade-off, and demonstrate that their performance variability can be explained in these terms. Algorithmically, this analysis suggests that domain generalisation should be achieved by simply performing regularised ERM with a leave-one-domain-out cross-validation objective. Empirical results on the DomainBed benchmark corroborate this.
    Submodularity In Machine Learning and Artificial Intelligence. (arXiv:2202.00132v1 [cs.LG])
    In this manuscript, we offer a gentle review of submodularity and supermodularity and their properties. We offer a plethora of submodular definitions; a full description of a number of example submodular functions and their generalizations; example discrete constraints; a discussion of basic algorithms for maximization, minimization, and other operations; a brief overview of continuous submodular extensions; and some historical applications. We then turn to how submodularity is useful in machine learning and artificial intelligence. This includes summarization, and we offer a complete account of the differences between and commonalities amongst sketching, coresets, extractive and abstractive summarization in NLP, data distillation and condensation, and data subset selection and feature selection. We discuss a variety of ways to produce a submodular function useful for machine learning, including heuristic hand-crafting, learning or approximately learning a submodular function or aspects thereof, and some advantages of the use of a submodular function as a coreset producer. We discuss submodular combinatorial information functions, and how submodularity is useful for clustering, data partitioning, parallel machine learning, active and semi-supervised learning, probabilistic modeling, and structured norms and loss functions.  ( 2 min )
    A heteroencoder architecture for prediction of failure locations in porous metals using variational inference. (arXiv:2202.00078v1 [physics.app-ph])
    In this work we employ an encoder-decoder convolutional neural network to predict the failure locations of porous metal tension specimens based only on their initial porosities. The process we model is complex, with a progression from initial void nucleation, to saturation, and ultimately failure. The objective of predicting failure locations presents an extreme case of class imbalance since most of the material in the specimens do not fail. In response to this challenge, we develop and demonstrate the effectiveness of data- and loss-based regularization methods. Since there is considerable sensitivity of the failure location to the particular configuration of voids, we also use variational inference to provide uncertainties for the neural network predictions. We connect the deterministic and Bayesian convolutional neural networks at a theoretical level to explain how variational inference regularizes the training and predictions. We demonstrate that the resulting predicted variances are effective in ranking the locations that are most likely to fail in any given specimen.  ( 2 min )
    Predict and Optimize: Through the Lens of Learning to Rank. (arXiv:2112.03609v2 [cs.LG] UPDATED)
    In the last years predict-and-optimize approaches, also known as decision-focussed learning, have received increasing attention. In this setting, the predictions of machine learning models are used as estimated cost coefficients in the objective function of a discrete combinatorial optimization problems for decision making. Predict-and-optimize approaches propose to train the ML models, often neural network models, by directly optimizing the quality of decisions made by the optimization solvers. Based on recent work that proposed a Noise Contrastive Estimation loss over a subset of the solution space, we observe that predict-and-optimize can more generally be seen as a learning-to-rank problem. That is, the goal is to learn an objective function that ranks the feasible points correctly. This approach is independent of the optimization method used and of the form of the objective function. We develop pointwise, pairwise and listwise ranking loss functions, which can be differentiated in closed form given a subset of solutions. We empirically investigate the quality of our generic method compared to existing predict-and-optimize approaches with competitive results. Furthermore, controlling the subset of solutions allows controlling the runtime considerably, with limited effect on regret.  ( 2 min )
    Differentiable Digital Signal Processing Mixture Model for Synthesis Parameter Extraction from Mixture of Harmonic Sounds. (arXiv:2202.00200v1 [cs.SD])
    A differentiable digital signal processing (DDSP) autoencoder is a musical sound synthesizer that combines a deep neural network (DNN) and spectral modeling synthesis. It allows us to flexibly edit sounds by changing the fundamental frequency, timbre feature, and loudness (synthesis parameters) extracted from an input sound. However, it is designed for a monophonic harmonic sound and cannot handle mixtures of harmonic sounds. In this paper, we propose a model (DDSP mixture model) that represents a mixture as the sum of the outputs of multiple pretrained DDSP autoencoders. By fitting the output of the proposed model to the observed mixture, we can directly estimate the synthesis parameters of each source. Through synthesis parameter extraction experiments, we show that the proposed method has high and stable performance compared with a straightforward method that applies the DDSP autoencoder to the signals separated by an audio source separation method.  ( 2 min )
    Memory-Efficient Backpropagation through Large Linear Layers. (arXiv:2201.13195v2 [cs.LG] UPDATED)
    In modern neural networks like Transformers, linear layers require significant memory to store activations during backward pass. This study proposes a memory reduction approach to perform backpropagation through linear layers. Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplications and demonstrate that they require less memory with a moderate decrease of the test accuracy. Also, we investigate the variance of the gradient estimate induced by the randomized matrix multiplication. We compare this variance with the variance coming from gradient estimation based on the batch of samples. We demonstrate the benefits of the proposed method on the fine-tuning of the pre-trained RoBERTa model on GLUE tasks.  ( 2 min )
    Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction. (arXiv:2202.00441v1 [cs.LG])
    Memory footprint is one of the main limiting factors for large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and such operation induces additional memory costs which -- as we show -- can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute optimal quantization of the retained gradients of the pointwise nonlinear functions with only a few bits per each element. We show that such approximation can be achieved by computing optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. The drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and the same convergence on several open benchmarks.
    Bringing Differential Private SGD to Practice: On the Independence of Gaussian Noise and the Number of Training Rounds. (arXiv:2102.09030v4 [cs.LG] UPDATED)
    In DP-SGD each round communicates a local SGD update which leaks some new information about the underlying local data set to the outside world. In order to provide privacy, Gaussian noise with standard deviation $\sigma$ is added to local SGD updates after performing a clipping operation. We show that for attaining $(\epsilon,\delta)$-differential privacy $\sigma$ can be chosen equal to $\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ for $\epsilon=\Omega(T/N^2)$, where $T$ is the total number of rounds and $N$ is equal to the size of the local data set. In many existing machine learning problems, $N$ is always large and $T=O(N)$. Hence, $\sigma$ becomes "independent" of any $T=O(N)$ choice with $\epsilon=\Omega(1/N)$. This means that our $\sigma$ only depends on $N$ rather than $T$. As shown in our paper, this differential privacy characterization allows one to {\it a-priori} select parameters of DP-SGD based on a fixed privacy budget (in terms of $\epsilon$ and $\delta$) in such a way to optimize the anticipated utility (test accuracy) the most. This ability of planning ahead together with $\sigma$'s independence of $T$ (which allows local gradient computations to be split among as many rounds as needed, even for large $T$ as usually happens in practice) leads to a {\it proactive DP-SGD algorithm} that allows a client to balance its privacy budget with the accuracy of the learned global model based on local test data. We notice that the current state-of-the art differential privacy accountant method based on $f$-DP has a closed form for computing the privacy loss for DP-SGD. However, due to its interpretation complexity, it cannot be used in a simple way to plan ahead. Instead, accountant methods are only used for keeping track of how privacy budget has been spent (after the fact).  ( 3 min )
    Evaluating Feature Attribution: An Information-Theoretic Perspective. (arXiv:2202.00449v1 [cs.CV])
    With a variety of local feature attribution methods being proposed in recent years, follow-up work suggested several evaluation strategies. To assess the attribution quality across different attribution techniques, the most popular among these evaluation strategies in the image domain use pixel perturbations. However, recent advances discovered that different evaluation strategies produce conflicting rankings of attribution methods and can be prohibitively expensive to compute. In this work, we present an information-theoretic analysis of evaluation strategies based on pixel perturbations. Our findings reveal that the results output by different evaluation strategies are strongly affected by information leakage through the shape of the removed pixels as opposed to their actual values. Using our theoretical insights, we propose a novel evaluation framework termed Remove and Debias (ROAD) which offers two contributions: First, it mitigates the impact of the confounders, which entails higher consistency among evaluation strategies. Second, ROAD does not require the computationally expensive retraining step and saves up to 99% in computational costs compared to the state-of-the-art. Our source code is available at https://github.com/tleemann/road_evaluation.
    Filtered-CoPhy: Unsupervised Learning of Counterfactual Physics in Pixel Space. (arXiv:2202.00368v1 [cs.CV])
    Learning causal relationships in high-dimensional data (images, videos) is a hard task, as they are often defined on low dimensional manifolds and must be extracted from complex signals dominated by appearance, lighting, textures and also spurious correlations in the data. We present a method for learning counterfactual reasoning of physical processes in pixel space, which requires the prediction of the impact of interventions on initial conditions. Going beyond the identification of structural relationships, we deal with the challenging problem of forecasting raw video over long horizons. Our method does not require the knowledge or supervision of any ground truth positions or other object or scene properties. Our model learns and acts on a suitable hybrid latent representation based on a combination of dense features, sets of 2D keypoints and an additional latent vector per keypoint. We show that this better captures the dynamics of physical processes than purely dense or sparse representations. We introduce a new challenging and carefully designed counterfactual benchmark for predictions in pixel space and outperform strong baselines in physics-inspired ML and video prediction.  ( 2 min )
    Insights into performance evaluation of com-pound-protein interaction prediction methods. (arXiv:2202.00001v1 [q-bio.QM])
    Motivation: Machine learning based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing studies and can improve the efficiency and cost-effectiveness of wet lab assays. Despite the publication of many research papers reporting CPI predictors in the recent years, we have observed a number of fundamental issues in experiment design that lead to over optimistic estimates of model performance. Results: In this paper, we analyze the impact of several important factors affecting generalization perfor-mance of CPI predictors that are overlooked in existing work: 1. Similarity between training and test examples in cross-validation 2. The strategy for generating negative examples, in the absence of experimentally verified negative examples. 3. Choice of evaluation protocols and performance metrics and their alignment with real-world use of CPI predictors in screening large compound libraries. Using both an existing state-of-the-art method (CPI-NN) and a proposed kernel based approach, we have found that assessment of predictive performance of CPI predictors requires careful con-trol over similarity between training and test examples. We also show that random pairing for gen-erating synthetic negative examples for training and performance evaluation results in models with better generalization performance in comparison to more sophisticated strategies used in existing studies. Furthermore, we have found that our kernel based approach, despite its simple design, exceeds the prediction performance of CPI-NN. We have used the proposed model for compound screening of several proteins including SARS-CoV-2 Spike and Human ACE2 proteins and found strong evidence in support of its top hits. Availability: Code and raw experimental results available at https://github.com/adibayaseen/HKRCPI Contact: Fayyaz.minhas@warwick.ac.uk  ( 3 min )
    Access Control of Object Detection Models Using Encrypted Feature Maps. (arXiv:2202.00265v1 [cs.CV])
    In this paper, we propose an access control method for object detection models. The use of encrypted images or encrypted feature maps has been demonstrated to be effective in access control of models from unauthorized access. However, the effectiveness of the approach has been confirmed in only image classification models and semantic segmentation models, but not in object detection models. In this paper, the use of encrypted feature maps is shown to be effective in access control of object detection models for the first time.  ( 2 min )
    Fully Online Meta-Learning Without Task Boundaries. (arXiv:2202.00263v1 [cs.LG])
    While deep networks can learn complex functions such as classifiers, detectors, and trackers, many applications require models that continually adapt to changing input distributions, changing tasks, and changing environmental conditions. Indeed, this ability to continuously accrue knowledge and use past experience to learn new tasks quickly in continual settings is one of the key properties of an intelligent system. For complex and high-dimensional problems, simply updating the model continually with standard learning algorithms such as gradient descent may result in slow adaptation. Meta-learning can provide a powerful tool to accelerate adaptation yet is conventionally studied in batch settings. In this paper, we study how meta-learning can be applied to tackle online problems of this nature, simultaneously adapting to changing tasks and input distributions and meta-training the model in order to adapt more quickly in the future. Extending meta-learning into the online setting presents its own challenges, and although several prior methods have studied related problems, they generally require a discrete notion of tasks, with known ground-truth task boundaries. Such methods typically adapt to each task in sequence, resetting the model between tasks, rather than adapting continuously across tasks. In many real-world settings, such discrete boundaries are unavailable, and may not even exist. To address these settings, we propose a Fully Online Meta-Learning (FOML) algorithm, which does not require any ground truth knowledge about the task boundaries and stays fully online without resetting back to pre-trained weights. Our experiments show that FOML was able to learn new tasks faster than the state-of-the-art online learning methods on Rainbow-MNIST, CIFAR100 and CELEBA datasets.  ( 2 min )
    Imbedding Deep Neural Networks. (arXiv:2202.00113v1 [cs.LG])
    Continuous depth neural networks, such as Neural ODEs, have refashioned the understanding of residual neural networks in terms of non-linear vector-valued optimal control problems. The common solution is to use the adjoint sensitivity method to replicate a forward-backward pass optimisation problem. We propose a new approach which explicates the network's `depth' as a fundamental variable, thus reducing the problem to a system of forward-facing initial value problems. This new method is based on the principle of `Invariant Imbedding' for which we prove a general solution, applicable to all non-linear, vector-valued optimal control problems with both running and terminal loss. Our new architectures provide a tangible tool for inspecting the theoretical--and to a great extent unexplained--properties of network depth. They also constitute a resource of discrete implementations of Neural ODEs comparable to classes of imbedded residual neural networks. Through a series of experiments, we show the competitive performance of the proposed architectures for supervised learning and time series prediction.  ( 2 min )
    Generalization in Cooperative Multi-Agent Systems. (arXiv:2202.00104v1 [cs.LG])
    Collective intelligence is a fundamental trait shared by several species of living organisms. It has allowed them to thrive in the diverse environmental conditions that exist on our planet. From simple organisations in an ant colony to complex systems in human groups, collective intelligence is vital for solving complex survival tasks. As is commonly observed, such natural systems are flexible to changes in their structure. Specifically, they exhibit a high degree of generalization when the abilities or the total number of agents changes within a system. We term this phenomenon as Combinatorial Generalization (CG). CG is a highly desirable trait for autonomous systems as it can increase their utility and deployability across a wide range of applications. While recent works addressing specific aspects of CG have shown impressive results on complex domains, they provide no performance guarantees when generalizing towards novel situations. In this work, we shed light on the theoretical underpinnings of CG for cooperative multi-agent systems (MAS). Specifically, we study generalization bounds under a linear dependence of the underlying dynamics on the agent capabilities, which can be seen as a generalization of Successor Features to MAS. We then extend the results first for Lipschitz and then arbitrary dependence of rewards on team capabilities. Finally, empirical analysis on various domains using the framework of multi-agent reinforcement learning highlights important desiderata for multi-agent algorithms towards ensuring CG.  ( 2 min )
    Evaluating Deep Vs. Wide & Deep Learners As Contextual Bandits For Personalized Email Promo Recommendations. (arXiv:2202.00146v1 [cs.LG])
    Personalization enables businesses to learn customer preferences from past interactions and thus to target individual customers with more relevant content. We consider the problem of predicting the optimal promotional offer for a given customer out of several options as a contextual bandit problem. Identifying information for the customer and/or the campaign can be used to deduce unknown customer/campaign features that improve optimal offer prediction. Using a generated synthetic email promo dataset, we demonstrate similar prediction accuracies for (a) a wide and deep network that takes identifying information (or other categorical features) as input to the wide part and (b) a deep-only neural network that includes embeddings of categorical features in the input. Improvements in accuracy from including categorical features depends on the variability of the unknown numerical features for each category. We also show that selecting options using upper confidence bound or Thompson sampling, approximated via Monte Carlo dropout layers in the wide and deep models, slightly improves model performance.  ( 2 min )
    Recursion, evolution and conscious self. (arXiv:2001.11825v3 [cs.LO] UPDATED)
    We introduce and study a learning theory which is roughly automatic, that is, it does not require but a minimum of initial programming, and is based on the potential computational phenomenon of self-reference, (i.e. the potential ability of an algorithm to have its program as an input). The conclusions agree with scientific findings in both biology and neuroscience and provide a plethora of explanations both (in conjunction with Darwinism) about evolution, as well as for the functionality and learning capabilities of human brain, (most importantly), as we perceive them in ourselves.  ( 2 min )
    CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery. (arXiv:2202.00161v1 [cs.LG])
    We introduce Contrastive Intrinsic Control (CIC), an algorithm for unsupervised skill discovery that maximizes the mutual information between skills and state transitions. In contrast to most prior approaches, CIC uses a decomposition of the mutual information that explicitly incentivizes diverse behaviors by maximizing state entropy. We derive a novel lower bound estimate for the mutual information which combines a particle estimator for state entropy to generate diverse behaviors and contrastive learning to distill these behaviors into distinct skills. We evaluate our algorithm on the Unsupervised Reinforcement Learning Benchmark, which consists of a long reward-free pre-training phase followed by a short adaptation phase to downstream tasks with extrinsic rewards. We find that CIC substantially improves over prior unsupervised skill discovery methods and outperforms the next leading overall exploration algorithm in terms of downstream task performance.  ( 2 min )
    3D Visualization and Spatial Data Mining for Analysis of LULC Images. (arXiv:2202.00123v1 [cs.CV])
    The present study is an attempt made to create a new tool for the analysis of Land Use Land Cover (LUCL) images in 3D visualization. This study mainly uses spatial data mining techniques on high resolution LULC satellite imagery. Visualization of feature space allows exploration of patterns in the image data and insight into the classification process and related uncertainty. Visual Data Mining provides added value to image classifications as the user can be involved in the classification process providing increased confidence in and understanding of the results. In this study, we present a prototype of image segmentation, K-Means clustering and 3D visualization tool for visual data mining (VDM) of LUCL satellite imagery into volume visualization. This volume based representation divides feature space into spheres or voxels. The visualization tool is showcased in a classification study of high-resolution LULC imagery of Latur district (Maharashtra state, India) is used as sample data.  ( 2 min )
    Understanding AdamW through Proximal Methods and Scale-Freeness. (arXiv:2202.00089v1 [cs.LG])
    Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. Yet, we are still lacking a complete explanation of the advantages of AdamW. In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal mapping of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$. Next, we consider the property of "scale-freeness" enjoyed by AdamW and by its proximal counterpart: their updates are invariant to component-wise rescaling of the gradients. We provide empirical evidence across a wide range of deep learning experiments showing a correlation between the problems in which AdamW exhibits an advantage over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales, thus motivating the hypothesis that the advantage of AdamW could be due to the scale-free updates.  ( 2 min )
    Fortuitous Forgetting in Connectionist Networks. (arXiv:2202.00155v1 [cs.LG])
    Forgetting is often seen as an unwanted characteristic in both human and machine learning. However, we propose that forgetting can in fact be favorable to learning. We introduce "forget-and-relearn" as a powerful paradigm for shaping the learning trajectories of artificial neural networks. In this process, the forgetting step selectively removes undesirable information from the model, and the relearning step reinforces features that are consistently useful under different conditions. The forget-and-relearn framework unifies many existing iterative training algorithms in the image classification and language emergence literature, and allows us to understand the success of these algorithms in terms of the disproportionate forgetting of undesirable information. We leverage this understanding to improve upon existing algorithms by designing more targeted forgetting operations. Insights from our analysis provide a coherent view on the dynamics of iterative training in neural networks and offer a clear path towards performance improvements.  ( 2 min )
    Wide Neural Networks Forget Less Catastrophically. (arXiv:2110.11526v2 [cs.LG] UPDATED)
    A primary focus area in continual learning research is alleviating the "catastrophic forgetting" problem in neural networks by designing new algorithms that are more robust to the distribution shifts. While the recent progress in continual learning literature is encouraging, our understanding of what properties of neural networks contribute to catastrophic forgetting is still limited. To address this, instead of focusing on continual learning algorithms, in this work, we focus on the model itself and study the impact of "width" of the neural network architecture on catastrophic forgetting, and show that width has a surprisingly significant effect on forgetting. To explain this effect, we study the learning dynamics of the network from various perspectives such as gradient orthogonality, sparsity, and lazy training regime. We provide potential explanations that are consistent with the empirical results across different architectures and continual learning benchmarks.
    PAGE-PG: A Simple and Loopless Variance-Reduced Policy Gradient Method with Probabilistic Gradient Estimation. (arXiv:2202.00308v1 [cs.LG])
    Despite their success, policy gradient methods suffer from high variance of the gradient estimate, which can result in unsatisfactory sample complexity. Recently, numerous variance-reduced extensions of policy gradient methods with provably better sample complexity and competitive numerical performance have been proposed. After a compact survey on some of the main variance-reduced REINFORCE-type methods, we propose ProbAbilistic Gradient Estimation for Policy Gradient (PAGE-PG), a novel loopless variance-reduced policy gradient method based on a probabilistic switch between two types of updates. Our method is inspired by the PAGE estimator for supervised learning and leverages importance sampling to obtain an unbiased gradient estimator. We show that PAGE-PG enjoys a $\mathcal{O}\left( \epsilon^{-3} \right)$ average sample complexity to reach an $\epsilon$-stationary solution, which matches the sample complexity of its most competitive counterparts under the same setting. A numerical evaluation confirms the competitive performance of our method on classical control tasks.  ( 2 min )
    Deep Kernelized Dense Geometric Matching. (arXiv:2202.00667v1 [cs.CV])
    Dense geometric matching is a challenging computer vision task, requiring accurate correspondences under extreme variations in viewpoint and illumination, even for low-texture regions. In this task, finding accurate global correspondences is essential for later refinement stages. The current learning based paradigm is to perform global fixed-size correlation, followed by flattening and convolution to predict correspondences. In this work, we consider the problem from a different perspective and propose to formulate global correspondence estimation as a continuous probabilistic regression task using deep kernels, yielding a novel approach to learning dense correspondences. Our full approach, \textbf{D}eep \textbf{K}ernelized \textbf{M}atching, achieves significant improvements compared to the state-of-the-art on the competitive HPatches and YFCC100m benchmarks, and we dissect the gains of our contributions in a thorough ablation study.  ( 2 min )
    Stability and Generalization Capabilities of Message Passing Graph Neural Networks. (arXiv:2202.00645v1 [cs.LG])
    Message passing neural networks (MPNN) have seen a steep rise in popularity since their introduction as generalizations of convolutional neural networks to graph structured data, and are now considered state-of-the-art tools for solving a large variety of graph-focused problems. We study the generalization capabilities of MPNNs in graph classification. We assume that graphs of different classes are sampled from different random graph models. Based on this data distribution, we derive a non-asymptotic bound on the generalization gap between the empirical and statistical loss, that decreases to zero as the graphs become larger. This is proven by showing that a MPNN, applied on a graph, approximates the MPNN applied on the geometric model that the graph discretizes.  ( 2 min )
    MEGA: Model Stealing via Collaborative Generator-Substitute Networks. (arXiv:2202.00008v1 [cs.CR])
    Deep machine learning models are increasingly deployedin the wild for providing services to users. Adversaries maysteal the knowledge of these valuable models by trainingsubstitute models according to the inference results of thetargeted deployed models. Recent data-free model stealingmethods are shown effective to extract the knowledge of thetarget model without using real query examples, but they as-sume rich inference information, e.g., class probabilities andlogits. However, they are all based on competing generator-substitute networks and hence encounter training instability.In this paper we propose a data-free model stealing frame-work,MEGA, which is based on collaborative generator-substitute networks and only requires the target model toprovide label prediction for synthetic query examples. Thecore of our method is a model stealing optimization con-sisting of two collaborative models (i) the substitute modelwhich imitates the target model through the synthetic queryexamples and their inferred labels and (ii) the generatorwhich synthesizes images such that the confidence of thesubstitute model over each query example is maximized. Wepropose a novel coordinate descent training procedure andanalyze its convergence. We also empirically evaluate thetrained substitute model on three datasets and its applicationon black-box adversarial attacks. Our results show that theaccuracy of our trained substitute model and the adversarialattack success rate over it can be up to 33% and 40% higherthan state-of-the-art data-free black-box attacks.  ( 2 min )
    Real-Time Facial Expression Recognition using Facial Landmarks and Neural Networks. (arXiv:2202.00102v1 [cs.CV])
    This paper presents a lightweight algorithm for feature extraction, classification of seven different emotions, and facial expression recognition in a real-time manner based on static images of the human face. In this regard, a Multi-Layer Perceptron (MLP) neural network is trained based on the foregoing algorithm. In order to classify human faces, first, some pre-processing is applied to the input image, which can localize and cut out faces from it. In the next step, a facial landmark detection library is used, which can detect the landmarks of each face. Then, the human face is split into upper and lower faces, which enables the extraction of the desired features from each part. In the proposed model, both geometric and texture-based feature types are taken into account. After the feature extraction phase, a normalized vector of features is created. A 3-layer MLP is trained using these feature vectors, leading to 96% accuracy on the test set.  ( 2 min )
    An Assessment of the Impact of OCR Noise on Language Models. (arXiv:2202.00470v1 [cs.CL])
    Neural language models are the backbone of modern-day natural language processing applications. Their use on textual heritage collections which have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless, our understanding of the impact OCR noise could have on language models is still limited. We perform an assessment of the impact OCR noise has on a variety of language models, using data in Dutch, English, French and German. We find that OCR noise poses a significant obstacle to language modelling, with language models increasingly diverging from their noiseless targets as OCR quality lowers. In the presence of small corpora, simpler models including PPMI and Word2Vec consistently outperform transformer-based models in this respect.  ( 2 min )
    Meta-Learning Hypothesis Spaces for Sequential Decision-making. (arXiv:2202.00602v1 [stat.ML])
    Obtaining reliable, adaptive confidence sets for prediction functions (hypotheses) is a central challenge in sequential decision-making tasks, such as bandits and model-based reinforcement learning. These confidence sets typically rely on prior assumptions on the hypothesis space, e.g., the known kernel of a Reproducing Kernel Hilbert Space (RKHS). Hand-designing such kernels is error prone, and misspecification may lead to poor or unsafe performance. In this work, we propose to meta-learn a kernel from offline data (Meta-KeL). For the case where the unknown kernel is a combination of known base kernels, we develop an estimator based on structured sparsity. Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets that, with increasing amounts of offline data, become as tight as those given the true unknown kernel. We demonstrate our approach on the kernelized bandit problem (a.k.a.~Bayesian optimization), where we establish regret bounds competitive with those given the true kernel. We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
    Safe Screening for Logistic Regression with $\ell_0$-$\ell_2$ Regularization. (arXiv:2202.00467v1 [stat.ML])
    In logistic regression, it is often desirable to utilize regularization to promote sparse solutions, particularly for problems with a large number of features compared to available labels. In this paper, we present screening rules that safely remove features from logistic regression with $\ell_0-\ell_2$ regularization before solving the problem. The proposed safe screening rules are based on lower bounds from the Fenchel dual of strong conic relaxations of the logistic regression problem. Numerical experiments with real and synthetic data suggest that a high percentage of the features can be effectively and safely removed apriori, leading to substantial speed-up in the computations.
    Machine-learning-enhanced quantum sensors for accurate magnetic field imaging. (arXiv:2202.00380v1 [quant-ph])
    Local detection of magnetic fields is crucial for characterizing nano- and micro-materials and has been implemented using various scanning techniques or even diamond quantum sensors. Diamond nanoparticles (nanodiamonds) offer an attractive opportunity to chieve high spatial resolution because they can easily be close to the target within a few 10 nm simply by attaching them to its surface. A physical model for such a randomly oriented nanodiamond ensemble (NDE) is available, but the complexity of actual experimental conditions still limits the accuracy of deducing magnetic fields. Here, we demonstrate magnetic field imaging with high accuracy of 1.8 $\mu$T combining NDE and machine learning without any physical models. We also discover the field direction dependence of the NDE signal, suggesting the potential application for vector magnetometry and improvement of the existing model. Our method further enriches the performance of NDE to achieve the accuracy to visualize mesoscopic current and magnetism in atomic-layer materials and to expand the applicability in arbitrarily shaped materials, including living organisms. This achievement will bridge machine learning and quantum sensing for accurate measurements.  ( 2 min )
    Can semi-supervised learning reduce the amount of manual labelling required for effective radio galaxy morphology classification?. (arXiv:2111.04357v4 [astro-ph.GA] UPDATED)
    In this work, we examine the robustness of state-of-the-art semi-supervised learning (SSL) algorithms when applied to morphological classification in modern radio astronomy. We test whether SSL can achieve performance comparable to the current supervised state of the art when using many fewer labelled data points and if these results generalise to using truly unlabelled data. We find that although SSL provides additional regularisation, its performance degrades rapidly when using very few labels, and that using truly unlabelled data leads to a significant drop in performance.
    Natural Language to Code Using Transformers. (arXiv:2202.00367v1 [cs.CL])
    We tackle the problem of generating code snippets from natural language descriptions using the CoNaLa dataset. We use the self-attention based transformer architecture and show that it performs better than recurrent attention-based encoder decoder. Furthermore, we develop a modified form of back translation and use cycle consistent losses to train the model in an end-to-end fashion. We achieve a BLEU score of 16.99 beating the previously reported baseline of the CoNaLa challenge.
    A General, Evolution-Inspired Reward Function for Social Robotics. (arXiv:2202.00617v1 [cs.RO])
    The field of social robotics will likely need to depart from a paradigm of designed behaviours and imitation learning and adopt modern reinforcement learning (RL) methods to enable robots to interact fluidly and efficaciously with humans. In this paper, we present the Social Reward Function as a mechanism to provide (1) a real-time, dense reward function necessary for the deployment of RL agents in social robotics, and (2) a standardised objective metric for comparing the efficacy of different social robots. The Social Reward Function is designed to closely mimic those genetically endowed social perception capabilities of humans in an effort to provide a simple, stable and culture-agnostic reward function. Presently, datasets used in social robotics are either small or significantly out-of-domain with respect to social robotics. The use of the Social Reward Function will allow larger in-domain datasets to be collected close to the behaviour policy of social robots, which will allow both further improvements to reward functions and to the behaviour policies of social robots. We believe this will be the key enabler to developing efficacious social robots in the future.
    Arrhythmia Classification using CGAN-augmented ECG Signals. (arXiv:2202.00569v1 [eess.SP])
    One of the easiest ways to diagnose cardiovascular conditions is Electrocardiogram (ECG) analysis. ECG databases usually have highly imbalanced distributions due to the abundance of Normal ECG and scarcity of abnormal cases which are equally, if not more, important for arrhythmia detection. As such, DL classifiers trained on these datasets usually perform poorly, especially on minor classes. One solution to address the imbalance is to generate realistic synthetic ECG signals mostly using Generative Adversarial Networks (GAN) to augment and the datasets. In this study, we designed an experiment to investigate the impact of data augmentation on arrhythmia classification. Using the MIT-BIH Arrhythmia dataset, we employed two ways for ECG beats generation: (i) an unconditional GAN, i.e., Wasserstein GAN with gradient penalty (WGAN-GP) is trained on each class individually; (ii) a conditional GAN model, i.e., Auxiliary Classifier Wasserstein GAN with gradient penalty (AC-WGAN-GP) is trained on all the available classes to train one single generator. Two scenarios are defined for each case: i) unscreened where all the generated synthetic beats were used directly without any post-processing, and ii) screened where a portion of generated beats are selected based on their Dynamic Time Warping (DTW) distance with a designated template. A ResNet classifier is trained on each of the four augmented datasets and the performance metrics of precision, recall and F1-Score as well as the confusion matrices were compared with the reference case, i.e., when the classifier is trained on the imbalanced original dataset. The results show that in all four cases augmentation achieves impressive improvements in metrics particularly on minor classes (typically from 0 or 0.27 to 0.99). The quality of the generated beats is also evaluated using DTW distance function compared with real data.
    Oracle Teacher: Towards Better Knowledge Distillation. (arXiv:2111.03664v2 [cs.LG] UPDATED)
    Knowledge distillation (KD), best known as an effective method for model compression, aims at transferring the knowledge of a bigger network (teacher) to a much smaller network (student). Conventional KD methods usually employ the teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for KD, namely Oracle Teacher, that utilizes the embeddings of both the source inputs and the output labels to extract a more accurate knowledge to be transferred to the student. One potential risk for the proposed approach is a trivial solution that the model's output directly copies the oracle input. In Oracle Teacher, we present a training scheme that can effectively prevent the trivial solution and thus enables utilizing both source inputs and oracle inputs for model training. Extensive experiments are conducted on three different sequence learning tasks: speech recognition, scene text recognition, and machine translation. From the experimental results, we empirically show that the proposed model improves the students across these tasks while achieving a considerable speed-up in the teacher model's training time.
    Ulixes: Facial Recognition Privacy with Adversarial Machine Learning. (arXiv:2010.10242v2 [cs.CV] UPDATED)
    Facial recognition tools are becoming exceptionally accurate in identifying people from images. However, this comes at the cost of privacy for users of online services with photo management (e.g. social media platforms). Particularly troubling is the ability to leverage unsupervised learning to recognize faces even when the user has not labeled their images. In this paper we propose Ulixes, a strategy to generate visually non-invasive facial noise masks that yield adversarial examples, preventing the formation of identifiable user clusters in the embedding space of facial encoders. This is applicable even when a user is unmasked and labeled images are available online. We demonstrate the effectiveness of Ulixes by showing that various classification and clustering methods cannot reliably label the adversarial examples we generate. We also study the effects of Ulixes in various black-box settings and compare it to the current state of the art in adversarial machine learning. Finally, we challenge the effectiveness of Ulixes against adversarially trained models and show that it is robust to countermeasures.
    A Low-Cost Neural ODE with Depthwise Separable Convolution for Edge Domain Adaptation on FPGAs. (arXiv:2107.12824v2 [cs.LG] UPDATED)
    High-performance deep neural network (DNN)-based systems are in high demand in edge environments. Due to its high computational complexity, it is challenging to deploy DNNs on edge devices with strict limitations on computational resources. In this paper, we derive a compact while highly-accurate DNN model, termed dsODENet, by combining recently-proposed parameter reduction techniques: Neural ODE (Ordinary Differential Equation) and DSC (Depthwise Separable Convolution). Neural ODE exploits a similarity between ResNet and ODE, and shares most of weight parameters among multiple layers, which greatly reduces the memory consumption. We apply dsODENet to a domain adaptation as a practical use case with image classification datasets. We also propose a resource-efficient FPGA-based design for dsODENet, where all the parameters and feature maps except for pre- and post-processing layers can be mapped onto on-chip memories. It is implemented on Xilinx ZCU104 board and evaluated in terms of domain adaptation accuracy, training speed, FPGA resource utilization, and speedup rate compared to a software counterpart. The results demonstrate that dsODENet achieves comparable or slightly better domain adaptation accuracy compared to our baseline Neural ODE implementation, while the total parameter size without pre- and post-processing layers is reduced by 54.2% to 79.8%. Our FPGA implementation accelerates the inference speed by 27.9 times.
    Model Explanations via the Axiomatic Causal Lens. (arXiv:2109.03890v3 [cs.LG] UPDATED)
    Explaining the decisions of black-box models is a central theme in the study of trustworthy ML. Numerous measures have been proposed in the literature; however, none of them take an axiomatic approach to causal explainability. In this work, we propose three explanation measures which aggregate the set of all but-for causes -- a necessary and sufficient explanation -- into feature importance weights. Our first measure is a natural adaptation of Chockler and Halpern's notion of causal responsibility, whereas the other two correspond to existing game-theoretic influence measures. We present an axiomatic treatment for our proposed indices, showing that they can be uniquely characterized by a set of desirable properties. We also extend our approach to derive a new method to compute the Shapley-Shubik and Banzhaf indices for black-box model explanations. Finally, we analyze and compare the necessity and sufficiency of all our proposed explanation measures in practice using the Adult-Income dataset. Thus, our work is the first to formally bridge the gap between model explanations, game-theoretic influence, and causal analysis.
    Hierarchical Gaussian Processes with Wasserstein-2 Kernels. (arXiv:2010.14877v2 [stat.ML] UPDATED)
    Stacking Gaussian Processes severely diminishes the model's ability to detect outliers, which when combined with non-zero mean functions, further extrapolates low non-parametric variance to low training data density regions. We propose a hybrid kernel inspired from Varifold theory, operating in both Euclidean and Wasserstein space. We posit that directly taking into account the variance in the computation of Wasserstein-2 distances is of key importance towards maintaining outlier status throughout the hierarchy. We show improved performance on medium and large scale datasets and enhanced out-of-distribution detection on both toy and real data.
    Scalable Anytime Algorithms for Learning Fragments of Linear Temporal Logic. (arXiv:2110.06726v3 [cs.AI] UPDATED)
    Linear temporal logic (LTL) is a specification language for finite sequences (called traces) widely used in program verification, motion planning in robotics, process mining, and many other areas. We consider the problem of learning LTL formulas for classifying traces; despite a growing interest of the research community, existing solutions suffer from two limitations: they do not scale beyond small formulas, and they may exhaust computational resources without returning any result. We introduce a new algorithm addressing both issues: our algorithm is able to construct formulas an order of magnitude larger than previous methods, and it is anytime, meaning that it in most cases successfully outputs a formula, albeit possibly not of minimal size. We evaluate the performances of our algorithm using an open source implementation against publicly available benchmarks.
    Query Efficient Decision Based Sparse Attacks Against Black-Box Deep Learning Models. (arXiv:2202.00091v1 [cs.LG])
    Despite our best efforts, deep learning models remain highly vulnerable to even tiny adversarial perturbations applied to the inputs. The ability to extract information from solely the output of a machine learning model to craft adversarial perturbations to black-box models is a practical threat against real-world systems, such as autonomous cars or machine learning models exposed as a service (MLaaS). Of particular interest are sparse attacks. The realization of sparse attacks in black-box models demonstrates that machine learning models are more vulnerable than we believe. Because these attacks aim to minimize the number of perturbed pixels measured by l_0 norm-required to mislead a model by solely observing the decision (the predicted label) returned to a model query; the so-called decision-based attack setting. But, such an attack leads to an NP-hard optimization problem. We develop an evolution-based algorithm-SparseEvo-for the problem and evaluate against both convolutional deep neural networks and vision transformers. Notably, vision transformers are yet to be investigated under a decision-based attack setting. SparseEvo requires significantly fewer model queries than the state-of-the-art sparse attack Pointwise for both untargeted and targeted attacks. The attack algorithm, although conceptually simple, is also competitive with only a limited query budget against the state-of-the-art gradient-based whitebox attacks in standard computer vision tasks such as ImageNet. Importantly, the query efficient SparseEvo, along with decision-based attacks, in general, raise new questions regarding the safety of deployed systems and poses new directions to study and understand the robustness of machine learning models.
    ReGVD: Revisiting Graph Neural Networks for Vulnerability Detection. (arXiv:2110.07317v2 [cs.LG] UPDATED)
    Identifying vulnerabilities in the source code is essential to protect the software systems from cyber security attacks. It, however, is also a challenging step that requires specialized expertise in security and code representation. To this end, we aim to develop a general, practical, and programming language-independent model capable of running on various source codes and libraries without difficulty. Therefore, we consider vulnerability detection as an inductive text classification problem and propose ReGVD, a simple yet effective graph neural network-based model for the problem. In particular, ReGVD views each raw source code as a flat sequence of tokens to build a graph, wherein node features are initialized by only the token embedding layer of a pre-trained programming language (PL) model. ReGVD then leverages residual connection among GNN layers and examines a mixture of graph-level sum and max poolings to return a graph embedding for the source code. Experimental results demonstrate that ReGVD outperforms the existing state-of-the-art models and obtains the highest accuracy on the real-world benchmark dataset from CodeXGLUE for vulnerability detection. Our code is available at: \url{https://github.com/daiquocnguyen/GNN-ReGVD}.
    Datamodels: Predicting Predictions from Training Data. (arXiv:2202.00622v1 [stat.ML])
    We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that for any subset of $S' \subset S$ -- using only information about which examples of $S$ are contained in $S'$ -- predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the potential complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels can successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. Data for this paper (including pre-computed datamodels as well as raw predictions from four million trained deep neural networks) is available at https://github.com/MadryLab/datamodels-data .
    Study of filtered-x logarithmic recursive least $p$-power algorithm. (arXiv:2202.00560v1 [eess.SP])
    For active impulsive noise control, a filtered-x recursive least $p$-power (FxRLP) algorithm is proposed by minimizing the weighted summation of the $p$-power of the \emph{a posteriori} errors. Since the characteristic of the target noise is investigated, the FxRLP algorithm achieves good performance and robustness. To obtain a better performance, we develop a filtered-x logarithmic recursive least $p$-power (FxlogRLP) algorithm which integrates the $p$-order moment with the logarithmic-order moment. Simulation results demonstrate that the FxlogRLP algorithm is superior to the existing algorithms in terms of convergence rate and noise reduction.
    Machine learning based modelling and optimization in hard turning of AISI D6 steel with newly developed AlTiSiN coated carbide tool. (arXiv:2202.00596v1 [cs.LG])
    In recent times Mechanical and Production industries are facing increasing challenges related to the shift toward sustainable manufacturing. In this article, machining was performed in dry cutting condition with a newly developed coated insert called AlTiSiN coated carbides coated through scalable pulsed power plasma technique in dry cutting condition and a dataset was generated for different machining parameters and output responses. The machining parameters are speed, feed, depth of cut and the output responses are surface roughness, cutting force, crater wear length, crater wear width, and flank wear. The data collected from the machining operation is used for the development of machine learning (ML) based surrogate models to test, evaluate and optimize various input machining parameters. Different ML approaches such as polynomial regression (PR), random forest (RF) regression, gradient boosted (GB) trees, and adaptive boosting (AB) based regression are used to model different output responses in the hard machining of AISI D6 steel. The surrogate models for different output responses are used to prepare a complex objective function for the germinal center algorithm-based optimization of the machining parameters of the hard turning operation.
    Kernelized Multiplicative Weights for 0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and Normal-Form Games. (arXiv:2202.00237v1 [cs.GT])
    While extensive-form games (EFGs) can be converted into normal-form games (NFGs), doing so comes at the cost of an exponential blowup of the strategy space. So, progress on NFGs and EFGs has historically followed separate tracks, with the EFG community often having to catch up with advances (e.g., last-iterate convergence and predictive regret bounds) from the larger NFG community. In this paper we show that the Optimistic Multiplicative Weights Update (OMWU) algorithm -- the premier learning algorithm for NFGs -- can be simulated on the normal-form equivalent of an EFG in linear time per iteration in the game tree size using a kernel trick. The resulting algorithm, Kernelized OMWU (KOMWU), applies more broadly to all convex games whose strategy space is a polytope with 0/1 integral vertices, as long as the kernel can be evaluated efficiently. In the particular case of EFGs, KOMWU closes several standing gaps between NFG and EFG learning, by enabling direct, black-box transfer to EFGs of desirable properties of learning dynamics that were so far known to be achievable only in NFGs. Specifically, KOMWU gives the first algorithm that guarantees at the same time last-iterate convergence, lower dependence on the size of the game tree than all prior algorithms, and $\tilde{\mathcal{O}}(1)$ regret when followed by all players.
    Runtime Monitoring and Statistical Approaches for Correlation Analysis of ECG and PPG. (arXiv:2202.00559v1 [eess.SP])
    Biophysical signals such as Electrocardiogram (ECG) and Photoplethysmogram (PPG) are key to the sensing of vital parameters for wellbeing. Coincidentally, ECG and PPG are signals, which provide a "different window" into the same phenomena, namely the cardiac cycle. While they are used separately, there are no studies regarding the exact correction of the different ECG and PPG events. Such correlation would be helpful in many fronts such as sensor fusion for improved accuracy using cheaper sensors and attack detection and mitigation methods using multiple signals to enhance the robustness, for example. Considering this, we present the first approach in formally establishing the key relationships between ECG and PPG signals. We combine formal run-time monitoring with statistical analysis and regression analysis for our results.
    ISNet: Costless and Implicit Image Segmentation for Deep Classifiers, with Application in COVID-19 Detection. (arXiv:2202.00232v1 [eess.IV])
    In this work we propose a novel deep neural network (DNN) architecture, ISNet, to solve the task of image segmentation followed by classification, substituting the common pipeline of two networks by a single model. We designed the ISNet for high flexibility and performance: it allows virtually any classification neural network architecture to analyze a common image as if it had been previously segmented. Furthermore, in relation to the original classifier, the ISNet does not cause any increment in computational cost or architectural changes at run-time. To accomplish this, we introduce the concept of optimizing DNNs for relevance segmentation in heatmaps created by Layer-wise Relevance Propagation (LRP), which proves to be equivalent to the classification of previously segmented images. We apply an ISNet based on a DenseNet121 classifier to solve the task of COVID-19 detection in chest X-rays. We compare the model to a U-net (performing lung segmentation) followed by a DenseNet121, and to a standalone DenseNet121. Due to the implicit segmentation, the ISNet precisely ignored the X-ray regions outside of the lungs; it achieved 94.5 +/-4.1% mean accuracy with an external database, showing strong generalization capability and surpassing the other models' performances by 6 to 7.9%. ISNet presents a fast and light methodology to perform classification preceded by segmentation, while also being more accurate than standard pipelines.
    Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach. (arXiv:2108.06655v2 [cs.LG] UPDATED)
    We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean--square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a "martingale loss function", whose solution is proved to be the best approximation of the true value function in the mean--square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the "martingale orthogonality conditions" with test functions. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero, and we provide the convergence rate. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications.
    Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning Approach. (arXiv:2202.00063v1 [cs.LG])
    We present BRIEE (Block-structured Representation learning with Interleaved Explore Exploit), an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states. BRIEE interleaves latent states discovery, exploration, and exploitation together, and can provably learn a near-optimal policy with sample complexity scaling polynomially in the number of latent states, actions, and the time horizon, with no dependence on the size of the potentially infinite observation space. Empirically, we show that BRIEE is more sample efficient than the state-of-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems that require deep exploration.
    Regret Minimization with Performative Feedback. (arXiv:2202.00628v1 [cs.LG])
    In performative prediction, the deployment of a predictive model triggers a shift in the data distribution. As these shifts are typically unknown ahead of time, the learner needs to deploy a model to get feedback about the distribution it induces. We study the problem of finding near-optimal models under performativity while maintaining low regret. On the surface, this problem might seem equivalent to a bandit problem. However, it exhibits a fundamentally richer feedback structure that we refer to as performative feedback: after every deployment, the learner receives samples from the shifted distribution rather than only bandit feedback about the reward. Our main contribution is regret bounds that scale only with the complexity of the distribution shifts and not that of the reward function. The key algorithmic idea is careful exploration of the distribution shifts that informs a novel construction of confidence bounds on the risk of unexplored models. The construction only relies on smoothness of the shifts and does not assume convexity. More broadly, our work establishes a conceptual approach for leveraging tools from the bandits literature for the purpose of regret minimization with performative feedback.
    Recycling Model Updates in Federated Learning: Are Gradient Subspaces Low-Rank?. (arXiv:2202.00280v1 [cs.LG])
    In this paper, we question the rationale behind propagating large numbers of parameters through a distributed system during federated learning. We start by examining the rank characteristics of the subspace spanned by gradients across epochs (i.e., the gradient-space) in centralized model training, and observe that this gradient-space often consists of a few leading principal components accounting for an overwhelming majority (95-99%) of the explained variance. Motivated by this, we propose the "Look-back Gradient Multiplier" (LBGM) algorithm, which exploits this low-rank property to enable gradient recycling between model update rounds of federated learning, reducing transmissions of large parameters to single scalars for aggregation. We analytically characterize the convergence behavior of LBGM, revealing the nature of the trade-off between communication savings and model performance. Our subsequent experimental results demonstrate the improvement LBGM obtains in communication overhead compared to conventional federated learning on several datasets and deep learning models. Additionally, we show that LBGM is a general plug-and-play algorithm that can be used standalone or stacked on top of existing sparsification techniques for distributed model training.
    Transformer-based Models of Text Normalization for Speech Applications. (arXiv:2202.00153v1 [cs.LG])
    Text normalization, or the process of transforming text into a consistent, canonical form, is crucial for speech applications such as text-to-speech synthesis (TTS). In TTS, the system must decide whether to verbalize "1995" as "nineteen ninety five" in "born in 1995" or as "one thousand nine hundred ninety five" in "page 1995". We present an experimental comparison of various Transformer-based sequence-to-sequence (seq2seq) models of text normalization for speech and evaluate them on a variety of datasets of written text aligned to its normalized spoken form. These models include variants of the 2-stage RNN-based tagging/seq2seq architecture introduced by Zhang et al. (2019), where we replace the RNN with a Transformer in one or more stages, as well as vanilla Transformers that output string representations of edit sequences. Of our approaches, using Transformers for sentence context encoding within the 2-stage model proved most effective, with the fine-tuned BERT encoder yielding the best performance.
    Deep Reference Priors: What is the best way to pretrain a model?. (arXiv:2202.00187v1 [stat.ML])
    What is the best way to exploit extra data -- be it unlabeled data from the same task, or labeled data from a related task -- to learn a given task? This paper formalizes the question using the theory of reference priors. Reference priors are objective, uninformative Bayesian priors that maximize the mutual information between the task and the weights of the model. Such priors enable the task to maximally affect the Bayesian posterior, e.g., reference priors depend upon the number of samples available for learning the task and for very small sample sizes, the prior puts more probability mass on low-complexity models in the hypothesis space. This paper presents the first demonstration of reference priors for medium-scale deep networks and image-based data. We develop generalizations of reference priors and demonstrate applications to two problems. First, by using unlabeled data to compute the reference prior, we develop new Bayesian semi-supervised learning methods that remain effective even with very few samples per class. Second, by using labeled data from the source task to compute the reference prior, we develop a new pretraining method for transfer learning that allows data from the target task to maximally affect the Bayesian posterior. Empirical validation of these methods is conducted on image classification datasets.
    Leveraging Bitstream Metadata for Fast and Accurate Video Compression Correction. (arXiv:2202.00011v1 [eess.IV])
    Video compression is a central feature of the modern internet powering technologies from social media to video conferencing. While video compression continues to mature, for many, and particularly for extreme, compression settings, quality loss is still noticeable. These extreme settings nevertheless have important applications to the efficient transmission of videos over bandwidth constrained or otherwise unstable connections. In this work, we develop a deep learning architecture capable of restoring detail to compressed videos which leverages the underlying structure and motion information embedded in the video bitstream. We show that this improves restoration accuracy compared to prior compression correction methods and is competitive when compared with recent deep-learning-based video compression methods on rate-distortion while achieving higher throughput.
    Politics and Virality in the Time of Twitter: A Large-Scale Cross-Party Sentiment Analysis in Greece, Spain and United Kingdom. (arXiv:2202.00396v1 [cs.CL])
    Social media has become extremely influential when it comes to policy making in modern societies especially in the western world (e.g., 48% of Europeans use social media every day or almost every day). Platforms such as Twitter allow users to follow politicians, thus making citizens more involved in political discussion. In the same vein, politicians use Twitter to express their opinions, debate among others on current topics and promote their political agenda aiming to influence voter behaviour. Previous studies have shown that tweets conveying negative sentiment are likely to be retweeted more frequently. In this paper, we attempt to analyse tweets from politicians from different countries and explore if their tweets follow the same trend. Utilising state-of-the-art pre-trained language models we performed sentiment analysis on multilingual tweets collected from members of parliament of Greece, Spain and United Kingdom, including devolved administrations. We achieved this by systematically exploring and analysing the differences between influential and less popular tweets. Our analysis indicates that politicians' negatively charged tweets spread more widely, especially in more recent times, and highlights interesting trends in the intersection of sentiment and popularity.
    AutoGeoLabel: Automated Label Generation for Geospatial Machine Learning. (arXiv:2202.00067v1 [eess.IV])
    A key challenge of supervised learning is the availability of human-labeled data. We evaluate a big data processing pipeline to auto-generate labels for remote sensing data. It is based on rasterized statistical features extracted from surveys such as e.g. LiDAR measurements. Using simple combinations of the rasterized statistical layers, it is demonstrated that multiple classes can be generated at accuracies of ~0.9. As proof of concept, we utilize the big geo-data platform IBM PAIRS to dynamically generate such labels in dense urban areas with multiple land cover classes. The general method proposed here is platform independent, and it can be adapted to generate labels for other satellite modalities in order to enable machine learning on overhead imagery for land use classification and object detection.
    Implicit Concept Drift Detection for Multi-label Data Streams. (arXiv:2202.00070v1 [cs.LG])
    Many real-world applications adopt multi-label data streams as the need for algorithms to deal with rapidly changing data increases. Changes in data distribution, also known as concept drift, cause the existing classification models to rapidly lose their effectiveness. To assist the classifiers, we propose a novel algorithm called Label Dependency Drift Detector (LD3), an implicit (unsupervised) concept drift detector using label dependencies within the data for multi-label data streams. Our study exploits the dynamic temporal dependencies between labels using a label influence ranking method, which leverages a data fusion algorithm and uses the produced ranking to detect concept drift. LD3 is the first unsupervised concept drift detection algorithm in the multi-label classification problem area. In this study, we perform an extensive evaluation of LD3 by comparing it with 14 prevalent supervised concept drift detection algorithms that we adapt to the problem area using 12 datasets and a baseline classifier. The results show that LD3 provides between 19.8\% and 68.6\% better predictive performance than comparable detectors on both real-world and synthetic data streams.
    Causal Inference Principles for Reasoning about Commonsense Causality. (arXiv:2202.00436v1 [cs.CL])
    Commonsense causality reasoning (CCR) aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Although being of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies on deep language models wholeheartedly, and is potentially susceptible to confounding co-occurrences. Motivated by classical causal principles, we articulate the central question of CCR and draw parallels between human subjects in observational studies and natural languages to adopt CCR to the potential-outcomes framework, which is the first such attempt for commonsense tasks. We propose a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality, which utilizes temporal signals as incidental supervision, and balances confounding effects using temporal propensities that are analogous to propensity scores. The ROCK implementation is modular and zero-shot, and demonstrates good CCR capabilities on various datasets.
    On solutions of the distributional Bellman equation. (arXiv:2202.00081v1 [stat.ML])
    In distributional reinforcement learning not only expected returns but the complete return distributions of a policy is taken into account. The return distribution for a fixed policy is given as the fixed point of an associated distributional Bellman operator. In this note we consider general distributional Bellman operators and study existence and uniqueness of its fixed points as well as their tail properties. We give necessary and sufficient conditions for existence and uniqueness of return distributions and identify cases of regular variation. We link distributional Bellman equations to multivariate distributional equations of the form $\textbf{X} =_d \textbf{A}\textbf{X} + \textbf{B}$, where $\textbf{X}$ and $\textbf{B}$ are $d$-dimensional random vectors, $\textbf{A}$ a random $d\times d$ matrix and $\textbf{X}$ and $(\textbf{A},\textbf{B})$ are independent. We show that any fixed-point of a distributional Bellman operator can be obtained as the vector of marginal laws of a solution to such a multivariate distributional equation. This makes the general theory of such equations applicable to the distributional reinforcement learning setting.  ( 2 min )
    Exoplanet Characterization using Conditional Invertible Neural Networks. (arXiv:2202.00027v1 [astro-ph.EP])
    The characterization of an exoplanet's interior is an inverse problem, which requires statistical methods such as Bayesian inference in order to be solved. Current methods employ Markov Chain Monte Carlo (MCMC) sampling to infer the posterior probability of planetary structure parameters for a given exoplanet. These methods are time consuming since they require the calculation of a large number of planetary structure models. To speed up the inference process when characterizing an exoplanet, we propose to use conditional invertible neural networks (cINNs) to calculate the posterior probability of the internal structure parameters. cINNs are a special type of neural network which excel in solving inverse problems. We constructed a cINN using FrEIA, which was then trained on a database of $5.6\cdot 10^6$ internal structure models to recover the inverse mapping between internal structure parameters and observable features (i.e., planetary mass, planetary radius and composition of the host star). The cINN method was compared to a Metropolis-Hastings MCMC. For that we repeated the characterization of the exoplanet K2-111 b, using both the MCMC method and the trained cINN. We show that the inferred posterior probability of the internal structure parameters from both methods are very similar, with the biggest differences seen in the exoplanet's water content. Thus cINNs are a possible alternative to the standard time-consuming sampling methods. Indeed, using cINNs allows for orders of magnitude faster inference of an exoplanet's composition than what is possible using an MCMC method, however, it still requires the computation of a large database of internal structures to train the cINN. Since this database is only computed once, we found that using a cINN is more efficient than an MCMC, when more than 10 exoplanets are characterized using the same cINN.
    Exploring layerwise decision making in DNNs. (arXiv:2202.00345v1 [cs.LG])
    While deep neural networks (DNNs) have become a standard architecture for many machine learning tasks, their internal decision-making process and general interpretability is still poorly understood. Conversely, common decision trees are easily interpretable and theoretically well understood. We show that by encoding the discrete sample activation values of nodes as a binary representation, we are able to extract a decision tree explaining the classification procedure of each layer in a ReLU-activated multilayer perceptron (MLP). We then combine these decision trees with existing feature attribution techniques in order to produce an interpretation of each layer of a model. Finally, we provide an analysis of the generated interpretations, the behaviour of the binary encodings and how these relate to sample groupings created during the training process of the neural network.
    Step-size Adaptation Using Exponentiated Gradient Updates. (arXiv:2202.00145v1 [cs.LG])
    Optimizers like Adam and AdaGrad have been very successful in training large-scale neural networks. Yet, the performance of these methods is heavily dependent on a carefully tuned learning rate schedule. We show that in many large-scale applications, augmenting a given optimizer with an adaptive tuning method of the step-size greatly improves the performance. More precisely, we maintain a global step-size scale for the update as well as a gain factor for each coordinate. We adjust the global scale based on the alignment of the average gradient and the current gradient vectors. A similar approach is used for updating the local gain factors. This type of step-size scale tuning has been done before with gradient descent updates. In this paper, we update the step-size scale and the gain variables with exponentiated gradient updates instead. Experimentally, we show that our approach can achieve compelling accuracy on standard models without using any specially tuned learning rate schedule. We also show the effectiveness of our approach for quickly adapting to distribution shifts in the data during training.
    Accelerating Deep Reinforcement Learning for Digital Twin Network Optimization with Evolutionary Strategies. (arXiv:2202.00360v1 [cs.NI])
    The recent growth of emergent network applications (e.g., satellite networks, vehicular networks) is increasing the complexity of managing modern communication networks. As a result, the community proposed the Digital Twin Networks (DTN) as a key enabler of efficient network management. Network operators can leverage the DTN to perform different optimization tasks (e.g., Traffic Engineering, Network Planning). Deep Reinforcement Learning (DRL) showed a high performance when applied to solve network optimization problems. In the context of DTN, DRL can be leveraged to solve optimization problems without directly impacting the real-world network behavior. However, DRL scales poorly with the problem size and complexity. In this paper, we explore the use of Evolutionary Strategies (ES) to train DRL agents for solving a routing optimization problem. The experimental results show that ES achieved a training time speed-up of 128 and 6 for the NSFNET and GEANT2 topologies respectively.  ( 2 min )
    Quantifying Relevance in Learning and Inference. (arXiv:2202.00339v1 [cs.LG])
    Learning is a distinctive feature of intelligent behaviour. High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain or our societies. Yet, the puzzling success of Artificial Intelligence and Machine Learning shows that we still have a poor conceptual understanding of learning. These applications push statistical inference into uncharted territories where data is high-dimensional and scarce, and prior information on "true" models is scant if not totally absent. Here we review recent progress on understanding learning, based on the notion of "relevance". The relevance, as we define it here, quantifies the amount of information that a dataset or the internal representation of a learning machine contains on the generative model of the data. This allows us to define maximally informative samples, on one hand, and optimal learning machines on the other. These are ideal limits of samples and of machines, that contain the maximal amount of information about the unknown generative process, at a given resolution (or level of compression). Both ideal limits exhibit critical features in the statistical sense: Maximally informative samples are characterised by a power-law frequency distribution (statistical criticality) and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (i.e. compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. These are separated by a special point characterised by Zipf's law statistics. This identifies samples obeying Zipf's law as the most compressed loss-less representations that are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests in an exponential degeneracy of energy levels, that leads to unusual thermodynamic properties.  ( 2 min )
    Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints. (arXiv:2202.00150v1 [cs.LG])
    We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We start by designing a policy optimization algorithm with carefully designed action-value estimator and bonus term, and show that for ergodic MDPs, our algorithm ensures $\widetilde{O}(\sqrt{T})$ regret and constant constraint violation, where $T$ is the total number of time steps. This strictly improves over the algorithm of (Singh et al., 2020), whose regret and constraint violation are both $\widetilde{O}(T^{2/3})$. Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with $\widetilde{O}(T^{2/3})$ regret and constraint violation, which can be further improved to $\widetilde{O}(\sqrt{T})$ via a simple modification, albeit making the algorithm computationally inefficient. As far as we know, these are the first set of provable algorithms for weakly communicating MDPs with cost constraints.  ( 2 min )
    Handling Bias in Toxic Speech Detection: A Survey. (arXiv:2202.00126v1 [cs.SI])
    The massive growth of social media usage has witnessed a tsunami of online toxicity in teams of hate speech, abusive posts, cyberbullying, etc. Detecting online toxicity is challenging due to its inherent subjectivity. Factors such as the context of the speech, geography, socio-political climate, and background of the producers and consumers of the posts play a crucial role in determining if the content can be flagged as toxic. Adoption of automated toxicity detection models in production can lead to a sidelining of the various demographic and psychographic groups they aim to help in the first place. It has piqued researchers' interest in examining unintended biases and their mitigation. Due to the nascent and multi-faceted nature of the work, complete literature is chaotic in its terminologies, techniques, and findings. In this paper, we put together a systematic study to discuss the limitations and challenges of existing methods. We start by developing a taxonomy for categorising various unintended biases and a suite of evaluation metrics proposed to quantify such biases. We take a closer look at each proposed method for evaluating and mitigating bias in toxic speech detection. To examine the limitations of existing methods, we also conduct a case study to introduce the concept of bias shift due to knowledge-based bias mitigation methods. The survey concludes with an overview of the critical challenges, research gaps and future directions. While reducing toxicity on online platforms continues to be an active area of research, a systematic study of various biases and their mitigation strategies will help the research community produce robust and fair models.  ( 2 min )
    Active Learning Over Multiple Domains in Natural Language Tasks. (arXiv:2202.00254v1 [cs.CL])
    Studies of active learning traditionally assume the target and source data stem from a single domain. However, in realistic applications, practitioners often require active learning with multiple sources of out-of-distribution data, where it is unclear a priori which data sources will help or hurt the target domain. We survey a wide variety of techniques in active learning (AL), domain shift detection (DS), and multi-domain sampling to examine this challenging setting for question answering and sentiment analysis. We ask (1) what family of methods are effective for this task? And, (2) what properties of selected examples and domains achieve strong results? Among 18 acquisition functions from 4 families of methods, we find H- Divergence methods, and particularly our proposed variant DAL-E, yield effective results, averaging 2-3% improvements over the random baseline. We also show the importance of a diverse allocation of domains, as well as room-for-improvement of existing methods on both domain and example selection. Our findings yield the first comprehensive analysis of both existing and novel methods for practitioners faced with multi-domain active learning for natural language tasks.
    Content addressable memory without catastrophic forgetting by heteroassociation with a fixed scaffold. (arXiv:2202.00159v1 [cs.AI])
    Content-addressable memory (CAM) networks, so-called because stored items can be recalled by partial or corrupted versions of the items, exhibit near-perfect recall of a small number of information-dense patterns below capacity and a `memory cliff' beyond, such that inserting a single additional pattern results in catastrophic forgetting of all stored patterns. We propose a novel ANN architecture, Memory Scaffold with Heteroassociation (MESH), that gracefully trades-off pattern richness with pattern number to generate a CAM continuum without a memory cliff: Small numbers of patterns are stored with complete information recovery matching standard CAMs, while inserting more patterns still results in partial recall of every pattern, with an information per pattern that scales inversely with the number of patterns. Motivated by the architecture of the Entorhinal-Hippocampal memory circuit in the brain, MESH is a tripartite architecture with pairwise interactions that uses a predetermined set of internally stabilized states together with heteroassociation between the internal states and arbitrary external patterns. We show analytically and experimentally that MESH nearly saturates the total information bound (given by the number of synapses) for CAM networks, invariant of the number of stored patterns, outperforming all existing CAM models.  ( 2 min )
    Identifying Interpretable Clinical Subtypes withinHeterogeneous Dementia Clinic Population. (arXiv:2202.00009v1 [cs.LG])
    Dementia is a highly heterogeneous neurodegenerative disorder. Differences in brain pathologies lead to significant variations in the clinical presentation and progression course of patients, increasing the need for individual progression predictions. Unsupervised cluster analysis on a dementia clinic population using the Clinical Dementia Rating (CDR) component scores uncovered subtypes with different risk of dementia progression. The distribution of the CDR components provide validation and interpretability regarding the cognitive characteristics of the identified subtypes.  ( 2 min )
    GNNRank: Learning Global Rankings from Pairwise Comparisons via Directed Graph Neural Networks. (arXiv:2202.00211v1 [cs.LG])
    Recovering global rankings from pairwise comparisons is an important problem with many applications, ranging from time synchronization to sports team ranking. Pairwise comparisons corresponding to matches in a competition can naturally be construed as edges in a directed graph (digraph), whose nodes represent competitors with an unknown rank or skill strength. However, existing methods addressing the rank estimation problem have thus far not utilized powerful neural network architectures to optimize ranking objectives. Hence, we propose to augment an algorithm with neural network, in particular graph neural network (GNN) for its coherence to the problem at hand. In this paper, we introduce GNNRank, a modeling framework that is compatible with any GNN capable of learning digraph embeddings, and we devise trainable objectives to encode ranking upsets/violations. This framework includes a ranking score estimation approach, and adds a useful inductive bias by unfolding the Fiedler vector computation of the graph constructed from a learnable similarity matrix. Experimental results on a wide range of data sets show that our methods attain competitive and often superior performance compared with existing approaches. It also shows promising transfer ability to new data based on the trained GNN model.  ( 2 min )
    Architecture Matters in Continual Learning. (arXiv:2202.00275v1 [cs.LG])
    A large body of research in continual learning is devoted to overcoming the catastrophic forgetting of neural networks by designing new algorithms that are robust to the distribution shifts. However, the majority of these works are strictly focused on the "algorithmic" part of continual learning for a "fixed neural network architecture", and the implications of using different architectures are mostly neglected. Even the few existing continual learning methods that modify the model assume a fixed architecture and aim to develop an algorithm that efficiently uses the model throughout the learning experience. However, in this work, we show that the choice of architecture can significantly impact the continual learning performance, and different architectures lead to different trade-offs between the ability to remember previous tasks and learning new ones. Moreover, we study the impact of various architectural decisions, and our findings entail best practices and recommendations that can improve the continual learning performance.  ( 2 min )
    Learning Fair Representations via Rate-Distortion Maximization. (arXiv:2202.00035v1 [cs.LG])
    Text representations learned by machine learning models often encode undesirable demographic information of the user. Predictive models based on these representations can rely on such information resulting in biased decisions. We present a novel debiasing technique Fairness-aware Rate Maximization (FaRM), that removes demographic information by making representations of instances belonging to the same protected attribute class uncorrelated using the rate-distortion function. FaRM is able to debias representations with or without a target task at hand. FaRM can also be adapted to simultaneously remove information about multiple protected attributes. Empirical evaluations show that FaRM achieves state-of-the-art performance on several datasets, and learned representations leak significantly less protected attribute information against an attack by a non-linear probing network.  ( 2 min )
    SUGAR: Efficient Subgraph-level Training via Resource-aware Graph Partitioning. (arXiv:2202.00075v1 [cs.LG])
    Graph Neural Networks (GNNs) have demonstrated a great potential in a variety of graph-based applications, such as recommender systems, drug discovery, and object recognition. Nevertheless, resource-efficient GNN learning is a rarely explored topic despite its many benefits for edge computing and Internet of Things (IoT) applications. To improve this state of affairs, this work proposes efficient subgraph-level training via resource-aware graph partitioning (SUGAR). SUGAR first partitions the initial graph into a set of disjoint subgraphs and then performs local training at the subgraph-level. We provide a theoretical analysis and conduct extensive experiments on five graph benchmarks to verify its efficacy in practice. Our results show that SUGAR can achieve up to 33 times runtime speedup and 3.8 times memory reduction on large-scale graphs. We believe SUGAR opens a new research direction towards developing GNN methods that are resource-efficient, hence suitable for IoT deployment.  ( 2 min )
    Learning Physics-Consistent Particle Interactions. (arXiv:2202.00299v1 [cs.LG])
    Interacting particle systems play a key role in science and engineering. Access to the governing particle interaction law is fundamental for a complete understanding of such systems. However, the inherent system complexity keeps the particle interaction hidden in many cases. Machine learning methods have the potential to learn the behavior of interacting particle systems by combining experiments with data analysis methods. However, most existing algorithms focus on learning the kinetics at the particle level. Learning pairwise interaction, e.g., pairwise force or pairwise potential energy, remains an open challenge. Here, we propose an algorithm that adapts the Graph Networks framework, which contains an edge part to learn the pairwise interaction and a node part to model the dynamics at particle level. Different from existing approaches that use neural networks in both parts, we design a deterministic operator in the node part. The designed physics operator on the nodes restricts the output space of the edge neural network to be exactly the pairwise interaction. We test the proposed methodology on multiple datasets and demonstrate that it achieves considerably better performance in inferring correctly the pairwise interactions while also being consistent with the underlying physics on all the datasets than existing purely data-driven models. The developed methodology can support a better understanding and discovery of the underlying particle interaction laws, and hence guide the design of materials with targeted properties.  ( 2 min )
    On Polynomial Approximation of Activation Function. (arXiv:2202.00004v1 [cs.LG])
    In this work, we propose an interesting method that aims to approximate an activation function over some domain by polynomials of the presupposing low degree. The main idea behind this method can be seen as an extension of the ordinary least square method and includes the gradient of activation function into the cost function to minimize.  ( 2 min )
    Reinforcement Learning with Heterogeneous Data: Estimation and Inference. (arXiv:2202.00088v1 [cs.LG])
    Reinforcement Learning (RL) has the promise of providing data-driven support for decision-making in a wide range of problems in healthcare, education, business, and other domains. Classical RL methods focus on the mean of the total return and, thus, may provide misleading results in the setting of the heterogeneous populations that commonly underlie large-scale datasets. We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity. We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class. Our auto-clustered algorithms can automatically detect and identify homogeneous sub-populations, while estimating the Q function and the optimal policy for each sub-population. We establish convergence rates and construct confidence intervals for the estimators obtained by the ACPE and ACPI. We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset. The latter analysis shows evidence of value heterogeneity and confirms the advantages of our new method.  ( 2 min )
    Dimensionality Reduction Meets Message Passing for Graph Node Embeddings. (arXiv:2202.00408v1 [cs.LG])
    Graph Neural Networks (GNNs) have become a popular approach for various applications, ranging from social network analysis to modeling chemical properties of molecules. While GNNs often show remarkable performance on public datasets, they can struggle to learn long-range dependencies in the data due to over-smoothing and over-squashing tendencies. To alleviate this challenge, we propose PCAPass, a method which combines Principal Component Analysis (PCA) and message passing for generating node embeddings in an unsupervised manner and leverages gradient boosted decision trees for classification tasks. We show empirically that this approach provides competitive performance compared to popular GNNs on node classification benchmarks, while gathering information from longer distance neighborhoods. Our research demonstrates that applying dimensionality reduction with message passing and skip connections is a promising mechanism for aggregating long-range dependencies in graph structured data.  ( 2 min )
    Graph-based Neural Acceleration for Nonnegative Matrix Factorization. (arXiv:2202.00264v1 [cs.LG])
    We describe a graph-based neural acceleration technique for nonnegative matrix factorization that builds upon a connection between matrices and bipartite graphs that is well-known in certain fields, e.g., sparse linear algebra, but has not yet been exploited to design graph neural networks for matrix computations. We first consider low-rank factorization more broadly and propose a graph representation of the problem suited for graph neural networks. Then, we focus on the task of nonnegative matrix factorization and propose a graph neural network that interleaves bipartite self-attention layers with updates based on the alternating direction method of multipliers. Our empirical evaluation on synthetic and two real-world datasets shows that we attain substantial acceleration, even though we only train in an unsupervised fashion on smaller synthetic instances.  ( 2 min )
    SnAKe: Bayesian Optimization with Pathwise Exploration. (arXiv:2202.00060v1 [cs.LG])
    Bayesian Optimization is a very effective tool for optimizing expensive black-box functions. Inspired by applications developing and characterizing reaction chemistry using droplet microfluidic reactors, we consider a novel setting where the expense of evaluating the function can increase significantly when making large input changes between iterations. We further assume we are working asynchronously, meaning we have to decide on new queries before we finish evaluating previous experiments. This paper investigates the problem and introduces 'Sequential Bayesian Optimization via Adaptive Connecting Samples' (SnAKe), which provides a solution by considering future queries and preemptively building optimization paths that minimize input costs. We investigate some convergence properties and empirically show that the algorithm is able to achieve regret similar to classical Bayesian Optimization algorithms in both the synchronous and asynchronous settings, while reducing the input costs significantly.  ( 2 min )
    Sequential Search with Off-Policy Reinforcement Learning. (arXiv:2202.00245v1 [cs.IR])
    Recent years have seen a significant amount of interests in Sequential Recommendation (SR), which aims to understand and model the sequential user behaviors and the interactions between users and items over time. Surprisingly, despite the huge success Sequential Recommendation has achieved, there is little study on Sequential Search (SS), a twin learning task that takes into account a user's current and past search queries, in addition to behavior on historical query sessions. The SS learning task is even more important than the counterpart SR task for most of E-commence companies due to its much larger online serving demands as well as traffic volume. To this end, we propose a highly scalable hybrid learning model that consists of an RNN learning framework leveraging all features in short-term user-item interactions, and an attention model utilizing selected item-only features from long-term interactions. As a novel optimization step, we fit multiple short user sequences in a single RNN pass within a training batch, by solving a greedy knapsack problem on the fly. Moreover, we explore the use of off-policy reinforcement learning in multi-session personalized search ranking. Specifically, we design a pairwise Deep Deterministic Policy Gradient model that efficiently captures users' long term reward in terms of pairwise classification error. Extensive ablation experiments demonstrate significant improvement each component brings to its state-of-the-art baseline, on a variety of offline and online metrics.  ( 2 min )
    JULIA: Joint Multi-linear and Nonlinear Identification for Tensor Completion. (arXiv:2202.00071v1 [cs.LG])
    Tensor completion aims at imputing missing entries from a partially observed tensor. Existing tensor completion methods often assume either multi-linear or nonlinear relationships between latent components. However, real-world tensors have much more complex patterns where both multi-linear and nonlinear relationships may coexist. In such cases, the existing methods are insufficient to describe the data structure. This paper proposes a Joint mUlti-linear and nonLinear IdentificAtion (JULIA) framework for large-scale tensor completion. JULIA unifies the multi-linear and nonlinear tensor completion models with several advantages over the existing methods: 1) Flexible model selection, i.e., it fits a tensor by assigning its values as a combination of multi-linear and nonlinear components; 2) Compatible with existing nonlinear tensor completion methods; 3) Efficient training based on a well-designed alternating optimization approach. Experiments on six real large-scale tensors demonstrate that JULIA outperforms many existing tensor completion algorithms. Furthermore, JULIA can improve the performance of a class of nonlinear tensor completion methods. The results show that in some large-scale tensor completion scenarios, baseline methods with JULIA are able to obtain up to 55% lower root mean-squared-error and save 67% computational complexity.  ( 2 min )
    Right for the Right Latent Factors: Debiasing Generative Models via Disentanglement. (arXiv:2202.00391v1 [cs.LG])
    A key assumption of most statistical machine learning methods is that they have access to independent samples from the distribution of data they encounter at test time. As such, these methods often perform poorly in the face of biased data, which breaks this assumption. In particular, machine learning models have been shown to exhibit Clever-Hans-like behaviour, meaning that spurious correlations in the training set are inadvertently learnt. A number of works have been proposed to revise deep classifiers to learn the right correlations. However, generative models have been overlooked so far. We observe that generative models are also prone to Clever-Hans-like behaviour. To counteract this issue, we propose to debias generative models by disentangling their internal representations, which is achieved via human feedback. Our experiments show that this is effective at removing bias even when human feedback covers only a small fraction of the desired distribution. In addition, we achieve strong disentanglement results in a quantitative comparison with recent methods.  ( 2 min )
    Federated Active Learning (F-AL): an Efficient Annotation Strategy for Federated Learning. (arXiv:2202.00195v1 [cs.LG])
    Federated learning (FL) has been intensively investigated in terms of communication efficiency, privacy, and fairness. However, efficient annotation, which is a pain point in real-world FL applications, is less studied. In this project, we propose to apply active learning (AL) and sampling strategy into the FL framework to reduce the annotation workload. We expect that the AL and FL can improve the performance of each other complementarily. In our proposed federated active learning (F-AL) method, the clients collaboratively implement the AL to obtain the instances which are considered as informative to FL in a distributed optimization manner. We compare the test accuracies of the global FL models using the conventional random sampling strategy, client-level separate AL (S-AL), and the proposed F-AL. We empirically demonstrate that the F-AL outperforms baseline methods in image classification tasks.
    Return migration of German-affiliated researchers: Analyzing departure and return by gender, cohort, and discipline using Scopus bibliometric data 1996-2020. (arXiv:2110.08340v2 [cs.DL] UPDATED)
    The international migration of researchers is an important dimension of scientific mobility, and has been the subject of considerable policy debate. However, tracking the migration life courses of researchers is challenging due to data limitations. In this study, we use Scopus bibliometric data on eight million publications from 1.1 million researchers who have published at least once with an affiliation address from Germany in 1996-2020. We construct the partial life histories of published researchers in this period and explore both their out-migration and the subsequent return of a subset of this group: the returnees. Our analyses shed light on the career stages and gender disparities between researchers who remain in Germany, those who emigrate, and those who eventually return. We find that the return migration streams are even more gender imbalanced, which points to the need for additional efforts to encourage female researchers to come back to Germany. We document a slightly declining trend in return migration among more recent cohorts of researchers who left Germany, which, for most disciplines, was associated with a decrease in the German collaborative ties of these researchers. Moreover, we find that the gender disparities for the most gender imbalanced disciplines are unlikely to be mitigated by return migration given the gender compositions of the cohorts of researchers who have left Germany and of those who have returned. This analysis uncovers new dimensions of migration among scholars by investigating the return migration of published researchers, which is critical for the development of science policy.  ( 2 min )
    Bundle Networks: Fiber Bundles, Local Trivializations, and a Generative Approach to Exploring Many-to-one Maps. (arXiv:2110.06983v2 [cs.LG] UPDATED)
    Many-to-one maps are ubiquitous in machine learning, from the image recognition model that assigns a multitude of distinct images to the concept of "cat" to the time series forecasting model which assigns a range of distinct time-series to a single scalar regression value. While the primary use of such models is naturally to associate correct output to each input, in many problems it is also useful to be able to explore, understand, and sample from a model's fibers, which are the set of input values $x$ such that $f(x) = y$, for fixed $y$ in the output space. In this paper we show that popular generative architectures are ill-suited to such tasks. Motivated by this we introduce a novel generative architecture, a Bundle Network, based on the concept of a fiber bundle from (differential) topology. BundleNets exploit the idea of a local trivialization wherein a space can be locally decomposed into a product space that cleanly encodes the many-to-one nature of the map. By enforcing this decomposition in BundleNets and by utilizing state-of-the-art invertible components, investigating a network's fibers becomes natural.  ( 2 min )
    Signal Quality Assessment of Photoplethysmogram Signals using Quantum Pattern Recognition and lightweight CNN Architecture. (arXiv:2202.00606v1 [eess.SP])
    Photoplethysmography (PPG) signal comprises physiological information related to cardiorespiratory health. However, while recording, these PPG signals are easily corrupted by motion artifacts and body movements, leading to noise enriched, poor quality signals. Therefore ensuring high-quality signals is necessary to extract cardiorespiratory information accurately. Although there exists several rule-based and Machine-Learning (ML) - based approaches for PPG signal quality estimation, those algorithms' efficacy is questionable. Thus, this work proposes a lightweight CNN architecture for signal quality assessment employing a novel Quantum pattern recognition (QPR) technique. The proposed algorithm is validated on manually annotated data obtained from the University of Queensland database. A total of 28366, 5s signal segments are preprocessed and transformed into image files of 20 x 500 pixels. The image files are treated as an input to the 2D CNN architecture. The developed model classifies the PPG signal as `good' or `bad' with an accuracy of 98.3% with 99.3% sensitivity, 94.5% specificity and 98.9% F1-score. Finally, the performance of the proposed framework is validated against the noisy `Welltory app' collected PPG database. Even in a noisy environment, the proposed architecture proved its competence. Experimental analysis concludes that a slim architecture along with a novel Spatio-temporal pattern recognition technique improve the system's performance. Hence, the proposed approach can be useful to classify good and bad PPG signals for a resource-constrained wearable implementation.  ( 2 min )
    Towards a Theoretical Understanding of Word and Relation Representation. (arXiv:2202.00486v1 [cs.CL])
    Representing words by vectors, or embeddings, enables computational reasoning and is foundational to automating natural language tasks. For example, if word embeddings of similar words contain similar values, word similarity can be readily assessed, whereas judging that from their spelling is often impossible (e.g. cat /feline) and to predetermine and store similarities between all words is prohibitively time-consuming, memory intensive and subjective. We focus on word embeddings learned from text corpora and knowledge graphs. Several well-known algorithms learn word embeddings from text on an unsupervised basis by learning to predict those words that occur around each word, e.g. word2vec and GloVe. Parameters of such word embeddings are known to reflect word co-occurrence statistics, but how they capture semantic meaning has been unclear. Knowledge graph representation models learn representations both of entities (words, people, places, etc.) and relations between them, typically by training a model to predict known facts in a supervised manner. Despite steady improvements in fact prediction accuracy, little is understood of the latent structure that enables this. The limited understanding of how latent semantic structure is encoded in the geometry of word embeddings and knowledge graph representations makes a principled means of improving their performance, reliability or interpretability unclear. To address this: 1. we theoretically justify the empirical observation that particular geometric relationships between word embeddings learned by algorithms such as word2vec and GloVe correspond to semantic relations between words; and 2. we extend this correspondence between semantics and geometry to the entities and relations of knowledge graphs, providing a model for the latent structure of knowledge graph representation linked to that of word embeddings.  ( 2 min )
    Reinforcement Learning for Load-balanced Parallel Particle Tracing. (arXiv:2109.05679v2 [cs.GR] UPDATED)
    We explore an online reinforcement learning (RL) paradigm to dynamically optimize parallel particle tracing performance in distributed-memory systems. Our method combines three novel components: (1) a work donation algorithm, (2) a high-order workload estimation model, and (3) a communication cost model. First, we design an RL-based work donation algorithm. Our algorithm monitors workloads of processes and creates RL agents to donate data blocks and particles from high-workload processes to low-workload processes to minimize program execution time. The agents learn the donation strategy on the fly based on reward and cost functions designed to consider processes' workload changes and data transfer costs of donation actions. Second, we propose a workload estimation model, helping RL agents estimate the workload distribution of processes in future computations. Third, we design a communication cost model that considers both block and particle data exchange costs, helping RL agents make effective decisions with minimized communication costs. We demonstrate that our algorithm adapts to different flow behaviors in large-scale fluid dynamics, ocean, and weather simulation data. Our algorithm improves parallel particle tracing performance in terms of parallel efficiency, load balance, and costs of I/O and communication for evaluations with up to 16,384 processors.  ( 2 min )
    Robust Generalization of Quadratic Neural Networks via Function Identification. (arXiv:2109.10935v2 [cs.LG] UPDATED)
    A key challenge facing deep learning is that neural networks are often not robust to shifts in the underlying data distribution. We study this problem from the perspective of the statistical concept of parameter identification. Generalization bounds from learning theory often assume that the test distribution is close to the training distribution. In contrast, if we can identify the "true" parameters, then the model generalizes to arbitrary distribution shifts. However, neural networks typically have internal symmetries that make parameter identification impossible. We show that we can identify the function represented by a quadratic network even though we cannot identify its parameters; we extend this result to neural networks with ReLU activations. Thus, we can obtain robust generalization bounds for neural networks. We leverage this result to obtain new bounds for contextual bandits and transfer learning with quadratic neural networks. Overall, our results suggest that we can improve robustness of neural networks by designing models that can represent the true data generating process.  ( 2 min )
    Quasi-Newton Methods for Saddle Point Problems and Beyond. (arXiv:2111.02708v3 [math.OC] UPDATED)
    This paper studies quasi-Newton methods for solving strongly-convex-strongly-concave saddle point problems (SPP). We propose greedy and random Broyden family updates for SPP, which have explicit local superlinear convergence rate of ${\mathcal O}\big(\big(1-\frac{1}{n\kappa^2}\big)^{k(k-1)/2}\big)$, where $n$ is dimensions of the problem, $\kappa$ is the condition number and $k$ is the number of iterations. The design and analysis of proposed algorithm are based on estimating the square of indefinite Hessian matrix, which is different from classical quasi-Newton methods in convex optimization. We also present two specific Broyden family algorithms with BFGS-type and SR1-type updates, which enjoy the faster local convergence rate of $\mathcal O\big(\big(1-\frac{1}{n}\big)^{k(k-1)/2}\big)$. Additionally, we extend our algorithms to solve general nonlinear equations and prove it enjoys the similar convergence rate.  ( 2 min )
    A Dynamical System Perspective for Lipschitz Neural Networks. (arXiv:2110.12690v2 [cs.LG] UPDATED)
    The Lipschitz constant of neural networks has been established as a key quantity to enforce the robustness to adversarial examples. In this paper, we tackle the problem of building $1$-Lipschitz Neural Networks. By studying Residual Networks from a continuous time dynamical system perspective, we provide a generic method to build $1$-Lipschitz Neural Networks and show that some previous approaches are special cases of this framework. Then, we extend this reasoning and show that ResNet flows derived from convex potentials define $1$-Lipschitz transformations, that lead us to define the {\em Convex Potential Layer} (CPL). A comprehensive set of experiments on several datasets demonstrates the scalability of our architecture and the benefits as an $\ell_2$-provable defense against adversarial examples.  ( 2 min )
    DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio based on Deep Filtering. (arXiv:2110.05588v2 [eess.AS] UPDATED)
    Complex-valued processing has brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram, while complex masks (CM) are usually preferred over real-valued masks due to their ability to modify the phase. Recent work proposed to use a complex filter instead of a point-wise multiplication with a mask. This allows to incorporate information from previous and future time steps exploiting local correlations within each frequency band. In this work, we propose DeepFilterNet, a two stage speech enhancement framework utilizing deep filtering. First, we enhance the spectral envelope using ERB-scaled gains modeling the human frequency perception. The second stage employs deep filtering to enhance the periodic components of speech. Additionally to taking advantage of perceptual properties of speech, we enforce network sparsity via separable convolutions and extensive grouping in linear and recurrent layers to design a low complexity architecture. We further show that our two stage deep filtering approach outperforms complex masks over a variety of frequency resolutions and latencies and demonstrate convincing performance compared to other state-of-the-art models.  ( 2 min )
    Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?. (arXiv:2012.04477v3 [cs.LG] UPDATED)
    Neural Tangent Kernel (NTK) theory is widely used to study the dynamics of infinitely-wide deep neural networks (DNNs) under gradient descent. But do the results for infinitely-wide networks give us hints about the behavior of real finite-width ones? In this paper, we study empirically when NTK theory is valid in practice for fully-connected ReLU and sigmoid DNNs. We find out that whether a network is in the NTK regime depends on the hyperparameters of random initialization and the network's depth. In particular, NTK theory does not explain the behavior of sufficiently deep networks initialized so that their gradients explode as they propagate through the network's layers: the kernel is random at initialization and changes significantly during training in this case, contrary to NTK theory. On the other hand, in the case of vanishing gradients, DNNs are in the the NTK regime but become untrainable rapidly with depth. We also describe a framework to study generalization properties of DNNs, in particular the variance of network's output function, by means of NTK theory and discuss its limits.  ( 2 min )
    Graph Decoupling Attention Markov Networks for Semi-supervised Graph Node Classification. (arXiv:2104.13718v2 [cs.LG] UPDATED)
    Graph neural networks (GNN) have been ubiquitous in graph node classification tasks. Most of GNN methods update the node embedding iteratively by aggregating its neighbors' information. However, they often suffer from negative disturbance, due to edges connecting nodes with different labels. One approach to alleviate this negative disturbance is to use attention to learn the weights of aggregation, but current attention-based GNNs only consider feature similarity and also suffer from the lack of supervision. In this paper, we consider the label dependency of graph nodes and propose a decoupling attention mechanism to learn both hard and soft attention. The hard attention is learned on labels for a refined graph structure with fewer inter-class edges, so that the aggregation's negative disturbance can be reduced. The soft attention aims to learn the aggregation weights based on features over the refined graph structure to enhance information gains during message passing. Particularly, we formulate our model under the EM framework, and the learned attention is used to guide the label propagation in the M-step and the feature propagation in the E-step, respectively. Extensive experiments are performed on six well-known benchmark graph datasets to verify the effectiveness of the proposed method.  ( 2 min )
    Machine Intelligence-Driven Classification of Cancer Patients-Derived Extracellular Vesicles using Fluorescence Correlation Spectroscopy: Results from a Pilot Study. (arXiv:2202.00495v1 [q-bio.QM])
    Patient-derived extracellular vesicles (EVs) that contains a complex biological cargo is a valuable source of liquid biopsy diagnostics to aid in early detection, cancer screening, and precision nanotherapeutics. In this study, we predicted that coupling cancer patient blood-derived EVs to time-resolved spectroscopy and artificial intelligence (AI) could provide a robust cancer screening and follow-up tools. Methods: Fluorescence correlation spectroscopy (FCS) measurements were performed on 24 blood samples-derived EVs. Blood samples were obtained from 15 cancer patients (presenting 5 different types of cancers), and 9 healthy controls (including patients with benign lesions). The obtained FCS autocorrelation spectra were processed into power spectra using the Fast-Fourier Transform algorithm and subjected to various machine learning algorithms to distinguish cancer spectra from healthy control spectra. Results and Applications: The performance of AdaBoost Random Forest (RF) classifier, support vector machine, and multilayer perceptron, were tested on selected frequencies in the N=118 power spectra. The RF classifier exhibited a 90% classification accuracy and high sensitivity and specificity in distinguishing the FCS power spectra of cancer patients from those of healthy controls. Further, an image convolutional neural network (CNN), ResNet network, and a quantum CNN were assessed on the power spectral images as additional validation tools. All image-based CNNs exhibited a nearly equal classification performance with an accuracy of roughly 82% and reasonably high sensitivity and specificity scores. Our pilot study demonstrates that AI-algorithms coupled to time-resolved FCS power spectra can accurately and differentially classify the complex patient-derived EVs from different cancer samples of distinct tissue subtypes.  ( 2 min )
    Differentially Private Community Detection for Stochastic Block Models. (arXiv:2202.00636v1 [cs.SI])
    The goal of community detection over graphs is to recover underlying labels/attributes of users (e.g., political affiliation) given the connectivity between users (represented by adjacency matrix of a graph). There has been significant recent progress on understanding the fundamental limits of community detection when the graph is generated from a stochastic block model (SBM). Specifically, sharp information theoretic limits and efficient algorithms have been obtained for SBMs as a function of $p$ and $q$, which represent the intra-community and inter-community connection probabilities. In this paper, we study the community detection problem while preserving the privacy of the individual connections (edges) between the vertices. Focusing on the notion of $(\epsilon, \delta)$-edge differential privacy (DP), we seek to understand the fundamental tradeoffs between $(p, q)$, DP budget $(\epsilon, \delta)$, and computational efficiency for exact recovery of the community labels. To this end, we present and analyze the associated information-theoretic tradeoffs for three broad classes of differentially private community recovery mechanisms: a) stability based mechanism; b) sampling based mechanisms; and c) graph perturbation mechanisms. Our main findings are that stability and sampling based mechanisms lead to a superior tradeoff between $(p,q)$ and the privacy budget $(\epsilon, \delta)$; however this comes at the expense of higher computational complexity. On the other hand, albeit low complexity, graph perturbation mechanisms require the privacy budget $\epsilon$ to scale as $\Omega(\log(n))$ for exact recovery. To the best of our knowledge, this is the first work to study the impact of privacy constraints on the fundamental limits for community detection.  ( 2 min )
    Similarity Learning based Few Shot Learning for ECG Time Series Classification. (arXiv:2202.00612v1 [eess.SP])
    Using deep learning models to classify time series data generated from the Internet of Things (IoT) devices requires a large amount of labeled data. However, due to constrained resources available in IoT devices, it is often difficult to accommodate training using large data sets. This paper proposes and demonstrates a Similarity Learning-based Few Shot Learning for ECG arrhythmia classification using Siamese Convolutional Neural Networks. Few shot learning resolves the data scarcity issue by identifying novel classes from very few labeled examples. Few Shot Learning relies first on pretraining the model on a related relatively large database, and then the learning is used for further adaptation towards few examples available per class. Our experiments evaluate the performance accuracy with respect to K (number of instances per class) for ECG time series data classification. The accuracy with 5- shot learning is 92.25% which marginally improves with further increase in K. We also compare the performance of our method against other well-established similarity learning techniques such as Dynamic Time Warping (DTW), Euclidean Distance (ED), and a deep learning model - Long Short Term Memory Fully Convolutional Network (LSTM-FCN) with the same amount of data and conclude that our method outperforms them for a limited dataset size. For K=5, the accuracies obtained are 57%, 54%, 33%, and 92% approximately for ED, DTW, LSTM-FCN, and SCNN, respectively.  ( 2 min )
    Deep Conditional Gaussian Mixture Model for Constrained Clustering. (arXiv:2106.06385v3 [cs.LG] UPDATED)
    Constrained clustering has gained significant attention in the field of machine learning as it can leverage prior information on a growing amount of only partially labeled data. Following recent advances in deep generative models, we propose a novel framework for constrained clustering that is intuitive, interpretable, and can be trained efficiently in the framework of stochastic gradient variational inference. By explicitly integrating domain knowledge in the form of probabilistic relations, our proposed model (DC-GMM) uncovers the underlying distribution of data conditioned on prior clustering preferences, expressed as pairwise constraints. These constraints guide the clustering process towards a desirable partition of the data by indicating which samples should or should not belong to the same cluster. We provide extensive experiments to demonstrate that DC-GMM shows superior clustering performances and robustness compared to state-of-the-art deep constrained clustering methods on a wide range of data sets. We further demonstrate the usefulness of our approach on two challenging real-world applications.  ( 2 min )
    Development of a neural network to recognize standards and features from 3D CAD models. (arXiv:2202.00573v1 [cs.GR])
    Focus of this work is to recognize standards and further features directly from 3D CAD models. For this reason, a neural network was trained to recognize nine classes of machine elements. After the system identified a part as a standard, like a hexagon head screw after the DIN EN ISO 8676, it accesses the geometrical information of the CAD system via the Application Programming Interface (API). In the API, the system searches for necessary information to describe the part appropriately. Based on this information standardized parts can be recognized in detail and supplemented with further information.  ( 2 min )
    ML with HE: Privacy Preserving Machine Learning Inferences for Genome Studies. (arXiv:2110.11446v2 [cs.CR] UPDATED)
    Preserving the privacy and security of big data in the context of cloud computing, while maintaining a certain level of efficiency of its processing remains to be a subject, open for improvement. One of the most popular applications epitomizing said concerns is found to be useful in genome analysis. This work proposes a secure multi-label tumor classification method using homomorphic encryption, whereby two different machine learning algorithms, SVM and XGBoost, are used to classify the encrypted genome data of different tumor types.  ( 2 min )
    Regret and Cumulative Constraint Violation Analysis for Distributed Online Constrained Convex Optimization. (arXiv:2105.00321v2 [math.OC] UPDATED)
    This paper considers the distributed online convex optimization problem with time-varying constraints over a network of agents. This is a sequential decision making problem with two sequences of arbitrarily varying convex loss and constraint functions. At each round, each agent selects a decision from the decision set, and then only a portion of the loss function and a coordinate block of the constraint function at this round are privately revealed to this agent. The goal of the network is to minimize the network-wide loss accumulated over time. Two distributed online algorithms with full-information and bandit feedback are proposed. Both dynamic and static network regret bounds are analyzed for the proposed algorithms, and network cumulative constraint violation is used to measure constraint violation, which excludes the situation that strictly feasible constraints can compensate the effects of violated constraints. In particular, we show that the proposed algorithms achieve $\mathcal{O}(T^{\max\{\kappa,1-\kappa\}})$ static network regret and $\mathcal{O}(T^{1-\kappa/2})$ network cumulative constraint violation, where $T$ is the time horizon and $\kappa\in(0,1)$ is a user-defined trade-off parameter. Moreover, if the loss functions are strongly convex, then the static network regret bound can be reduced to $\mathcal{O}(T^{\kappa})$. Finally, numerical simulations are provided to illustrate the effectiveness of the theoretical results.  ( 2 min )
    Space Meets Time: Local Spacetime Neural Network For Traffic Flow Forecasting. (arXiv:2109.05225v2 [cs.LG] UPDATED)
    Traffic flow forecasting is a crucial task in urban computing. The challenge arises as traffic flows often exhibit intrinsic and latent spatio-temporal correlations that cannot be identified by extracting the spatial and temporal patterns of traffic data separately. We argue that such correlations are universal and play a pivotal role in traffic flow. We put forward {spacetime interval learning} as a paradigm to explicitly capture these correlations through a unified analysis of both spatial and temporal features. Unlike the state-of-the-art methods, which are restricted to a particular road network, we model the universal spatio-temporal correlations that are transferable from cities to cities. To this end, we propose a new spacetime interval learning framework that constructs a local-spacetime context of a traffic sensor comprising the data from its neighbors within close time points. Based on this idea, we introduce local spacetime neural network (STNN), which employs novel spacetime convolution and attention mechanism to learn the universal spatio-temporal correlations. The proposed STNN captures local traffic patterns, which does not depend on a specific network structure. As a result, a trained STNN model can be applied on any unseen traffic networks. We evaluate the proposed STNN on two public real-world traffic datasets and a simulated dataset on dynamic networks. The experiment results show that STNN not only improves prediction accuracy by 4% over state-of-the-art methods, but is also effective in handling the case when the traffic network undergoes dynamic changes as well as the superior generalization capability.  ( 2 min )
    Approximation of Images via Generalized Higher Order Singular Value Decomposition over Finite-dimensional Commutative Semisimple Algebra. (arXiv:2202.00450v1 [cs.LG])
    Low-rank approximation of images via singular value decomposition is well-received in the era of big data. However, singular value decomposition (SVD) is only for order-two data, i.e., matrices. It is necessary to flatten a higher order input into a matrix or break it into a series of order-two slices to tackle higher order data such as multispectral images and videos with the SVD. Higher order singular value decomposition (HOSVD) extends the SVD and can approximate higher order data using sums of a few rank-one components. We consider the problem of generalizing HOSVD over a finite dimensional commutative algebra. This algebra, referred to as a t-algebra, generalizes the field of complex numbers. The elements of the algebra, called t-scalars, are fix-sized arrays of complex numbers. One can generalize matrices and tensors over t-scalars and then extend many canonical matrix and tensor algorithms, including HOSVD, to obtain higher-performance versions. The generalization of HOSVD is called THOSVD. Its performance of approximating multi-way data can be further improved by an alternating algorithm. THOSVD also unifies a wide range of principal component analysis algorithms. To exploit the potential of generalized algorithms using t-scalars for approximating images, we use a pixel neighborhood strategy to convert each pixel to "deeper-order" t-scalar. Experiments on publicly available images show that the generalized algorithm over t-scalars, namely THOSVD, compares favorably with its canonical counterparts.  ( 2 min )
    Fine-grained differentiable physics: a yarn-level model for fabrics. (arXiv:2202.00504v1 [cs.LG])
    Differentiable physics modeling combines physics models with gradient-based learning to provide model explicability and data efficiency. It has been used to learn dynamics, solve inverse problems and facilitate design, and is at its inception of impact. Current successes have concentrated on general physics models such as rigid bodies, deformable sheets, etc., assuming relatively simple structures and forces. Their granularity is intrinsically coarse and therefore incapable of modelling complex physical phenomena. Fine-grained models are still to be developed to incorporate sophisticated material structures and force interactions with gradient-based learning. Following this motivation, we propose a new differentiable fabrics model for composite materials such as cloths, where we dive into the granularity of yarns and model individual yarn physics and yarn-to-yarn interactions. To this end, we propose several differentiable forces, whose counterparts in empirical physics are indifferentiable, to facilitate gradient-based learning. These forces, albeit applied to cloths, are ubiquitous in various physical systems. Through comprehensive evaluation and comparison, we demonstrate our model's explicability in learning meaningful physical parameters, versatility in incorporating complex physical structures and heterogeneous materials, data-efficiency in learning, and high-fidelity in capturing subtle dynamics.  ( 2 min )
    Finding Second-Order Stationary Point for Nonconvex-Strongly-Concave Minimax Problem. (arXiv:2110.04814v2 [math.OC] UPDATED)
    We study the smooth minimax optimization problem $\min_{\bf x}\max_{\bf y} f({\bf x},{\bf y})$, where $f$ is $\ell$-smooth, strongly-concave in ${\bf y}$ but possibly nonconvex in ${\bf x}$. Most of existing works focus on finding the first-order stationary point of the function $f({\bf x},{\bf y})$ or its primal function $P({\bf x})\triangleq \max_{\bf y} f({\bf x},{\bf y})$, but few of them focus on achieving the second-order stationary point. %, which is essential to nonconvex problems. In this paper, we propose a novel approach for minimax optimization, called Minimax Cubic Newton (MCN), which could find an ${\mathcal O}\left(\varepsilon,\kappa^{1.5}\sqrt{\rho\varepsilon}\right)$-second-order stationary point of $P({\bf x})$ with calling ${\mathcal O}\left(\kappa^{1.5}\sqrt{\rho}\varepsilon^{-1.5}\right)$ times of second-order oracles and $\tilde{\mathcal O}\left(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\right)$ times of first-order oracles, where $\kappa$ is the condition number and $\rho$ is the Lipschitz continuous constant for the Hessian of $f({\bf x},{\bf y})$. In addition, we propose an inexact variant of MCN for high-dimensional problems to avoid calling the expensive second-order oracles. Instead, our method solves the cubic sub-problem inexactly via gradient descent and matrix Chebyshev expansion. This strategy still obtains the desired approximate second-order stationary point with high probability but only requires $\tilde{\mathcal O}\left(\kappa^{1.5}\ell\varepsilon^{-2}\right)$ Hessian-vector oracle calls and $\tilde{\mathcal O}\left(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\right)$ first-order oracle calls. To the best of our knowledge, this is the first work that considers the non-asymptotic convergence behavior of finding second-order stationary points for minimax problems without the convex-concave assumptions.  ( 2 min )
    Decentralized Stochastic Variance Reduced Extragradient Method. (arXiv:2202.00509v1 [math.OC])
    This paper studies decentralized convex-concave minimax optimization problems of the form $\min_x\max_y f(x,y) \triangleq\frac{1}{m}\sum_{i=1}^m f_i(x,y)$, where $m$ is the number of agents and each local function can be written as $f_i(x,y)=\frac{1}{n}\sum_{j=1}^n f_{i,j}(x,y)$. We propose a novel decentralized optimization algorithm, called multi-consensus stochastic variance reduced extragradient, which achieves the best known stochastic first-order oracle (SFO) complexity for this problem. Specifically, each agent requires $\mathcal O((n+\kappa\sqrt{n})\log(1/\varepsilon))$ SFO calls for strongly-convex-strongly-concave problem and $\mathcal O((n+\sqrt{n}L/\varepsilon)\log(1/\varepsilon))$ SFO call for general convex-concave problem to achieve $\varepsilon$-accurate solution in expectation, where $\kappa$ is the condition number and $L$ is the smoothness parameter. The numerical experiments show the proposed method performs better than baselines.  ( 2 min )
    Self-Contrastive Learning: An Efficient Supervised Contrastive Framework with Single-view and Sub-network. (arXiv:2106.15499v3 [cs.LG] UPDATED)
    This paper proposes an efficient supervised contrastive learning framework, called Self-Contrastive (SelfCon) learning, that self-contrasts within multiple outputs from the different levels of a multi-exit network. SelfCon learning with a single-view does not require additional augmented samples, which resolves the concerns of multi-viewed batch (e.g., high computational cost and generalization error). Unlike the previous works based on the mutual information (MI) between the multi-views in unsupervised learning, we prove the MI bound for SelfCon loss in a supervised and single-viewed framework. We also empirically analyze that the success of SelfCon learning is related to the regularization effect from the single-view and sub-network. For ImageNet, SelfCon with a single-viewed batch improves accuracy by +0.3\% with 67\% memory and 45\% time of Supervised Contrastive (SupCon) learning, and a simple ensemble of multi-exit outputs boost performance up to +1.4\%.  ( 2 min )
    Out-of-distribution Detection Using Kernel Density Polytopes. (arXiv:2201.13001v1 [cs.LG] CROSS LISTED)
    Any reasonable machine learning (ML) model should not only interpolate efficiently in between the training samples provided (in-distribution region), but also approach the extrapolative or out-of-distribution (OOD) region without being overconfident. Our experiment on human subjects justifies the aforementioned properties for human intelligence as well. Many state-of-the-art algorithms have tried to fix the overconfidence problem of ML models in the OOD region. However, in doing so, they have often impaired the in-distribution performance of the model. Our key insight is that ML models partition the feature space into polytopes and learn constant (random forests) or affine (ReLU networks) functions over those polytopes. This leads to the OOD overconfidence problem for the polytopes which lie in the training data boundary and extend to infinity. To resolve this issue, we propose kernel density methods that fit Gaussian kernel over the polytopes, which are learned using ML models. Specifically, we introduce two variants of kernel density polytopes: Kernel Density Forest (KDF) and Kernel Density Network (KDN) based on random forests and deep networks, respectively. Studies on various simulation settings show that both KDF and KDN achieve uniform confidence over the classes in the OOD region while maintaining good in-distribution accuracy compared to that of their respective parent models.  ( 2 min )
    Memory-based Message Passing: Decoupling the Message for Propogation from Discrimination. (arXiv:2202.00423v1 [cs.LG])
    Message passing is a fundamental procedure for graph neural networks in the field of graph representation learning. Based on the homophily assumption, the current message passing always aggregates features of connected nodes, such as the graph Laplacian smoothing process. However, real-world graphs tend to be noisy and/or non-smooth. The homophily assumption does not always hold, leading to sub-optimal results. A revised message passing method needs to maintain each node's discriminative ability when aggregating the message from neighbors. To this end, we propose a Memory-based Message Passing (MMP) method to decouple the message of each node into a self-embedding part for discrimination and a memory part for propagation. Furthermore, we develop a control mechanism and a decoupling regularization to control the ratio of absorbing and excluding the message in the memory for each node. More importantly, our MMP is a general skill that can work as an additional layer to help improve traditional GNNs performance. Extensive experiments on various datasets with different homophily ratios demonstrate the effectiveness and robustness of the proposed method.  ( 2 min )
    Adversarial Robustness via Fisher-Rao Regularization. (arXiv:2106.06685v2 [cs.LG] UPDATED)
    Adversarial robustness has become a topic of growing interest in machine learning since it was observed that neural networks tend to be brittle. We propose an information-geometric formulation of adversarial defense and introduce FIRE, a new Fisher-Rao regularization for the categorical cross-entropy loss, which is based on the geodesic distance between the softmax outputs corresponding to natural and perturbed input features. Based on the information-geometric properties of the class of softmax distributions, we derive an explicit characterization of the Fisher-Rao Distance (FRD) for the binary and multiclass cases, and draw some interesting properties as well as connections with standard regularization metrics. Furthermore, for a simple linear and Gaussian model, we show that all Pareto-optimal points in the accuracy-robustness region can be reached by FIRE while other state-of-the-art methods fail. Empirically, we evaluate the performance of various classifiers trained with the proposed loss on standard datasets, showing up to a simultaneous 1\% of improvement in terms of clean and robust performances while reducing the training time by 20\% over the best-performing methods.  ( 2 min )
    Addressing the Stability-Plasticity Dilemma via Knowledge-Aware Continual Learning. (arXiv:2110.05329v2 [cs.LG] UPDATED)
    Current methods in continual learning (CL) tend to focus on alleviating catastrophic forgetting of previous tasks. This hinders balancing other CL desiderata such as forward transfer, and memory and computation efficiency. CL desiderata become much more competing when a model operates on class-Incremental Learning (class-IL) without accessing past data since it is prone to ambiguities between old and new classes. In this paper, we present Knowledge-Aware coNtinual learner (KAN) that attempts to study the stability-plasticity dilemma to balance CL desiderata in class-IL. In particular, KAN is a new task-specific components method based on dynamic sparse training that introduces reusability and selective transfer of past knowledge in this class of methods. KAN selectively reuses relevant knowledge while addressing class ambiguity, preserves old knowledge, and utilizes the model capacity efficiently. Experiments show the effectiveness of KAN in providing models with multiple CL desirable properties, outperforming state-of-the-art methods on various challenging benchmarks.  ( 2 min )
    Multi-Task Learning based Convolutional Models with Curriculum Learning for the Anisotropic Reynolds Stress Tensor in Turbulent Duct Flow. (arXiv:2111.00328v2 [physics.flu-dyn] UPDATED)
    The Reynolds-averaged Navier-Stokes (RANS) equations require accurate modeling of the anisotropic Reynolds stress tensor. Traditional closure models, while sophisticated, often only apply to restricted flow configurations. Researchers have started using machine learning approaches to tackle this problem by developing more general closure models informed by data. In this work we build upon recent convolutional neural network architectures used for turbulence modeling and propose a multi-task learning-based fully convolutional neural network that is able to accurately predict the normalized anisotropic Reynolds stress tensor for turbulent duct flows. Furthermore, we also explore the application of curriculum learning to data-driven turbulence modeling.  ( 2 min )
    Maximum Batch Frobenius Norm for Multi-Domain Text Classification. (arXiv:2202.00537v1 [cs.CL])
    Multi-domain text classification (MDTC) has obtained remarkable achievements due to the advent of deep learning. Recently, many endeavors are devoted to applying adversarial learning to extract domain-invariant features to yield state-of-the-art results. However, these methods still face one challenge: transforming original features to be domain-invariant distorts the distributions of the original features, degrading the discriminability of the learned features. To address this issue, we first investigate the structure of the batch classification output matrix and theoretically justify that the discriminability of the learned features has a positive correlation with the Frobenius norm of the batch output matrix. Based on this finding, we propose a maximum batch Frobenius norm (MBF) method to boost the feature discriminability for MDTC. Experiments on two MDTC benchmarks show that our MBF approach can effectively advance the performance of the state-of-the-art.  ( 2 min )
  • Open

    Questions for Flat-Minima Optimization of Modern Neural Networks. (arXiv:2202.00661v1 [cs.LG])
    For training neural networks, flat-minima optimizers that seek to find parameters in neighborhoods having uniformly low loss (flat minima) have been shown to improve upon stochastic and adaptive gradient-based methods. Two methods for finding flat minima stand out: 1. Averaging methods (i.e., Stochastic Weight Averaging, SWA), and 2. Minimax methods (i.e., Sharpness Aware Minimization, SAM). However, despite similar motivations, there has been limited investigation into their properties and no comprehensive comparison between them. In this work, we investigate the loss surfaces from a systematic benchmarking of these approaches across computer vision, natural language processing, and graph learning tasks. This leads us to a hypothesis: since both approaches find flat solutions in orthogonal ways, combining them should improve generalization even further. We verify this improves over either flat-minima approach in 39 out of 42 cases. When it does not, we provide potential explanations. We hope our results across image, graph, and text data will help researchers to improve deep learning optimizers, and practitioners to pinpoint the optimizer for the problem at hand.
    Hierarchical Gaussian Processes with Wasserstein-2 Kernels. (arXiv:2010.14877v2 [stat.ML] UPDATED)
    Stacking Gaussian Processes severely diminishes the model's ability to detect outliers, which when combined with non-zero mean functions, further extrapolates low non-parametric variance to low training data density regions. We propose a hybrid kernel inspired from Varifold theory, operating in both Euclidean and Wasserstein space. We posit that directly taking into account the variance in the computation of Wasserstein-2 distances is of key importance towards maintaining outlier status throughout the hierarchy. We show improved performance on medium and large scale datasets and enhanced out-of-distribution detection on both toy and real data.
    Empirical complexity of comparator-based nearest neighbor descent. (arXiv:2202.00517v1 [cs.LG])
    A Java parallel streams implementation of the $K$-nearest neighbor descent algorithm is presented using a natural statistical termination criterion. Input data consist of a set $S$ of $n$ objects of type V, and a Function>, which enables any $x \in S$ to decide which of $y, z \in S\setminus\{x\}$ is more similar to $x$. Experiments with the Kullback-Leibler divergence Comparator support the prediction that the number of rounds of $K$-nearest neighbor updates need not exceed twice the diameter of the undirected version of a random regular out-degree $K$ digraph on $n$ vertices. Overall complexity was $O(n K^2 \log_K(n))$ in the class of examples studied. When objects are sampled uniformly from a $d$-dimensional simplex, accuracy of the $K$-nearest neighbor approximation is high up to $d = 20$, but declines in higher dimensions, as theory would predict.
    Towards a Theoretical Understanding of Word and Relation Representation. (arXiv:2202.00486v1 [cs.CL])
    Representing words by vectors, or embeddings, enables computational reasoning and is foundational to automating natural language tasks. For example, if word embeddings of similar words contain similar values, word similarity can be readily assessed, whereas judging that from their spelling is often impossible (e.g. cat /feline) and to predetermine and store similarities between all words is prohibitively time-consuming, memory intensive and subjective. We focus on word embeddings learned from text corpora and knowledge graphs. Several well-known algorithms learn word embeddings from text on an unsupervised basis by learning to predict those words that occur around each word, e.g. word2vec and GloVe. Parameters of such word embeddings are known to reflect word co-occurrence statistics, but how they capture semantic meaning has been unclear. Knowledge graph representation models learn representations both of entities (words, people, places, etc.) and relations between them, typically by training a model to predict known facts in a supervised manner. Despite steady improvements in fact prediction accuracy, little is understood of the latent structure that enables this. The limited understanding of how latent semantic structure is encoded in the geometry of word embeddings and knowledge graph representations makes a principled means of improving their performance, reliability or interpretability unclear. To address this: 1. we theoretically justify the empirical observation that particular geometric relationships between word embeddings learned by algorithms such as word2vec and GloVe correspond to semantic relations between words; and 2. we extend this correspondence between semantics and geometry to the entities and relations of knowledge graphs, providing a model for the latent structure of knowledge graph representation linked to that of word embeddings.
    Dimensionality Reduction Meets Message Passing for Graph Node Embeddings. (arXiv:2202.00408v1 [cs.LG])
    Graph Neural Networks (GNNs) have become a popular approach for various applications, ranging from social network analysis to modeling chemical properties of molecules. While GNNs often show remarkable performance on public datasets, they can struggle to learn long-range dependencies in the data due to over-smoothing and over-squashing tendencies. To alleviate this challenge, we propose PCAPass, a method which combines Principal Component Analysis (PCA) and message passing for generating node embeddings in an unsupervised manner and leverages gradient boosted decision trees for classification tasks. We show empirically that this approach provides competitive performance compared to popular GNNs on node classification benchmarks, while gathering information from longer distance neighborhoods. Our research demonstrates that applying dimensionality reduction with message passing and skip connections is a promising mechanism for aggregating long-range dependencies in graph structured data.
    LightSecAgg: Rethinking Secure Aggregation in Federated Learning. (arXiv:2109.14236v2 [cs.LG] UPDATED)
    Secure model aggregation is a key component of federated learning (FL) that aims at protecting the privacy of each user's individual model while allowing for their global aggregation. It can be applied to any aggregation-based FL approach for training a global or personalized model. Model aggregation needs to also be resilient against likely user dropouts in FL systems, making its design substantially more complex. State-of-the-art secure aggregation protocols rely on secret sharing of the random-seeds used for mask generations at the users to enable the reconstruction and cancellation of those belonging to the dropped users. The complexity of such approaches, however, grows substantially with the number of dropped users. We propose a new approach, named LightSecAgg, to overcome this bottleneck by changing the design from "random-seed reconstruction of the dropped users" to "one-shot aggregate-mask reconstruction of the active users via mask encoding/decoding". We show that LightSecAgg achieves the same privacy and dropout-resiliency guarantees as the state-of-the-art protocols while significantly reducing the overhead for resiliency against dropped users. We also demonstrate that, unlike existing schemes, LightSecAgg can be applied to secure aggregation in the asynchronous FL setting. Furthermore, we provide a modular system design and optimized on-device parallelization for scalable implementation, by enabling computational overlapping between model training and on-device encoding, as well as improving the speed of concurrent receiving and sending of chunked masks. We evaluate LightSecAgg via extensive experiments for training diverse models on various datasets in a realistic FL system with large number of users and demonstrate that LightSecAgg significantly reduces the total training time.
    Safe Screening for Logistic Regression with $\ell_0$-$\ell_2$ Regularization. (arXiv:2202.00467v1 [stat.ML])
    In logistic regression, it is often desirable to utilize regularization to promote sparse solutions, particularly for problems with a large number of features compared to available labels. In this paper, we present screening rules that safely remove features from logistic regression with $\ell_0-\ell_2$ regularization before solving the problem. The proposed safe screening rules are based on lower bounds from the Fenchel dual of strong conic relaxations of the logistic regression problem. Numerical experiments with real and synthetic data suggest that a high percentage of the features can be effectively and safely removed apriori, leading to substantial speed-up in the computations.
    Ultra-fast Deep Mixtures of Gaussian Process Experts. (arXiv:2006.13309v2 [cs.LG] UPDATED)
    Mixtures of experts have become an indispensable tool for flexible modelling in a supervised learning context, and sparse Gaussian processes (GP) have shown promise as a leading candidate for the experts in such models. In this article, we propose to design the gating network for selecting the experts from such mixtures of sparse GPs using a deep neural network (DNN). Furthermore, an ultra-fast one pass algorithm called Cluster-Classify-Regress (CCR) is leveraged to approximate the maximum a posteriori (MAP) estimator extremely quickly. This powerful combination of model and algorithm together delivers a novel method which is flexible, robust, and extremely efficient. In particular, the method is able to outperform competing methods in terms of accuracy and uncertainty quantification. The cost is competitive on low-dimensional and small data sets but is significantly lower for higher dimensional and big data sets. Iteratively maximizing the distribution of experts given allocations and allocations given experts does not provide significant improvement, which indicates that the algorithm achieves a good approximation to the local MAP estimator very fast. This insight can be useful also in the context of other mixture of experts models.
    Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration. (arXiv:2202.00076v1 [stat.ML])
    Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy but only have access to a dataset generated by some unknown behavior policy. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. In this paper, we propose the double Fitted PG estimation (FPG) algorithm. FPG can work with an arbitrary policy parameterization, assuming access to a Bellman-complete value function class. In the case of linear value function approximation, we provide a tight finite-sample upper bound on policy gradient estimation error, that is governed by the amount of distribution mismatch measured in feature space. We also establish the asymptotic normality of FPG estimation error with a precise covariance characterization, which is further shown to be statistically optimal with a matching Cramer-Rao lower bound. Empirically, we evaluate the performance of FPG on both policy gradient estimation and policy optimization, using either softmax tabular or ReLU policy networks. Under various metrics, our results show that FPG significantly outperforms existing off-policy PG estimation methods based on importance sampling and variance reduction techniques.
    Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding. (arXiv:2103.04850v6 [cs.LG] UPDATED)
    We study the problem of learning conditional average treatment effects (CATE) from high-dimensional, observational data with unobserved confounders. Unobserved confounders introduce ignorance -- a level of unidentifiability -- about an individual's response to treatment by inducing bias in CATE estimates. We present a new parametric interval estimator suited for high-dimensional data, that estimates a range of possible CATE values when given a predefined bound on the level of hidden confounding. Further, previous interval estimators do not account for ignorance about the CATE associated with samples that may be underrepresented in the original study, or samples that violate the overlap assumption. Our interval estimator also incorporates model uncertainty so that practitioners can be made aware of out-of-distribution data. We prove that our estimator converges to tight bounds on CATE when there may be unobserved confounding, and assess it using semi-synthetic, high-dimensional datasets.
    Progressive Distillation for Fast Sampling of Diffusion Models. (arXiv:2202.00512v1 [cs.LG])
    Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.
    Deconfounded Representation Similarity for Comparison of Neural Networks. (arXiv:2202.00095v1 [stat.ML])
    Similarity metrics such as representational similarity analysis (RSA) and centered kernel alignment (CKA) have been used to compare layer-wise representations between neural networks. However, these metrics are confounded by the population structure of data items in the input space, leading to spuriously high similarity for even completely random neural networks and inconsistent domain relations in transfer learning. We introduce a simple and generally applicable fix to adjust for the confounder with covariate adjustment regression, which retains the intuitive invariance properties of the original similarity measures. We show that deconfounding the similarity metrics increases the resolution of detecting semantically similar neural networks. Moreover, in real-world applications, deconfounding improves the consistency of representation similarities with domain similarities in transfer learning, and increases correlation with out-of-distribution accuracy.
    Out-of-distribution Detection Using Kernel Density Polytopes. (arXiv:2201.13001v1 [cs.LG] CROSS LISTED)
    Any reasonable machine learning (ML) model should not only interpolate efficiently in between the training samples provided (in-distribution region), but also approach the extrapolative or out-of-distribution (OOD) region without being overconfident. Our experiment on human subjects justifies the aforementioned properties for human intelligence as well. Many state-of-the-art algorithms have tried to fix the overconfidence problem of ML models in the OOD region. However, in doing so, they have often impaired the in-distribution performance of the model. Our key insight is that ML models partition the feature space into polytopes and learn constant (random forests) or affine (ReLU networks) functions over those polytopes. This leads to the OOD overconfidence problem for the polytopes which lie in the training data boundary and extend to infinity. To resolve this issue, we propose kernel density methods that fit Gaussian kernel over the polytopes, which are learned using ML models. Specifically, we introduce two variants of kernel density polytopes: Kernel Density Forest (KDF) and Kernel Density Network (KDN) based on random forests and deep networks, respectively. Studies on various simulation settings show that both KDF and KDN achieve uniform confidence over the classes in the OOD region while maintaining good in-distribution accuracy compared to that of their respective parent models.  ( 2 min )
    Robust Generalization of Quadratic Neural Networks via Function Identification. (arXiv:2109.10935v2 [cs.LG] UPDATED)
    A key challenge facing deep learning is that neural networks are often not robust to shifts in the underlying data distribution. We study this problem from the perspective of the statistical concept of parameter identification. Generalization bounds from learning theory often assume that the test distribution is close to the training distribution. In contrast, if we can identify the "true" parameters, then the model generalizes to arbitrary distribution shifts. However, neural networks typically have internal symmetries that make parameter identification impossible. We show that we can identify the function represented by a quadratic network even though we cannot identify its parameters; we extend this result to neural networks with ReLU activations. Thus, we can obtain robust generalization bounds for neural networks. We leverage this result to obtain new bounds for contextual bandits and transfer learning with quadratic neural networks. Overall, our results suggest that we can improve robustness of neural networks by designing models that can represent the true data generating process.  ( 2 min )
    On solutions of the distributional Bellman equation. (arXiv:2202.00081v1 [stat.ML])
    In distributional reinforcement learning not only expected returns but the complete return distributions of a policy is taken into account. The return distribution for a fixed policy is given as the fixed point of an associated distributional Bellman operator. In this note we consider general distributional Bellman operators and study existence and uniqueness of its fixed points as well as their tail properties. We give necessary and sufficient conditions for existence and uniqueness of return distributions and identify cases of regular variation. We link distributional Bellman equations to multivariate distributional equations of the form $\textbf{X} =_d \textbf{A}\textbf{X} + \textbf{B}$, where $\textbf{X}$ and $\textbf{B}$ are $d$-dimensional random vectors, $\textbf{A}$ a random $d\times d$ matrix and $\textbf{X}$ and $(\textbf{A},\textbf{B})$ are independent. We show that any fixed-point of a distributional Bellman operator can be obtained as the vector of marginal laws of a solution to such a multivariate distributional equation. This makes the general theory of such equations applicable to the distributional reinforcement learning setting.  ( 2 min )
    JULIA: Joint Multi-linear and Nonlinear Identification for Tensor Completion. (arXiv:2202.00071v1 [cs.LG])
    Tensor completion aims at imputing missing entries from a partially observed tensor. Existing tensor completion methods often assume either multi-linear or nonlinear relationships between latent components. However, real-world tensors have much more complex patterns where both multi-linear and nonlinear relationships may coexist. In such cases, the existing methods are insufficient to describe the data structure. This paper proposes a Joint mUlti-linear and nonLinear IdentificAtion (JULIA) framework for large-scale tensor completion. JULIA unifies the multi-linear and nonlinear tensor completion models with several advantages over the existing methods: 1) Flexible model selection, i.e., it fits a tensor by assigning its values as a combination of multi-linear and nonlinear components; 2) Compatible with existing nonlinear tensor completion methods; 3) Efficient training based on a well-designed alternating optimization approach. Experiments on six real large-scale tensors demonstrate that JULIA outperforms many existing tensor completion algorithms. Furthermore, JULIA can improve the performance of a class of nonlinear tensor completion methods. The results show that in some large-scale tensor completion scenarios, baseline methods with JULIA are able to obtain up to 55% lower root mean-squared-error and save 67% computational complexity.  ( 2 min )
    Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?. (arXiv:2012.04477v3 [cs.LG] UPDATED)
    Neural Tangent Kernel (NTK) theory is widely used to study the dynamics of infinitely-wide deep neural networks (DNNs) under gradient descent. But do the results for infinitely-wide networks give us hints about the behavior of real finite-width ones? In this paper, we study empirically when NTK theory is valid in practice for fully-connected ReLU and sigmoid DNNs. We find out that whether a network is in the NTK regime depends on the hyperparameters of random initialization and the network's depth. In particular, NTK theory does not explain the behavior of sufficiently deep networks initialized so that their gradients explode as they propagate through the network's layers: the kernel is random at initialization and changes significantly during training in this case, contrary to NTK theory. On the other hand, in the case of vanishing gradients, DNNs are in the the NTK regime but become untrainable rapidly with depth. We also describe a framework to study generalization properties of DNNs, in particular the variance of network's output function, by means of NTK theory and discuss its limits.  ( 2 min )
    Bringing Differential Private SGD to Practice: On the Independence of Gaussian Noise and the Number of Training Rounds. (arXiv:2102.09030v4 [cs.LG] UPDATED)
    In DP-SGD each round communicates a local SGD update which leaks some new information about the underlying local data set to the outside world. In order to provide privacy, Gaussian noise with standard deviation $\sigma$ is added to local SGD updates after performing a clipping operation. We show that for attaining $(\epsilon,\delta)$-differential privacy $\sigma$ can be chosen equal to $\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ for $\epsilon=\Omega(T/N^2)$, where $T$ is the total number of rounds and $N$ is equal to the size of the local data set. In many existing machine learning problems, $N$ is always large and $T=O(N)$. Hence, $\sigma$ becomes "independent" of any $T=O(N)$ choice with $\epsilon=\Omega(1/N)$. This means that our $\sigma$ only depends on $N$ rather than $T$. As shown in our paper, this differential privacy characterization allows one to {\it a-priori} select parameters of DP-SGD based on a fixed privacy budget (in terms of $\epsilon$ and $\delta$) in such a way to optimize the anticipated utility (test accuracy) the most. This ability of planning ahead together with $\sigma$'s independence of $T$ (which allows local gradient computations to be split among as many rounds as needed, even for large $T$ as usually happens in practice) leads to a {\it proactive DP-SGD algorithm} that allows a client to balance its privacy budget with the accuracy of the learned global model based on local test data. We notice that the current state-of-the art differential privacy accountant method based on $f$-DP has a closed form for computing the privacy loss for DP-SGD. However, due to its interpretation complexity, it cannot be used in a simple way to plan ahead. Instead, accountant methods are only used for keeping track of how privacy budget has been spent (after the fact).  ( 3 min )
    Graph-based Neural Acceleration for Nonnegative Matrix Factorization. (arXiv:2202.00264v1 [cs.LG])
    We describe a graph-based neural acceleration technique for nonnegative matrix factorization that builds upon a connection between matrices and bipartite graphs that is well-known in certain fields, e.g., sparse linear algebra, but has not yet been exploited to design graph neural networks for matrix computations. We first consider low-rank factorization more broadly and propose a graph representation of the problem suited for graph neural networks. Then, we focus on the task of nonnegative matrix factorization and propose a graph neural network that interleaves bipartite self-attention layers with updates based on the alternating direction method of multipliers. Our empirical evaluation on synthetic and two real-world datasets shows that we attain substantial acceleration, even though we only train in an unsupervised fashion on smaller synthetic instances.  ( 2 min )
    Recycling Model Updates in Federated Learning: Are Gradient Subspaces Low-Rank?. (arXiv:2202.00280v1 [cs.LG])
    In this paper, we question the rationale behind propagating large numbers of parameters through a distributed system during federated learning. We start by examining the rank characteristics of the subspace spanned by gradients across epochs (i.e., the gradient-space) in centralized model training, and observe that this gradient-space often consists of a few leading principal components accounting for an overwhelming majority (95-99%) of the explained variance. Motivated by this, we propose the "Look-back Gradient Multiplier" (LBGM) algorithm, which exploits this low-rank property to enable gradient recycling between model update rounds of federated learning, reducing transmissions of large parameters to single scalars for aggregation. We analytically characterize the convergence behavior of LBGM, revealing the nature of the trade-off between communication savings and model performance. Our subsequent experimental results demonstrate the improvement LBGM obtains in communication overhead compared to conventional federated learning on several datasets and deep learning models. Additionally, we show that LBGM is a general plug-and-play algorithm that can be used standalone or stacked on top of existing sparsification techniques for distributed model training.  ( 2 min )
    Black-box Bayesian inference for economic agent-based models. (arXiv:2202.00625v1 [econ.EM])
    Simulation models, in particular agent-based models, are gaining popularity in economics. The considerable flexibility they offer, as well as their capacity to reproduce a variety of empirically observed behaviours of complex systems, give them broad appeal, and the increasing availability of cheap computing power has made their use feasible. Yet a widespread adoption in real-world modelling and decision-making scenarios has been hindered by the difficulty of performing parameter estimation for such models. In general, simulation models lack a tractable likelihood function, which precludes a straightforward application of standard statistical inference techniques. Several recent works have sought to address this problem through the application of likelihood-free inference techniques, in which parameter estimates are determined by performing some form of comparison between the observed data and simulation output. However, these approaches are (a) founded on restrictive assumptions, and/or (b) typically require many hundreds of thousands of simulations. These qualities make them unsuitable for large-scale simulations in economics and can cast doubt on the validity of these inference methods in such scenarios. In this paper, we investigate the efficacy of two classes of black-box approximate Bayesian inference methods that have recently drawn significant attention within the probabilistic machine learning community: neural posterior estimation and neural density ratio estimation. We present benchmarking experiments in which we demonstrate that neural network based black-box methods provide state of the art parameter inference for economic simulation models, and crucially are compatible with generic multivariate time-series data. In addition, we suggest appropriate assessment criteria for future benchmarking of approximate Bayesian inference procedures for economic simulation models.  ( 2 min )
    Density estimation on low-dimensional manifolds: an inflation-deflation approach. (arXiv:2105.12152v3 [cs.LG] UPDATED)
    Normalizing Flows (NFs) are universal density estimators based on Neural Networks. However, this universality is limited: the density's support needs to be diffeomorphic to a Euclidean space. In this paper, we propose a novel method to overcome this limitation without sacrificing universality. The proposed method inflates the data manifold by adding noise in the normal space, trains an NF on this inflated manifold, and, finally, deflates the learned density. Our main result provides sufficient conditions on the manifold and the specific choice of noise under which the corresponding estimator is exact. Our method has the same computational complexity as NFs and does not require computing an inverse flow. We also show that, if the embedding dimension is much larger than the manifold dimension, noise in the normal space can be well approximated by Gaussian noise. This allows using our method for approximating arbitrary densities on unknown manifolds provided that the manifold dimension is known.  ( 2 min )
    Finding lost DG: Explaining domain generalization via model complexity. (arXiv:2202.00563v1 [stat.ML])
    The domain generalization (DG) problem setting challenges a model trained on multiple known data distributions to generalise well on unseen data distributions. Due to its practical importance, a large number of methods have been proposed to address this challenge. However much of the work in general purpose DG is heuristically motivated, as the DG problem is hard to model formally; and recent evaluations have cast doubt on existing methods' practical efficacy -- in particular compared to a well tuned empirical risk minimisation baseline. We present a novel learning-theoretic generalisation bound for DG that bounds unseen domain generalisation performance in terms of the model's Rademacher complexity. Based on this, we conjecture that existing methods' efficacy or lack thereof is largely determined by an empirical risk vs predictor complexity trade-off, and demonstrate that their performance variability can be explained in these terms. Algorithmically, this analysis suggests that domain generalisation should be achieved by simply performing regularised ERM with a leave-one-domain-out cross-validation objective. Empirical results on the DomainBed benchmark corroborate this.  ( 2 min )
    Cocoercivity, Smoothness and Bias in Variance-Reduced Stochastic Gradient Methods. (arXiv:1903.09009v3 [math.OC] UPDATED)
    With the purpose of examining biased updates in variance-reduced stochastic gradient methods, we introduce SVAG, a SAG/SAGA-like method with adjustable bias. SVAG is analyzed in a cocoercive root-finding setting, a setting which yields the same results as in the usual smooth convex optimization setting for the ordinary proximal-gradient method. We show that the same is not true for SVAG when biased updates are used. The step-size requirements for when the operators are gradients are significantly less restrictive compared to when they are not. This highlights the need to not rely solely on cocoercivity when analyzing variance-reduced methods meant for optimization. Our analysis either match or improve on previously known convergence conditions for SAG and SAGA. However, in the biased cases they still do not correspond well with practical experiences and we therefore examine the effect of bias numerically on a set of classification problems. The choice of bias seem to primarily affect the early stages of convergence and in most cases the differences vanish in the later stages of convergence. However, the effect of the bias choice is still significant in a couple of cases.  ( 2 min )
    GNNRank: Learning Global Rankings from Pairwise Comparisons via Directed Graph Neural Networks. (arXiv:2202.00211v1 [cs.LG])
    Recovering global rankings from pairwise comparisons is an important problem with many applications, ranging from time synchronization to sports team ranking. Pairwise comparisons corresponding to matches in a competition can naturally be construed as edges in a directed graph (digraph), whose nodes represent competitors with an unknown rank or skill strength. However, existing methods addressing the rank estimation problem have thus far not utilized powerful neural network architectures to optimize ranking objectives. Hence, we propose to augment an algorithm with neural network, in particular graph neural network (GNN) for its coherence to the problem at hand. In this paper, we introduce GNNRank, a modeling framework that is compatible with any GNN capable of learning digraph embeddings, and we devise trainable objectives to encode ranking upsets/violations. This framework includes a ranking score estimation approach, and adds a useful inductive bias by unfolding the Fiedler vector computation of the graph constructed from a learnable similarity matrix. Experimental results on a wide range of data sets show that our methods attain competitive and often superior performance compared with existing approaches. It also shows promising transfer ability to new data based on the trained GNN model.  ( 2 min )
    Is the Performance of My Deep Network Too Good to Be True? A Direct Approach to Estimating the Bayes Error in Binary Classification. (arXiv:2202.00395v1 [cs.LG])
    There is a fundamental limitation in the prediction performance that a machine learning model can achieve due to the inevitable uncertainty of the prediction target. In classification problems, this can be characterized by the Bayes error, which is the best achievable error with any classifier. The Bayes error can be used as a criterion to evaluate classifiers with state-of-the-art performance and can be used to detect test set overfitting. We propose a simple and direct Bayes error estimator, where we just take the mean of the labels that show \emph{uncertainty} of the classes. Our flexible approach enables us to perform Bayes error estimation even for weakly supervised data. In contrast to others, our method is model-free and even instance-free. Moreover, it has no hyperparameters and gives a more accurate estimate of the Bayes error than classifier-based baselines. Experiments using our method suggest that a recently proposed classifier, the Vision Transformer, may have already reached the Bayes error for certain benchmark datasets.  ( 2 min )
    Space Meets Time: Local Spacetime Neural Network For Traffic Flow Forecasting. (arXiv:2109.05225v2 [cs.LG] UPDATED)
    Traffic flow forecasting is a crucial task in urban computing. The challenge arises as traffic flows often exhibit intrinsic and latent spatio-temporal correlations that cannot be identified by extracting the spatial and temporal patterns of traffic data separately. We argue that such correlations are universal and play a pivotal role in traffic flow. We put forward {spacetime interval learning} as a paradigm to explicitly capture these correlations through a unified analysis of both spatial and temporal features. Unlike the state-of-the-art methods, which are restricted to a particular road network, we model the universal spatio-temporal correlations that are transferable from cities to cities. To this end, we propose a new spacetime interval learning framework that constructs a local-spacetime context of a traffic sensor comprising the data from its neighbors within close time points. Based on this idea, we introduce local spacetime neural network (STNN), which employs novel spacetime convolution and attention mechanism to learn the universal spatio-temporal correlations. The proposed STNN captures local traffic patterns, which does not depend on a specific network structure. As a result, a trained STNN model can be applied on any unseen traffic networks. We evaluate the proposed STNN on two public real-world traffic datasets and a simulated dataset on dynamic networks. The experiment results show that STNN not only improves prediction accuracy by 4% over state-of-the-art methods, but is also effective in handling the case when the traffic network undergoes dynamic changes as well as the superior generalization capability.  ( 2 min )
    Regret Minimization with Performative Feedback. (arXiv:2202.00628v1 [cs.LG])
    In performative prediction, the deployment of a predictive model triggers a shift in the data distribution. As these shifts are typically unknown ahead of time, the learner needs to deploy a model to get feedback about the distribution it induces. We study the problem of finding near-optimal models under performativity while maintaining low regret. On the surface, this problem might seem equivalent to a bandit problem. However, it exhibits a fundamentally richer feedback structure that we refer to as performative feedback: after every deployment, the learner receives samples from the shifted distribution rather than only bandit feedback about the reward. Our main contribution is regret bounds that scale only with the complexity of the distribution shifts and not that of the reward function. The key algorithmic idea is careful exploration of the distribution shifts that informs a novel construction of confidence bounds on the risk of unexplored models. The construction only relies on smoothness of the shifts and does not assume convexity. More broadly, our work establishes a conceptual approach for leveraging tools from the bandits literature for the purpose of regret minimization with performative feedback.  ( 2 min )
    Causal-BALD: Deep Bayesian Active Learning of Outcomes to Infer Treatment-Effects from Observational Data. (arXiv:2111.02275v2 [cs.LG] UPDATED)
    Estimating personalized treatment effects from high-dimensional observational data is essential in situations where experimental designs are infeasible, unethical, or expensive. Existing approaches rely on fitting deep models on outcomes observed for treated and control populations. However, when measuring individual outcomes is costly, as is the case of a tumor biopsy, a sample-efficient strategy for acquiring each result is required. Deep Bayesian active learning provides a framework for efficient data acquisition by selecting points with high uncertainty. However, existing methods bias training data acquisition towards regions of non-overlapping support between the treated and control populations. These are not sample-efficient because the treatment effect is not identifiable in such regions. We introduce causal, Bayesian acquisition functions grounded in information theory that bias data acquisition towards regions with overlapping support to maximize sample efficiency for learning personalized treatment effects. We demonstrate the performance of the proposed acquisition strategies on synthetic and semi-synthetic datasets IHDP and CMNIST and their extensions, which aim to simulate common dataset biases and pathologies.  ( 2 min )
    The Neural Testbed: Evaluating Predictive Distributions. (arXiv:2110.04629v2 [cs.LG] UPDATED)
    Posterior predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an opensource library for controlled and principled evaluation of agents that generate such predictions. Crucially, agents are assessed not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a neural-network-based data generating process. Our results indicate that some well-known agents that produce accurate marginal predictions do not fare well with joint predictions. We show that the quality of joint predictions drives performance in downstream decision tasks, and highlight the importance of this observation to the community.  ( 2 min )
    Meta-Learning Hypothesis Spaces for Sequential Decision-making. (arXiv:2202.00602v1 [stat.ML])
    Obtaining reliable, adaptive confidence sets for prediction functions (hypotheses) is a central challenge in sequential decision-making tasks, such as bandits and model-based reinforcement learning. These confidence sets typically rely on prior assumptions on the hypothesis space, e.g., the known kernel of a Reproducing Kernel Hilbert Space (RKHS). Hand-designing such kernels is error prone, and misspecification may lead to poor or unsafe performance. In this work, we propose to meta-learn a kernel from offline data (Meta-KeL). For the case where the unknown kernel is a combination of known base kernels, we develop an estimator based on structured sparsity. Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets that, with increasing amounts of offline data, become as tight as those given the true unknown kernel. We demonstrate our approach on the kernelized bandit problem (a.k.a.~Bayesian optimization), where we establish regret bounds competitive with those given the true kernel. We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.  ( 2 min )
    Datamodels: Predicting Predictions from Training Data. (arXiv:2202.00622v1 [stat.ML])
    We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that for any subset of $S' \subset S$ -- using only information about which examples of $S$ are contained in $S'$ -- predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the potential complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels can successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. Data for this paper (including pre-computed datamodels as well as raw predictions from four million trained deep neural networks) is available at https://github.com/MadryLab/datamodels-data .  ( 2 min )
    Probabilistic Robust Autoencoders for Anomaly Detection. (arXiv:2110.00494v2 [cs.LG] UPDATED)
    Anomalies (or outliers) are prevalent in real-world empirical observations and potentially mask important underlying structures. Accurate identification of anomalous samples is crucial for the success of downstream data analysis tasks. To automatically identify anomalies, we propose Probabilistic Robust AutoEncoder (PRAE). PRAE is designed to simultaneously remove outliers and identify a low-dimensional representation for the inlier samples. We first present the Robust AutoEncoder (RAE) intractable objective as a minimization problem for splitting the data to inlier samples from which a low dimensional representation is learned via an AutoEncoder (AE), and anomalous (outlier) samples that are excluded as they do not fit the low dimensional representation. RAE minimizes the autoencoder's reconstruction error while incorporating as many samples as possible. This could be formulated via regularization by subtracting from the reconstruction term an $\ell_0$ norm counting the number of selected samples. Unfortunately, this leads to an intractable combinatorial problem. Therefore, we propose two probabilistic relaxations of RAE, which are differentiable and alleviate the need for a combinatorial search. We prove that the solution to the PRAE problem is equivalent to the solution of RAE. We use synthetic data to show that PRAE can accurately remove outliers in a wide range of contamination frequencies. Finally, we demonstrate that using PRAE for anomaly detection leads to state-of-the-art results on various benchmark datasets.  ( 2 min )
    Data-driven emergence of convolutional structure in neural networks. (arXiv:2202.00565v1 [cond-mat.dis-nn])
    Exploiting data invariances is crucial for efficient learning in both artificial and biological neural circuits. Understanding how neural networks can discover appropriate representations capable of harnessing the underlying symmetries of their inputs is thus crucial in machine learning and neuroscience. Convolutional neural networks, for example, were designed to exploit translation symmetry and their capabilities triggered the first wave of deep learning successes. However, learning convolutions directly from translation-invariant data with a fully-connected network has so far proven elusive. Here, we show how initially fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs, resulting in localised, space-tiling receptive fields. These receptive fields match the filters of a convolutional network trained on the same task. By carefully designing data models for the visual scene, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs, which has long been recognised as the hallmark of natural images. We provide an analytical and numerical characterisation of the pattern-formation mechanism responsible for this phenomenon in a simple model, which results in an unexpected link between receptive field formation and the tensor decomposition of higher-order input correlations. These results provide a new perspective on the development of low-level feature detectors in various sensory modalities, and pave the way for studying the impact of higher-order statistics on learning in neural networks.  ( 2 min )
    Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning. (arXiv:2106.02584v2 [cs.LG] UPDATED)
    We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.  ( 2 min )
    Tight Accounting in the Shuffle Model of Differential Privacy. (arXiv:2106.00477v3 [cs.CR] UPDATED)
    Shuffle model of differential privacy is a novel distributed privacy model based on a combination of local privacy mechanisms and a secure shuffler. It has been shown that the additional randomisation provided by the shuffler improves privacy bounds compared to the purely local mechanisms. Accounting tight bounds, however, is complicated by the complexity brought by the shuffler. The recently proposed numerical techniques for evaluating $(\varepsilon,\delta)$-differential privacy guarantees have been shown to give tighter bounds than commonly used methods for compositions of various complex mechanisms. In this paper, we show how to obtain accurate bounds for adaptive compositions of general $\varepsilon$-LDP shufflers using the analysis by Feldman et al. (2021) and tight bounds for adaptive compositions of shufflers of $k$-randomised response mechanisms, using the analysis by Balle et al. (2019). We show how to speed up the evaluation of the resulting privacy loss distribution from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$, where $n$ is the number of users, without noticeable change in the resulting $\delta(\varepsilon)$-upper bounds. We also demonstrate looseness of the existing bounds and methods found in the literature, improving previous composition results significantly.  ( 2 min )
    Quantifying Relevance in Learning and Inference. (arXiv:2202.00339v1 [cs.LG])
    Learning is a distinctive feature of intelligent behaviour. High-throughput experimental data and Big Data promise to open new windows on complex systems such as cells, the brain or our societies. Yet, the puzzling success of Artificial Intelligence and Machine Learning shows that we still have a poor conceptual understanding of learning. These applications push statistical inference into uncharted territories where data is high-dimensional and scarce, and prior information on "true" models is scant if not totally absent. Here we review recent progress on understanding learning, based on the notion of "relevance". The relevance, as we define it here, quantifies the amount of information that a dataset or the internal representation of a learning machine contains on the generative model of the data. This allows us to define maximally informative samples, on one hand, and optimal learning machines on the other. These are ideal limits of samples and of machines, that contain the maximal amount of information about the unknown generative process, at a given resolution (or level of compression). Both ideal limits exhibit critical features in the statistical sense: Maximally informative samples are characterised by a power-law frequency distribution (statistical criticality) and optimal learning machines by an anomalously large susceptibility. The trade-off between resolution (i.e. compression) and relevance distinguishes the regime of noisy representations from that of lossy compression. These are separated by a special point characterised by Zipf's law statistics. This identifies samples obeying Zipf's law as the most compressed loss-less representations that are optimal in the sense of maximal relevance. Criticality in optimal learning machines manifests in an exponential degeneracy of energy levels, that leads to unusual thermodynamic properties.  ( 2 min )
    Iterative regularization for low complexity regularizers. (arXiv:2202.00420v1 [math.OC])
    Iterative regularization exploits the implicit bias of an optimization algorithm to regularize ill-posed problems. Constructing algorithms with such built-in regularization mechanisms is a classic challenge in inverse problems but also in modern machine learning, where it provides both a new perspective on algorithms analysis, and significant speed-ups compared to explicit regularization. In this work, we propose and study the first iterative regularization procedure able to handle biases described by non smooth and non strongly convex functionals, prominent in low-complexity regularization. Our approach is based on a primal-dual algorithm of which we analyze convergence and stability properties, even in the case where the original problem is unfeasible. The general results are illustrated considering the special case of sparse recovery with the $\ell_1$ penalty. Our theoretical results are complemented by experiments showing the computational benefits of our approach.  ( 2 min )
    Interactions in information spread: quantification and interpretation using stochastic block models. (arXiv:2004.04552v3 [cs.LG] UPDATED)
    In most real-world applications, it is seldom the case that a given observable evolves independently of its environment. In social networks, users' behavior results from the people they interact with, news in their feed, or trending topics. In natural language, the meaning of phrases emerges from the combination of words. In general medicine, a diagnosis is established on the basis of the interaction of symptoms. Here, we propose a new model, the Interactive Mixed Membership Stochastic Block Model (IMMSBM), which investigates the role of interactions between entities (hashtags, words, memes, etc.) and quantifies their importance within the aforementioned corpora. We find that interactions play an important role in those corpora. In inference tasks, taking them into account leads to average relative changes with respect to non-interactive models of up to 150\% in the probability of an outcome. Furthermore, their role greatly improves the predictive power of the model. Our findings suggest that neglecting interactions when modeling real-world phenomena might lead to incorrect conclusions being drawn.  ( 2 min )
    Deep Reference Priors: What is the best way to pretrain a model?. (arXiv:2202.00187v1 [stat.ML])
    What is the best way to exploit extra data -- be it unlabeled data from the same task, or labeled data from a related task -- to learn a given task? This paper formalizes the question using the theory of reference priors. Reference priors are objective, uninformative Bayesian priors that maximize the mutual information between the task and the weights of the model. Such priors enable the task to maximally affect the Bayesian posterior, e.g., reference priors depend upon the number of samples available for learning the task and for very small sample sizes, the prior puts more probability mass on low-complexity models in the hypothesis space. This paper presents the first demonstration of reference priors for medium-scale deep networks and image-based data. We develop generalizations of reference priors and demonstrate applications to two problems. First, by using unlabeled data to compute the reference prior, we develop new Bayesian semi-supervised learning methods that remain effective even with very few samples per class. Second, by using labeled data from the source task to compute the reference prior, we develop a new pretraining method for transfer learning that allows data from the target task to maximally affect the Bayesian posterior. Empirical validation of these methods is conducted on image classification datasets.  ( 2 min )
    Approximate Bayesian Computation via Classification. (arXiv:2111.11507v2 [stat.ME] UPDATED)
    Approximate Bayesian Computation (ABC) enables statistical inference in complex models whose likelihoods are difficult to calculate but easy to simulate from. ABC constructs a kernel-type approximation to the posterior distribution through an accept/reject mechanism which compares summary statistics of real and simulated data. To obviate the need for summary statistics, we directly compare empirical distributions with a Kullback-Leibler (KL) divergence estimator obtained via classification. In particular, we blend flexible machine learning classifiers within ABC to automate fake/real data comparisons. We consider the traditional accept/reject kernel as well as an exponential weighting scheme which does not require the ABC acceptance threshold. Our theoretical results show that the rate at which our ABC posterior distributions concentrate around the true parameter depends on the estimation error of the classifier. We derive limiting posterior shape results and find that, with a properly scaled exponential kernel, asymptotic normality holds. We demonstrate the usefulness of our approach on simulated examples as well as real data in the context of stock volatility estimation.  ( 2 min )
    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. (arXiv:2202.00293v1 [stat.ML])
    Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.  ( 2 min )

  • Open

    "BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning", Jang et al 2021 {G}
    submitted by /u/gwern [link] [comments]
    [D]DNS: Determinantal Point Process Based Neural Network Sampler for Ensemble Reinforcement Learning
    submitted by /u/schrodingershit [link] [comments]
    Ray[rllib] for research
    I would like to ask about your opinion on using ray[rllib] for research. I'll give you a little background. I'm a Ph.D. student that started last year on the topic of RL and Safe Autonomous Driving. Until now, I was using self-developed basic algorithms to test my agents and environments, but this is long and time-consuming, especially if trying a lot of different algorithms and alternatives. Moreover, I need to use different simulators like Carla or SUMO, depending on the kind of project I'm working on. Finally, I also often try different algorithms modifications, modifying both the NN models used and the algorithm itself. So my question is: do you suggest starting using and learning ray[rllib] or it is not worth it? Why? submitted by /u/T-Style [link] [comments]  ( 2 min )
    Offline Learning using TF_agents
    I have started working on a very small problem in TF_agents where I want to utilize a dataset for offline learning. However, I can't seem to find any documentation on how to create the trajectories without using a Driver in the environment. How does one go about to turn the dataset into a format for the DQN training? submitted by /u/SemenJenner [link] [comments]  ( 1 min )
    What are your workflows (framework, monitoring, plotting tools, data storage, etc.) in the reinforcement learning research?
    submitted by /u/Rich_Beautiful [link] [comments]  ( 1 min )
  • Open

    [P] Sieve: Process 24 hours of video in 10 mins (UPDATE - try it yourself!)
    Hey everyone! I’m one of the creators of Sieve. I posted about it here a while back and thought I'd share that r/MachineLearning can now try it for free :) As a refresher... Sieve can process 24 hours of video in less than 10 minutes, and makes it easy to search video by detected objects / characteristics, motion data, and visual similarity. You can use our models out of the box, or plug-in your own model endpoints into our infrastructure. Every industry from security, to media, supply chain, construction, retail, sports, and agriculture is being transformed by video analytics—but setting up the infrastructure to process video data quickly is difficult. Having to deal with video ingestion pipelines, computer-vision model training, and search functionality is not pretty. Use Cases: https://sievedata.com/use-cases FAQ: https://sievedata.com/faq We’d love to hear what you think about all the updates, and ideas on how we can improve it. Thanks for taking the time to read this, we’re grateful to be posting here :) submitted by /u/happybirthday290 [link] [comments]  ( 1 min )
    [D] Paper digest: Third Time's the Charm? Image and Video Editing with StyleGAN3 - 5-minute paper summary (by Casual GAN Papers)
    Alias-free GAN more commonly known as StyleGAN3, the successor to the legendary StyleGAN2, came out last year, and … Well, and nothing really, despite the initial pique of interest and promising first results, StyleGAN3 did not set the world on fire, and the research community pretty quickly went back to the old but good StyleGAN2 for its well known latent space disentanglement and numerous other killer features, leaving its successor mostly in the shrinkwrap up on the bookshelf as an interesting, yet confusing toy. Now, some 6 months later the team at the Tel-Aviv University, Hebrew University of Jerusalem, and Adobe Research finally released a comprehensive study of StyleGAN3’s applications in popular inversion and editing tasks, its pitfalls, and potential solutions, as well as highlights of the power of the Alias-free generator in tasks, where traditional image generators commonly underperform. Let’s dive in, and learn, shall we? Full summary: https://t.me/casual_gan/253 Blog post: https://www.casualganpapers.com/alias-free-gan-stylegan3-inversion-video-editing/Third-Time-Is-The-Charm-explained.html StyleGAN-3 Video Editing arxiv / code Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    [D] "Tech has a whiteness problem"
    https://twitter.com/dr_nickiw/status/1489008426059456517?t=AahKcHt1yb5GpAzEMYK3jg&s=19 A black feminist Google employee is quiting her job because Google supposedly has a toxic culture. What are your thoughts on this? submitted by /u/tomd_96 [link] [comments]  ( 1 min )
    [D] How did you find your current ML job?
    Hello everyone. I've been a machine learning engineer my whole career. I'm curious to see how the ML community looks for jobs in today's market. View Poll submitted by /u/prateek-infinity [link] [comments]  ( 1 min )
    [D] Applying Clustering on Legacy software applications
    Hi, I recently received a proposition to work on a project that is hypothesized as wanting to apply "ML based Clustering methods" as follows: Using clustering on legacy software - basically about separation of concerns - how do you isolate sections of monolithic applications so you can modernize or replace them a phased way. Are there any good resources for a head start, I fail to understand how one can go about applying Clustering on Legacy software and codes. Any suggestions? submitted by /u/Sufficient_Permit_77 [link] [comments]  ( 1 min )
    [D] beautify 3d echography
    I'm wondering how commercial face beautify apps are doing things like that (the input is a 3d baby echography) ​ left 3d echography - Right filtered image Gynecological private clinics are using apps like facetune2 or picsart to enhance the face of the babies. It's a manual crafted process using different features but in the end, I think it's the result of a Gan Network because I see they added/invent hair or eyebrows (closed eyes of the baby) Is it possible to automatise this process, is it possible to train a Gan like stylegan with ffhq dataset, in example? ​ Thanks submitted by /u/monete2002 [link] [comments]  ( 1 min )
    [N] AlphaCode from DeepMind: a system that can compete at average human level in competitive coding competitions like CodeForces.
    DeepMind has recently introduced AlphaCode that can solve some problems from CodeForces, a website that hosts competitive programming contests. ​ Link: https://deepmind.com/blog/article/Competitive-programming-with-AlphaCode submitted by /u/Serious-East784 [link] [comments]  ( 1 min )
    [P] Active Learning Simplified with Rikai
    Active Learning Simplified with Rikai Hey, there, We (Eto.ai) are building an open-source unstructured ML data (image, video, sensor) framework, called Rikai, which features: SQL-ML capability (Bring your own model, BYOM) Support Pytorch / Tensorflow / Sklearn model inference via SQL Large scale feature engineering and GPU inference via Spark SQL backend Native tensor / numpy SerDe in the dataset Native Pytorch / Tensorflow / Pandas readers Semantic types for CV tasks 2d/3d bounding box, segmentations and etc Jupyter integration for visualization and JDBC integration Semantic types (Image, Video, Bounding box) can be directly rendered in Jupyter notebooks Use Looker / Mode Analytics to visualize the stats of your image and sensor dataset (via JDBC) The goal of this project is to empower everyone to analyze machine learning data at scale. Here is a demo: Active Learning Simplified with Rikai . It demonstrates how simple it is to use uncertainty sampling to achieve equivalent model accuracy while significantly reducing labeling cost. Github repo: https://github.com/eto-ai/rikai Would appreciate your feedbacks! submitted by /u/eddyxu [link] [comments]  ( 1 min )
    [N] IBM Watson is dead, sold for parts.
    ​ Sold to Francisco Partners (private equity) for $1B IBM Sells Some Watson Health Assets for More Than $1 Billion - Bloomberg Watson was billed as the future of healthcare, but failed to deliver on its ambitious promises. "IBM agreed to sell part of its IBM Watson Health business to private equity firm Francisco Partners, scaling back the technology company’s once-lofty ambitions in health care. "The value of the assets being sold, which include extensive and wide-ranging data sets and products, and image software offerings, is more than $1 billion, according to people familiar with the plans. IBM confirmed an earlier Bloomberg report on the sale in a statement on Friday, without disclosing the price." This is encouraging news for those who have sights set on the healthcare industry. Also a lesson for people to focus on smaller-scale products with limited scope. submitted by /u/the_scign [link] [comments]  ( 3 min )
    [N] EleutherAI announces a 20 billion parameter model, GPT-NeoX-20B, with weights being publicly released next week
    GPT-NeoX-20B, a 20 billion parameter model trained using EleutherAI's GPT-NeoX, was announced today. They will publicly release the weights on February 9th, which is a week from now. The model outperforms OpenAI's Curie in a lot of tasks. They have provided some additional info (and benchmarks) in their blog post, at https://blog.eleuther.ai/announcing-20b/. submitted by /u/MonLiH [link] [comments]  ( 3 min )
    Has anybody used smaller cloud providers? [D]
    There are advantages to using AWS, Azure, and GCP, but I'm wondering if anybody has experience with some of the smaller providers. Sometimes I don't need a comprehensive solution, I just need more power than my workstation can provide to train a model. I've stumbled upon several companies offering lower prices for powerful hardware, like DataCrunch.io and Lamba Labs. Are there are other smaller companies worth exploring? Have any of you used one of these providers? submitted by /u/TrueBirch [link] [comments]  ( 1 min )
    [D] data2vec (and DINO): loss goes down and up again
    Hello everyone, I have issues training some self-supervised audio models and thought that maybe Reddit's swarm intelligence might help me here. I try to train the recently published data2vec model on music data (the FMA dataset). I've made some modifications to the feature extractor ConvNet, and reduced the size of the transformer encoder unit, but overall it is similar to the proposed model. Since I am new to training transformer models, I've tried to reproduce the paper's hyperparameters as closely as possible. My batch size is much smaller, so I've adjusted the learning rate accordingly etc.. The first few epochs seem promising. The loss goes down quite quickly and the embedding variance of the teacher and student models increase over time (i.e. the model is not collapsing to a constant representation). However, after a while it abruptly changes and the loss goes up again and stays high over the remaining training (the variances stay high though). I've also experienced a similar issue with another exponential moving average (EMA) model, DINO (the ResNet variant), a while ago. When reducing the learning rate, the effect just happend later. Does anyone have an idea why this might be happening? submitted by /u/Tomsen1410 [link] [comments]  ( 1 min )
    [D] What do you use to build UIs for your projects?
    Are there any easy templates or systems that you use to make your projects more presentable? I'm not much of a UI guy and don't want to spend too much time diving deep into JavaScript to figure out how to put a UI together to make my projects more presentable. Is there anything you all use that doesn't require building too much from scratch? Thanks in advance. submitted by /u/AveliaUsum [link] [comments]  ( 3 min )
    [D] Current State of the Art in Normalizing Flows for Variational Inference?
    What is the current state of the art architecture (which layers) when using Normalizing Flows for Variational inference? submitted by /u/future_gcp_poweruser [link] [comments]
    [D] Pytorch with graph database
    I've been thinking about developing a tool that would help to transform the graph DB data into a Pytorch dataset and use it with a custom model. Are there some red flags that I should have in mind regarding this tight coupling between ML and DB? submitted by /u/InGraphsWeTrust [link] [comments]  ( 1 min )
    [P] Dataset for White Flies and Aphids
    I am working on an object detection project to identify White Flies and Aphids. I have found some dataset but no annotation is available [(xmin, ymin, xmax, ymax) or a mask]. Anyone worked with these pests that can provide some insight about a dataset? submitted by /u/giakou4 [link] [comments]
    [D] Exemplar images of ImageNet data for each class
    Is there any simple way to download the exemplar images of ImageNet data for each class? It would be 1,000 images in total, maybe a little more. submitted by /u/Standard_Tank_9545 [link] [comments]
    [N] Cruise is opening driverless cars to the public in San Francisco
    Cruise, the self-driving company backed by General Motors and Honda, announced a public waitlist for its robotaxi service in San Francisco. It’s a significant step for the company that has previously been beset by delays in its quest to get paying customers into its autonomous ride-hailing vehicles. (https://www.theverge.com/2022/2/1/22912553/cruise-public-waitlist-robotaxi-autonomous-san-francisco) This has sparked some debate about the future of self-driving neural network hardware and machine learning pipeline. (https://news.ycombinator.com/item?id=30167037) submitted by /u/hardmaru [link] [comments]  ( 2 min )
    [R] StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets
    Project website, with a short summary video: https://sites.google.com/view/stylegan-xl/ Paper: https://arxiv.org/abs/2202.00273 Abstract Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN's performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets. In contrast, we find the main limiting factor to be the current training strategy. Following the recently introduced Projected GAN paradigm, we leverage powerful neural network priors and a progressive growing strategy to successfully train the latest StyleGAN3 generator on ImageNet. Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate images at a resolution of 1024² at such a dataset scale. We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object classes. submitted by /u/hardmaru [link] [comments]  ( 1 min )
  • Open

    Paper digest: Third Time's the Charm? Image and Video Editing with StyleGAN3 - 5-minute paper summary (by Casual GAN Papers)
    Alias-free GAN more commonly known as StyleGAN3, the successor to the legendary StyleGAN2, came out last year, and … Well, and nothing really, despite the initial pique of interest and promising first results, StyleGAN3 did not set the world on fire, and the research community pretty quickly went back to the old but good StyleGAN2 for its well known latent space disentanglement and numerous other killer features, leaving its successor mostly in the shrinkwrap up on the bookshelf as an interesting, yet confusing toy. Now, some 6 months later the team at the Tel-Aviv University, Hebrew University of Jerusalem, and Adobe Research finally released a comprehensive study of StyleGAN3’s applications in popular inversion and editing tasks, its pitfalls, and potential solutions, as well as highlights of the power of the Alias-free generator in tasks, where traditional image generators commonly underperform. Let’s dive in, and learn, shall we? Full summary: https://t.me/casual_gan/253 Blog post: https://www.casualganpapers.com/alias-free-gan-stylegan3-inversion-video-editing/Third-Time-Is-The-Charm-explained.html StyleGAN-3 Video Editing arxiv / code Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]  ( 1 min )
    As regional-based convolutional neural networks (R-CNNs) extract imagery from regions of interest (ROI), how are these regions identified?
    As regional-based convolutional neural networks (R-CNNs) extract imagery from regions of interest (ROI), how are these regions identified? I'm assuming that they are identified when you annotate the imagery using a bounding box-creation tool such as LabelImg. Is this how ROIs are determined by the algorithm? submitted by /u/EnvironmentalLemon36 [link] [comments]  ( 1 min )
    [WATCH] Artificial Intelligence, Real Marketing: Nate Clegg at SiGMA Europe 2021 - SiGMA
    submitted by /u/thedyezwfl [link] [comments]
    DeepMind introduces AlphaCode, an AI that rivals the average human in coding competitions.
    submitted by /u/Yasuuuya [link] [comments]  ( 1 min )
    Intentionality, Artificial Intelligence, and the Symbol Grounding Problem (w/ philosophy readings by Searle and Chisholm) — An online discussion on February 17, free and open to all
    submitted by /u/darrenjyc [link] [comments]  ( 1 min )
    Trump's Truth Social Will Use AI To Censor Platform
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Am I going to be rich?
    submitted by /u/Swaggyswaggerson [link] [comments]
    How to read more research papers? Sharing my best tips and tools that simplify my life as an AI research scientist
    submitted by /u/OnlyProggingForFun [link] [comments]
    6 Chatbot Marketing Techniques that will Drive Your Sales
    submitted by /u/mihircontra20 [link] [comments]
    I asked an AI to make multicolor paintings of the sky 🎨🌄
    submitted by /u/notrealAI [link] [comments]  ( 1 min )
    Artificial Nightmares : Pale Lady || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    Announcing the launch of the model copy feature for Amazon Comprehend custom models
    Technology trends and advancements in digital media in the past decade or so have resulted in the proliferation of text-based data. The potential benefits of mining this text to derive insights, both tactical and strategic, is enormous. This is called natural language processing (NLP). You can use NLP, for example, to analyze your product reviews […]  ( 9 min )
    Balance your data for machine learning with Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler is a new capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications by using a visual interface. It contains over 300 built-in data transformations so you can quickly normalize, transform, and combine features without having to write any code. […]  ( 6 min )
    Launch processing jobs with a few clicks using Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications by using a visual interface. Previously, when you created a Data Wrangler data flow, you could choose different export options to easily integrate that data flow into your data processing pipeline. Data Wrangler offers export […]  ( 5 min )
    Prepare and analyze JSON and ORC data with Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler is a new capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications via a visual interface. Data preparation is a crucial step of the ML lifecycle, and Data Wrangler provides an end-to-end solution to import, prepare, transform, featurize, and […]  ( 4 min )
  • Open

    Can Robots Follow Instructions for New Tasks?
    Posted by Chelsea Finn, Research Adviser and Eric Jang, Senior Research Scientist, Robotics at Google People can flexibly maneuver objects in their physical surroundings to accomplish various goals. One of the grand challenges in robotics is to successfully train robots to do the same, i.e., to develop a general-purpose robot capable of performing a multitude of tasks based on arbitrary user commands. Robots that are faced with the real world will also inevitably encounter new user instructions and situations that were not seen during training. Therefore, it is imperative for robots to be trained to perform multiple tasks in a variety of situations and, more importantly, to be capable of solving new tasks as requested by human users, even if the robot was not explicitly trained on those t…  ( 7 min )
  • Open

    Solving (Some) Formal Math Olympiad Problems
    We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO.[1] The prover uses a language model to find proofs of formal statements. Each  ( 8 min )
  • Open

    Using reinforcement learning to identify high-risk states and treatments in healthcare
    As the pandemic overburdens medical facilities and clinicians become increasingly overworked, the ability to make quick decisions on providing the best possible treatment is even more critical. In urgent health situations, such decisions can mean life or death. However, certain treatment protocols can pose a considerable risk to patients who have serious medical conditions and […] The post Using reinforcement learning to identify high-risk states and treatments in healthcare appeared first on Microsoft Research.  ( 12 min )
  • Open

    How Audio Analytic Is Teaching Machines to Listen
    From active noise cancellation to digital assistants that are always listening for your commands, audio is perhaps one of the most important but often overlooked aspects of modern technology in our daily lives. Audio Analytic has been using machine learning that enables a vast array of devices to make sense of the world of sound. Read article > The post How Audio Analytic Is Teaching Machines to Listen appeared first on The Official NVIDIA Blog.  ( 2 min )
  • Open

    imodels: leveraging the unreasonable effectiveness of rules
    imodels: A python package with cutting-edge techniques for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use. Recent machine-learning advances have led to increasingly complex predictive models, often at the cost of interpretability. We often need interpretability, particularly in high-stakes applications such as medicine, biology, and political science (see here and here for an overview). Moreover, interpretable models help with all kinds of things, such as identifying errors, leveraging domain knowledge, and speeding up inference. Despite new advances in formulating/fitting interpretable models, implementations are often difficult to find, use, and compare. imodels (github, paper) fills this gap by providing a simple unified interface and im…  ( 2 min )
  • Open

    learning on arbitrary graph topologies via predictive coding
    Interesting framework. Could this be used to perform any kind of large scale learning? More generally, where could an architecture like this have a real advantage over the standard ones? I'm puzzled, as it seems very nice theoretically, but maybe not able to scale up to larger benchmarks and limited applications? https://arxiv.org/pdf/2201.13180.pdf submitted by /u/Ambitious_Smile_981 [link] [comments]  ( 1 min )
  • Open

    Applications of Artificial Intelligence in Human Resource Management
    Several professionals and companies across the globe today have understood that the adoption of smart technology like Artificial Intelligence (AI) is actively changing workplaces. There are many applications of AI throughout nearly every profession and industry, and human resources careers are no exception. A recent survey conducted by Oracle and Future Workplace found that human… Read More »Applications of Artificial Intelligence in Human Resource Management The post Applications of Artificial Intelligence in Human Resource Management appeared first on Data Science Central.  ( 4 min )
    Data Animation: Much Easier than you Think!
    In this article, I explain how to easily turn data into videos. Data animations add value to your presentations. They also constitute a useful tool in exploratory data analysis. Of course, as in any data visualization, carefully choice of what to put in your video, and how to format it, is critical. The technique described… Read More »Data Animation: Much Easier than you Think! The post Data Animation: Much Easier than you Think! appeared first on Data Science Central.  ( 7 min )
    DSC Weekly Newsletter 01 Feb 2022: What is Web 3.0 and Why Does it Matter?
    In 1993, I was at a conference in Seattle where I stumbled onto someone who was showing this new application called the Mosaic WWW Browser. At the time, I’d been doing programming work in the burgeoning gaming field after having spent some time working in publishing and advertising. The demo was, to be honest, not… Read More »DSC Weekly Newsletter 01 Feb 2022: What is Web 3.0 and Why Does it Matter? The post DSC Weekly Newsletter 01 Feb 2022: What is Web 3.0 and Why Does it Matter? appeared first on Data Science Central.  ( 7 min )
  • Open

    Pre-training Offline RL on Wikipedia; Where to put your robot's eyes
    Hello! Today (February 1st) was my birthday, and I spent a great deal of time pondering how to spend this year. In the end, I had two broad goals for the year: improving my skills and helping others. RL Weekly was something I have always been proud of. It not only gave me a reason to read many research papers but also rewarded me with many emails of appreciation, which I am still incredibly thankful for. Therefore, it was a natural decision to revive the newsletter this year. This issue marks the revival of RL News. Notice the name change from RL Weekly to RL News: the newsletter will have one issue every two weeks. Coincidentally, today is also Lunar New Year. If you celebrate it: Happy Lunar New Year!  ( 3 min )
  • Open

    Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. (arXiv:2111.03289v2 [stat.ML] UPDATED)
    In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(d^{1.5}\sqrt{\sum_{k}^K \sigma_k^2} + d^2)$ where $d$ is the dimension of the features, $K$ is the time horizon, and $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$ where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel elliptical potential `count' lemma. This lemma allows a novel regret analysis in conjunction with the peeling trick, which is of independent interest.
    Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences. (arXiv:2108.11481v2 [physics.data-an] UPDATED)
    We show that density models describing multiple observables with (i) hard boundaries and (ii) dependence on external parameters may be created using an auto-regressive Gaussian mixture model. The model is designed to capture how observable spectra are deformed by hypothesis variations, and is made more expressive by projecting data onto a configurable latent space. It may be used as a statistical model for scientific discovery in interpreting experimental observations, for example when constraining the parameters of a physical model or tuning simulation parameters according to calibration data. The model may also be sampled for use within a Monte Carlo simulation chain, or used to estimate likelihood ratios for event classification. The method is demonstrated on simulated high-energy particle physics data considering the anomalous electroweak production of a $Z$ boson in association with a dijet system at the Large Hadron Collider, and the accuracy of inference is tested using a realistic toy example. The developed methods are domain agnostic; they may be used within any field to perform simulation or inference where a dataset consisting of many real-valued observables has conditional dependence on external parameters.
    Hierarchical Few-Shot Generative Models. (arXiv:2110.12279v2 [cs.LG] UPDATED)
    A few-shot generative model should be able to generate data from a distribution by only observing a limited set of examples. In few-shot learning the model is trained on data from many sets from different distributions sharing some underlying properties such as sets of characters from different alphabets or sets of images of different type objects. We extend current latent variable models for sets to a fully hierarchical approach with an attention-based point to set-level aggregation and call our approach SCHA-VAE for Set-Context-Hierarchical-Aggregation Variational Autoencoder. We explore iterative data sampling, likelihood-based model comparison, and adaptation-free out of distribution generalization. Our results show that the hierarchical formulation better captures the intrinsic variability within the sets in the small data regime. With this work we generalize deep latent variable approaches to few-shot learning, taking a step towards large-scale few-shot generation with a formulation that readily can work with current state-of-the-art deep generative models.
    Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers. (arXiv:2109.10686v2 [cs.CL] UPDATED)
    There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost which have both financial and/or environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. presents a comprehensive study of the scaling behaviour of Transformer language models, the scope is only on the upstream (pretraining) loss. Therefore, it is still unclear if these set of findings transfer to downstream task within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) we show that aside from only the model size, model shape matters for downstream fine-tuning, (2) scaling protocols operate differently at different compute regions, (3) widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50\% fewer parameters and training 40\% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
    Vector Optimization with Stochastic Bandit Feedback. (arXiv:2110.12311v2 [cs.LG] UPDATED)
    We introduce vector optimization problems with stochastic bandit feedback, which extends the best arm identification problem to vector-valued rewards. We consider $K$ designs, with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by $C$. Different than prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study the setting where an evaluation of each design yields a noisy observation of the mean reward vector. Under subgaussian noise assumption, we investigate the sample complexity of the na\"ive elimination algorithm in an ($\epsilon,\delta$)-PAC setting, where the goal is to identify an ($\epsilon,\delta$)-PAC Pareto set with the minimum number of design evaluations. In order to characterize the difficulty of learning the Pareto set, we introduce the concept of ordering complexity, i.e., geometric conditions on the deviations of empirical reward vectors from their mean under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We run experiments to verify our theoretical results and illustrate how $C$ and sampling budget affect the Pareto set, returned ($\epsilon,\delta$)-PAC Pareto set and the success of identification.
    Deep Generative Models in Engineering Design: A Review. (arXiv:2110.10863v2 [cs.LG] UPDATED)
    Automated design synthesis has the potential to revolutionize the modern engineering design process and improve access to highly optimized and customized products across countless industries. Successfully adapting generative Machine Learning to design engineering may enable such automated design synthesis and is a research subject of great importance. We present a review and analysis of Deep Generative Machine Learning models in engineering design. Deep Generative Models (DGMs) typically leverage deep networks to learn from an input dataset and synthesize new designs. Recently, DGMs such as feedforward Neural Networks (NNs), Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and certain Deep Reinforcement Learning (DRL) frameworks have shown promising results in design applications like structural optimization, materials design, and shape synthesis. The prevalence of DGMs in engineering design has skyrocketed since 2016. Anticipating continued growth, we conduct a review of recent advances to benefit researchers interested in DGMs for design. We structure our review as an exposition of the algorithms, datasets, representation methods, and applications commonly used in the current literature. In particular, we discuss key works that have introduced new techniques and methods in DGMs, successfully applied DGMs to a design-related domain, or directly supported the development of DGMs through datasets or auxiliary methods. We further identify key challenges and limitations currently seen in DGMs across design fields, such as design creativity, handling constraints and objectives, and modeling both form and functional performance simultaneously. In our discussion, we identify possible solution pathways as key areas on which to target future work.
    Self-supervised Reinforcement Learning with Independently Controllable Subgoals. (arXiv:2109.04150v2 [cs.LG] UPDATED)
    To successfully tackle challenging manipulation tasks, autonomous agents must learn a diverse set of skills and how to combine them. Recently, self-supervised agents that set their own abstract goals by exploiting the discovered structure in the environment were shown to perform well on many different tasks. In particular, some of them were applied to learn basic manipulation skills in compositional multi-object environments. However, these methods learn skills without taking the dependencies between objects into account. Thus, the learned skills are difficult to combine in realistic environments. We propose a novel self-supervised agent that estimates relations between environment components and uses them to independently control different parts of the environment state. In addition, the estimated relations between objects can be used to decompose a complex goal into a compatible sequence of subgoals. We show that, by using this framework, an agent can efficiently and automatically learn manipulation tasks in multi-object environments with different relations between objects.
    Stochastic Normalizing Flows for Inverse Problems: a Markov Chains Viewpoint. (arXiv:2109.11375v3 [cs.LG] UPDATED)
    To overcome topological constraints and improve the expressiveness of normalizing flow architectures, Wu, K\"ohler and No\'e introduced stochastic normalizing flows which combine deterministic, learnable flow transformations with stochastic sampling methods. In this paper, we consider stochastic normalizing flows from a Markov chain point of view. In particular, we replace transition densities by general Markov kernels and establish proofs via Radon-Nikodym derivatives which allows to incorporate distributions without densities in a sound way. Further, we generalize the results for sampling from posterior distributions as required in inverse problems. The performance of the proposed conditional stochastic normalizing flow is demonstrated by numerical examples.
    Spike-inspired Rank Coding for Fast and Accurate Recurrent Neural Networks. (arXiv:2110.02865v2 [cs.NE] UPDATED)
    Biological spiking neural networks (SNNs) can temporally encode information in their outputs, e.g. in the rank order in which neurons fire, whereas artificial neural networks (ANNs) conventionally do not. As a result, models of SNNs for neuromorphic computing are regarded as potentially more rapid and efficient than ANNs when dealing with temporal input. On the other hand, ANNs are simpler to train, and usually achieve superior performance. Here we show that temporal coding such as rank coding (RC) inspired by SNNs can also be applied to conventional ANNs such as LSTMs, and leads to computational savings and speedups. In our RC for ANNs, we apply backpropagation through time using the standard real-valued activations, but only from a strategically early time step of each sequential input example, decided by a threshold-crossing event. Learning then incorporates naturally also _when_ to produce an output, without other changes to the model or the algorithm. Both the forward and the backward training pass can be significantly shortened by skipping the remaining input sequence after that first event. RC-training also significantly reduces time-to-insight during inference, with a minimal decrease in accuracy. The desired speed-accuracy trade-off is tunable by varying the threshold or a regularization parameter that rewards output entropy. We demonstrate these in two toy problems of sequence classification, and in a temporally-encoded MNIST dataset where our RC model achieves 99.19% accuracy after the first input time-step, outperforming the state of the art in temporal coding with SNNs, as well as in spoken-word classification of Google Speech Commands, outperforming non-RC-trained early inference with LSTMs.
    Almost Optimal Universal Lower Bound for Learning Causal DAGs with Atomic Interventions. (arXiv:2111.05070v3 [cs.LG] UPDATED)
    A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. We prove two main results. The first is a new universal lower bound on the number of atomic interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of atomic interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. The proof of our lower bound is based on the new notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better.
    Collaborative Learning in General Graphs with Limited Memorization: Learnability, Complexity and Reliability. (arXiv:2201.12482v1 [cs.LG])
    We consider K-armed bandit problem in general graphs where agents are arbitrarily connected and each of them has limited memorization and communication bandwidth. The goal is to let each of the agents learn the best arm. Although recent studies show the power of collaboration among the agents in improving the efficacy of learning, it is assumed in these studies that the communication graphs should be complete or well-structured, whereas such an assumption is not always valid in practice. Furthermore, limited memorization and communication bandwidth also restrict the collaborations of the agents, since very few knowledge can be drawn by each agent from its experiences or the ones shared by its peers in this case. Additionally, the agents may be corrupted to share falsified experience, while the resource limit may considerably restrict the reliability of the learning process. To address the above issues, we propose a three-staged collaborative learning algorithm. In each step, the agents share their experience with each other through light-weight random walks in the general graphs, and then make decisions on which arms to pull according to the randomly memorized suggestions. The agents finally update their adoptions (i.e., preferences to the arms) based on the reward feedback of the arm pulling. Our theoretical analysis shows that, by exploiting the limited memorization and communication resources, all the agents eventually learn the best arm with high probability. We also reveal in our theoretical analysis the upper-bound on the number of corrupted agents our algorithm can tolerate. The efficacy of our proposed three-staged collaborative learning algorithm is finally verified by extensive experiments on both synthetic and real datasets.
    Any Variational Autoencoder Can Do Arbitrary Conditioning. (arXiv:2201.12414v1 [cs.LG])
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underly some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables any Variational Autoencoder (VAE) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching achieves performance that is comparable or superior to current state-of-the-art methods for a variety of tasks.
    Zeroth-Order Actor-Critic. (arXiv:2201.12518v1 [cs.LG])
    Zeroth-order optimization methods and policy gradient based first-order methods are two promising alternatives to solve reinforcement learning (RL) problems with complementary advantages. The former work with arbitrary policies, drive state-dependent and temporally-extended exploration, possess robustness-seeking property, but suffer from high sample complexity, while the latter are more sample efficient but restricted to differentiable policies and the learned policies are less robust. We propose Zeroth-Order Actor-Critic algorithm (ZOAC) that unifies these two methods into an on-policy actor-critic architecture to preserve the advantages from both. ZOAC conducts rollouts collection with timestep-wise perturbation in parameter space, first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM) alternately in each iteration. We evaluate our proposed method on a range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms zeroth-order and first-order baseline algorithms.
    Electra: Conditional Generative Model based Predicate-Aware Query Approximation. (arXiv:2201.12420v1 [cs.DB])
    The goal of Approximate Query Processing (AQP) is to provide very fast but "accurate enough" results for costly aggregate queries thereby improving user experience in interactive exploration of large datasets. Recently proposed Machine-Learning based AQP techniques can provide very low latency as query execution only involves model inference as compared to traditional query processing on database clusters. However, with increase in the number of filtering predicates(WHERE clauses), the approximation error significantly increases for these methods. Analysts often use queries with a large number of predicates for insights discovery. Thus, maintaining low approximation error is important to prevent analysts from drawing misleading conclusions. In this paper, we propose ELECTRA, a predicate-aware AQP system that can answer analytics-style queries with a large number of predicates with much smaller approximation errors. ELECTRA uses a conditional generative model that learns the conditional distribution of the data and at runtime generates a small (~1000 rows) but representative sample, on which the query is executed to compute the approximate result. Our evaluations with four different baselines on three real-world datasets show that ELECTRA provides lower AQP error for large number of predicates compared to baselines.
    Learning more skills through optimistic exploration. (arXiv:2107.14226v2 [cs.LG] UPDATED)
    Unsupervised skill learning objectives (Gregor et al., 2016, Eysenbach et al., 2018) allow agents to learn rich repertoires of behavior in the absence of extrinsic rewards. They work by simultaneously training a policy to produce distinguishable latent-conditioned trajectories, and a discriminator to evaluate distinguishability by trying to infer latents from trajectories. The hope is for the agent to explore and master the environment by encouraging each skill (latent) to reliably reach different states. However, an inherent exploration problem lingers: when a novel state is actually encountered, the discriminator will necessarily not have seen enough training data to produce accurate and confident skill classifications, leading to low intrinsic reward for the agent and effective penalization of the sort of exploration needed to actually maximize the objective. To combat this inherent pessimism towards exploration, we derive an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement. Our objective directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples, thus providing an intrinsic reward more tailored to the true objective compared to pseudocount-based methods (Burda et al., 2019). We call this exploration bonus discriminator disagreement intrinsic reward, or DISDAIN. We demonstrate empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and the 57 games of the Atari Suite (from pixels). Thus, we encourage researchers to treat pessimism with DISDAIN.
    Feature-Attending Recurrent Modules for Generalizing Object-Centric Behavior. (arXiv:2112.08369v2 [cs.LG] UPDATED)
    To generalize in object-centric tasks, a reinforcement learning (RL) agent needs to exploit the structure that objects induce. Prior work has either hard-coded object-centric features, used complex object-centric generative models, or updated state using local spatial features. However, these approaches have had limited success in enabling general RL agents. Motivated by this, we introduce ``Feature-Attending Recurrent Modules'' (FARM), an architecture for learning state representations that relies on simple, broadly applicable inductive biases for capturing spatial and temporal regularities. FARM learns a state representation that is distributed across multiple modules that each attend to spatiotemporal features with an expressive feature attention mechanism. This enables FARM to represent diverse object-induced spatial and temporal regularities across subsets of modules. We hypothesize that this enables an RL agent to flexibly recombine its experiences for generalization. We study task suites in both 2D and 3D environments and find that FARM better generalizes compared to competing architectures that leverage attention or multiple modules.
    Pairwise Supervised Contrastive Learning of Sentence Representations. (arXiv:2109.05424v2 [cs.CL] UPDATED)
    Many recent successes in sentence representation learning have been achieved by simply fine-tuning on the Natural Language Inference (NLI) datasets with triplet loss or siamese loss. Nevertheless, they share a common weakness: sentences in a contradiction pair are not necessarily from different semantic categories. Therefore, optimizing the semantic entailment and contradiction reasoning objective alone is inadequate to capture the high-level semantic structure. The drawback is compounded by the fact that the vanilla siamese or triplet losses only learn from individual sentence pairs or triplets, which often suffer from bad local optima. In this paper, we propose PairSupCon, an instance discrimination based approach aiming to bridge semantic entailment and contradiction understanding with high-level categorical concept encoding. We evaluate PairSupCon on various downstream tasks that involve understanding sentence semantics at different granularities. We outperform the previous state-of-the-art method with $10\%$--$13\%$ averaged improvement on eight clustering tasks, and $5\%$--$6\%$ averaged improvement on seven semantic textual similarity (STS) tasks.
    Bounding Training Data Reconstruction in Private (Deep) Learning. (arXiv:2201.12383v1 [cs.LG])
    Differential privacy is widely accepted as the de facto method for preventing data leakage in ML, and conventional wisdom suggests that it offers strong protection against privacy attacks. However, existing semantic guarantees for DP focus on membership inference, which may overestimate the adversary's capabilities and is not applicable when membership status itself is non-sensitive. In this paper, we derive the first semantic guarantees for DP mechanisms against training data reconstruction attacks under a formal threat model. We show that two distinct privacy accounting methods -- Renyi differential privacy and Fisher information leakage -- both offer strong semantic protection against data reconstruction attacks.
    Benchmarking Resource Usage for Efficient Distributed Deep Learning. (arXiv:2201.12423v1 [cs.LG])
    Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. Neural architecture searches, hyperparameter sweeps, and rapid prototyping consume immense resources that can prevent resource-constrained researchers from experimenting with large models and carry considerable environmental impact. As such, it becomes essential to understand how different deep neural networks (DNNs) and training leverage increasing compute and energy resources -- especially specialized computationally-intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks -- natural language processing, computer vision, and chemistry -- on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy-saving mechanisms such as power utilization and GPU clock rate limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various resource and energy-constrained regimes. We fit power law models that describe how training time scales with available compute resources and energy constraints. We anticipate that these findings will help inform and guide high-performance computing providers in optimizing resource utilization, by selectively reducing energy consumption for different deep learning tasks/workflows with minimal impact on training.
    Graph network for learning bi-directional physics. (arXiv:2112.07054v2 [cs.LG] UPDATED)
    In this work, we propose an end-to-end graph network that learns forward and inverse models of particle-based physics using interpretable inductive biases. Physics-informed neural networks are often engineered to solve specific problems through problem-specific regularization and loss functions. Such explicit learning biases the network to learn data specific patterns and may require a change in the loss function or neural network architecture hereby limiting their generalizabiliy. Our graph network is implicitly biased by learning to solve several tasks, thereby sharing representations between tasks in order to learn the forward dynamics as well as infer the probability distribution of unknown particle specific properties. We evaluate our approach on one-step next state prediction tasks across diverse datasets. Our comparison against related data-driven physics learning approaches reveals that our model is able to predict the forward dynamics with at least an order of magnitude higher accuracy. We also show that our approach is able to recover multi-modal probability distributions of unknown physical parameters.
    Near Optimal Stochastic Algorithms for Finite-Sum Unbalanced Convex-Concave Minimax Optimization. (arXiv:2106.01761v2 [math.OC] UPDATED)
    This paper considers stochastic first-order algorithms for convex-concave minimax problems of the form $\min_{\bf x}\max_{\bf y}f(\bf x, \bf y)$, where $f$ can be presented by the average of $n$ individual components which are $L$-average smooth. For $\mu_x$-strongly-convex-$\mu_y$-strongly-concave setting, we propose a new method which could find a $\varepsilon$-saddle point of the problem in $\tilde{\mathcal O} \big(\sqrt{n(\sqrt{n}+\kappa_x)(\sqrt{n}+\kappa_y)}\log(1/\varepsilon)\big)$ stochastic first-order complexity, where $\kappa_x\triangleq L/\mu_x$ and $\kappa_y\triangleq L/\mu_y$. This upper bound is near optimal with respect to $\varepsilon$, $n$, $\kappa_x$ and $\kappa_y$ simultaneously. In addition, the algorithm is easily implemented and works well in practical. Our methods can be extended to solve more general unbalanced convex-concave minimax problems and the corresponding upper complexity bounds are also near optimal.
    Deep Learning Methods for Abstract Visual Reasoning: A Survey on Raven's Progressive Matrices. (arXiv:2201.12382v1 [cs.AI])
    Abstract visual reasoning (AVR) domain encompasses problems solving which requires the ability to reason about relations among entities present in a given scene. While humans, generally, solve AVR tasks in a ``natural'' way, even without prior experience, this type of problems has proven difficult for current machine learning systems. The paper summarises recent progress in applying deep learning methods to solving AVR problems, as a proxy for studying machine intelligence. We focus on the most common type of AVR tasks -- the Raven's Progressive Matrices (RPMs) -- and provide a comprehensive review of the learning methods and deep neural models applied to solve RPMs, as well as, the RPM benchmark sets. Performance analysis of the state-of-the-art approaches to solving RPMs leads to formulation of certain insights and remarks on the current and future trends in this area. We conclude the paper by demonstrating how real-world problems can benefit from the discoveries of RPM studies.
    ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise. (arXiv:2201.12469v1 [cs.LG])
    In recent years, large pre-trained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks. To train these models with increasing sizes, many neural network practitioners attempt to increase the batch sizes in order to leverage multiple GPUs to improve training speed. However, increasing the batch size often makes the optimization more difficult, leading to slow convergence or poor generalization that can require orders of magnitude more training time to achieve the same model quality. In this paper, we explore the steepness of the loss landscape of large-batch optimization for adapting pre-trained Transformer-based language models to domain-specific tasks and find that it tends to be highly complex and irregular, posing challenges to generalization on downstream tasks. To tackle this challenge, we propose ScaLA, a novel and efficient method to accelerate the adaptation speed of pre-trained transformer networks. Different from prior methods, we take a sequential game-theoretic approach by adding lightweight adversarial noise into large-batch optimization, which significantly improves adaptation speed while preserving model generalization. Experiment results show that ScaLA attains 2.7--9.8$\times$ adaptation speedups over the baseline for GLUE on BERT-base and RoBERTa-large, while achieving comparable and sometimes higher accuracy than the state-of-the-art large-batch optimization methods. Finally, we also address the theoretical aspect of large-batch optimization with adversarial noise and provide a theoretical convergence rate analysis for ScaLA using techniques for analyzing non-convex saddle-point problems.
    Theoretically Motivated Data Augmentation and Regularization for Portfolio Construction. (arXiv:2106.04114v2 [cs.LG] UPDATED)
    The task we consider is portfolio construction in a speculative market, a fundamental problem in modern finance. While various empirical works now exist to explore deep learning in finance, the theory side is almost non-existent. In this work, we focus on developing a theoretical framework for understanding the use of data augmentation for deep-learning-based approaches to quantitative finance. The proposed theory clarifies the role and necessity of data augmentation for finance; moreover, our theory implies that a simple algorithm of injecting a random noise of strength $\sqrt{|r_{t-1}|}$ to the observed return $r_{t}$ is better than not injecting any noise and a few other financially irrelevant data augmentation techniques.
    Meta-CPR: Generalize to Unseen Large Number of Agents with Communication Pattern Recognition Module. (arXiv:2112.07222v3 [cs.LG] UPDATED)
    Designing an effective communication mechanism among agents in reinforcement learning has been a challenging task, especially for real-world applications. The number of agents can grow or an environment sometimes needs to interact with a changing number of agents in real-world scenarios. To this end, a multi-agent framework needs to handle various scenarios of agents, in terms of both scales and dynamics, for being practical to real-world applications. We formulate the multi-agent environment with a different number of agents as a multi-tasking problem and propose a meta reinforcement learning (meta-RL) framework to tackle this problem. The proposed framework employs a meta-learned Communication Pattern Recognition (CPR) module to identify communication behavior and extract information that facilitates the training process. Experimental results are poised to demonstrate that the proposed framework (a) generalizes to an unseen larger number of agents and (b) allows the number of agents to change between episodes. The ablation study is also provided to reason the proposed CPR design and show such design is effective.
    Do You Need the Entropy Reward (in Practice)?. (arXiv:2201.12434v1 [cs.LG])
    Maximum entropy (MaxEnt) RL maximizes a combination of the original task reward and an entropy reward. It is believed that the regularization imposed by entropy, on both policy improvement and policy evaluation, together contributes to good exploration, training convergence, and robustness of learned policies. This paper takes a closer look at entropy as an intrinsic reward, by conducting various ablation studies on soft actor-critic (SAC), a popular representative of MaxEnt RL. Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation. On one hand, the entropy reward, like any other intrinsic reward, could obscure the main task reward if it is not properly managed. We identify some failure cases of the entropy reward especially in episodic Markov decision processes (MDPs), where it could cause the policy to be overly optimistic or pessimistic. On the other hand, our large-scale empirical study shows that using entropy regularization alone in policy improvement, leads to comparable or even better performance and robustness than using it in both policy improvement and policy evaluation. Based on these observations, we recommend either normalizing the entropy reward to a zero mean (SACZero), or simply removing it from policy evaluation (SACLite) for better practical results.
    Robust Imitation Learning from Corrupted Demonstrations. (arXiv:2201.12594v1 [cs.LG])
    We consider offline Imitation Learning from corrupted demonstrations where a constant fraction of data can be noise or even arbitrary outliers. Classical approaches such as Behavior Cloning assumes that demonstrations are collected by an presumably optimal expert, hence may fail drastically when learning from corrupted demonstrations. We propose a novel robust algorithm by minimizing a Median-of-Means (MOM) objective which guarantees the accurate estimation of policy, even in the presence of constant fraction of outliers. Our theoretical analysis shows that our robust method in the corrupted setting enjoys nearly the same error scaling and sample complexity guarantees as the classical Behavior Cloning in the expert demonstration setting. Our experiments on continuous-control benchmarks validate that our method exhibits the predicted robustness and effectiveness, and achieves competitive results compared to existing imitation learning methods.
    Planning and Learning with Adaptive Lookahead. (arXiv:2201.12403v1 [cs.LG])
    The classical Policy Iteration (PI) algorithm alternates between greedy one-step policy improvement and policy evaluation. Recent literature shows that multi-step lookahead policy improvement leads to a better convergence rate at the expense of increased complexity per iteration. However, prior to running the algorithm, one cannot tell what is the best fixed lookahead horizon. Moreover, per a given run, using a lookahead of horizon larger than one is often wasteful. In this work, we propose for the first time to dynamically adapt the multi-step lookahead horizon as a function of the state and of the value estimate. We devise two PI variants and analyze the trade-off between iteration count and computational complexity per iteration. The first variant takes the desired contraction factor as the objective and minimizes the per-iteration complexity. The second variant takes as input the computational complexity per iteration and minimizes the overall contraction factor. We then devise a corresponding DQN-based algorithm with an adaptive tree search horizon. We also include a novel enhancement for on-policy learning: per-depth value function estimator. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and in Atari.
    Enabling Long-Term Cooperation in Cross-Silo Federated Learning: A Repeated Game Perspective. (arXiv:2106.11814v2 [cs.LG] UPDATED)
    Cross-silo federated learning (FL) is a distributed learning approach where clients of the same interest train a global model cooperatively while keeping their local data private. The success of a cross-silo FL process requires active participation of many clients. Clients in cross-silo FL aim to optimize their long-term benefits by selfishly choosing their participation levels. While there has been some work on incentivizing clients to join FL, the analysis of clients' long-term selfish participation behaviors in cross-silo FL remains largely unexplored. In this paper, we analyze the selfish participation behaviors of heterogeneous clients in cross-silo FL. Specifically, we model clients' long-term selfish participation behaviors as an infinitely repeated game. For the stage game SPFL, we derive the unique Nash equilibrium (NE), and propose a distributed algorithm for each client to calculate its equilibrium participation strategy. We show that at the NE, clients fall into at most three categories: (i) free riders, (ii) a unique partial contributor (if exists), and (iii) contributors. For the long-term interactions among clients, we derive a cooperative strategy for clients which minimizes the number of free riders while increasing the amount of local data for model training. We show that enforced by a punishment strategy, such a cooperative strategy is a subgame perfect Nash equilibrium (SPNE) of the infinitely repeated game, under which some clients who are free riders at the NE of the stage game choose to be (partial) contributors. We further propose an algorithm to calculate the optimal SPNE which minimizes the number of free riders while maximizing the amount of local data for model training. Simulation results show that our derived optimal SPNE can effectively reduce the number of free riders by up to 99.3% and increase the amount of local data for model training by up to 82.3%.
    Retroformer: Pushing the Limits of Interpretable End-to-end Retrosynthesis Transformer. (arXiv:2201.12475v1 [physics.chem-ph])
    Retrosynthesis prediction is one of the fundamental challenges in organic synthesis. The task is to predict the reactants given a core product. With the advancement of machine learning, computer-aided synthesis planning has gained increasing interest. Numerous methods were proposed to solve this problem with different levels of dependency on additional chemical knowledge. In this paper, we propose Retroformer, a novel Transformer-based architecture for retrosynthesis prediction without relying on any cheminformatics tools for molecule editing. Via the proposed local attention head, the model can jointly encode the molecular sequence and graph, and efficiently exchange information between the local reactive region and the global reaction context. Retroformer reaches the new state-of-the-art accuracy for the end-to-end template-free retrosynthesis, and improves over many strong baselines on better molecule and reaction validity. In addition, its generative procedure is highly interpretable and controllable. Overall, Retroformer pushes the limits of the reaction reasoning ability of deep generative models.
    Discovering Exfiltration Paths Using Reinforcement Learning with Attack Graphs. (arXiv:2201.12416v1 [cs.CR])
    Reinforcement learning (RL), in conjunction with attack graphs and cyber terrain, are used to develop reward and state associated with determination of optimal paths for exfiltration of data in enterprise networks. This work builds on previous crown jewels (CJ) identification that focused on the target goal of computing optimal paths that adversaries may traverse toward compromising CJs or hosts within their proximity. This work inverts the previous CJ approach based on the assumption that data has been stolen and now must be quietly exfiltrated from the network. RL is utilized to support the development of a reward function based on the identification of those paths where adversaries desire reduced detection. Results demonstrate promising performance for a sizable network environment.
    Syfer: Neural Obfuscation for Private Data Release. (arXiv:2201.12406v1 [cs.LG])
    Balancing privacy and predictive utility remains a central challenge for machine learning in healthcare. In this paper, we develop Syfer, a neural obfuscation method to protect against re-identification attacks. Syfer composes trained layers with random neural networks to encode the original data (e.g. X-rays) while maintaining the ability to predict diagnoses from the encoded data. The randomness in the encoder acts as the private key for the data owner. We quantify privacy as the number of attacker guesses required to re-identify a single image (guesswork). We propose a contrastive learning algorithm to estimate guesswork. We show empirically that differentially private methods, such as DP-Image, obtain privacy at a significant loss of utility. In contrast, Syfer achieves strong privacy while preserving utility. For example, X-ray classifiers built with DP-image, Syfer, and original data achieve average AUCs of 0.53, 0.78, and 0.86, respectively.
    Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval. (arXiv:2201.12431v1 [cs.CL])
    Retrieval-based language models (R-LM) model the probability of natural language text by combining a standard language model (LM) with examples retrieved from an external datastore at test time. While effective, a major bottleneck of using these models in practice is the computationally costly datastore search, which can be performed as frequently as every time step. In this paper, we present RetoMaton -- retrieval automaton -- which approximates the datastore search, based on (1) clustering of entries into "states", and (2) state transitions from previous entries. This effectively results in a weighted finite automaton built on top of the datastore, instead of representing the datastore as a flat list. The creation of the automaton is unsupervised, and a RetoMaton can be constructed from any text collection: either the original training corpus or from another domain. Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity, or alternatively saves up to 83% of the nearest neighbor searches over kNN-LM (Khandelwal et al., 2020), without hurting perplexity.
    Composing a surrogate observation operator for sequential data assimilation. (arXiv:2201.12514v1 [cs.LG])
    In data assimilation, state estimation is not straightforward when the observation operator is unknown. This study proposes a method for composing a surrogate operator for a true operator. The surrogate model is improved iteratively to decrease the difference between the observations and the results of the surrogate model, and a neural network is adopted in the process. A twin experiment suggests that the proposed method outperforms approaches that use a specific operator that is given tentatively throughout the data assimilation process.
    DiriNet: A network to estimate the spatial and spectral degradation functions. (arXiv:2201.12346v1 [cs.CV])
    The spatial and spectral degradation functions are critical to hyper- and multi-spectral image fusion. However, few work has been payed on the estimation of the degradation functions. To learn the spatial response function and the point spread function from the image pairs to be fused, we propose a Dirichlet network, where both functions are properly constrained. Specifically, the spatial response function is constrained with positivity, while the Dirichlet distribution along with a total variation is imposed on the point spread function. To the best of our knowledge, the neural netwrok and the Dirichlet regularization are exclusively investigated, for the first time, to estimate the degradation functions. Both image degradation and fusion experiments demonstrate the effectiveness and superiority of the proposed Dirichlet network.
    Continual Learning with Recursive Gradient Optimization. (arXiv:2201.12522v1 [cs.LG])
    Learning multiple tasks sequentially without forgetting previous knowledge, called Continual Learning(CL), remains a long-standing challenge for neural networks. Most existing methods rely on additional network capacity or data replay. In contrast, we introduce a novel approach which we refer to as Recursive Gradient Optimization(RGO). RGO is composed of an iteratively updated optimizer that modifies the gradient to minimize forgetting without data replay and a virtual Feature Encoding Layer(FEL) that represents different long-term structures with only task descriptors. Experiments demonstrate that RGO has significantly better performance on popular continual classification benchmarks when compared to the baselines and achieves new state-of-the-art performance on 20-split-CIFAR100(82.22%) and 20-split-miniImageNet(72.63%). With higher average accuracy than Single-Task Learning(STL), this method is flexible and reliable to provide continual learning capabilities for learning models that rely on gradient descent.
    SHAQ: Incorporating Shapley Value Theory into Multi-Agent Q-Learning. (arXiv:2105.15013v3 [cs.LG] UPDATED)
    Value factorisation proves to be a useful technique for multi-agent reinforcement learning (MARL) in global reward game, but the underlying mechanism is not yet fully understood. This paper explores a theoretical framework for value factorisation with interpretability through Shapley value theory. We generalise Shapley value to Markov convex game called Markov Shapley value and apply it as a value factorisation method in global reward game, thanks to the equivalence between these two games. Based on the property of Markov Shapley value, we derive Shapley-Bellman optimality equation to evaluate the optimal Markov Shapley value that is corresponding to optimal joint deterministic policy. Furthermore, we propose Shapley-Bellman operator that is proved to solve Shapley-Bellman optimality equation. With stochastic approximation and some transformation, a new MARL algorithm called Shapley Q-learning (SHAQ) is yielded, the implementation of which is guided by the theoretical results of Shapley-Bellman operator and Markov Shapley value. In experiments, we show that SHAQ possesses not only superior performances on all tasks but also the interpretability that agrees with the theoretical analysis of Markov Shapley value and Shapley-Bellman operator.
    Automatic Audio Captioning using Attention weighted Event based Embeddings. (arXiv:2201.12352v1 [cs.SD])
    Automatic Audio Captioning (AAC) refers to the task of translating audio into a natural language that describes the audio events, source of the events and their relationships. The limited samples in AAC datasets at present, has set up a trend to incorporate transfer learning with Audio Event Detection (AED) as a parent task. Towards this direction, in this paper, we propose an encoder-decoder architecture with light-weight (i.e. with lesser learnable parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing literature with computationally intensive architectures. Further, we provide evidence of the ability of the non-uniform attention weighted encoding generated as a part of our model to facilitate the decoder glance over specific sections of the audio while generating each token.
    Entropy-based Logic Explanations of Neural Networks. (arXiv:2106.06804v4 [cs.AI] UPDATED)
    Explainable artificial intelligence has rapidly emerged since lawmakers have started requiring interpretable models for safety-critical domains. Concept-based neural networks have arisen as explainable-by-design methods as they leverage human-understandable symbols (i.e. concepts) to predict class memberships. However, most of these approaches focus on the identification of the most relevant concepts but do not provide concise, formal explanations of how such concepts are leveraged by the classifier to make predictions. In this paper, we propose a novel end-to-end differentiable approach enabling the extraction of logic explanations from neural networks using the formalism of First-Order Logic. The method relies on an entropy-based criterion which automatically identifies the most relevant concepts. We consider four different case studies to demonstrate that: (i) this entropy-based criterion enables the distillation of concise logic explanations in safety-critical domains from clinical data to computer vision; (ii) the proposed approach outperforms state-of-the-art white-box models in terms of classification accuracy and matches black box performances.
    Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. (arXiv:2201.12417v1 [cs.LG])
    In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.
    Creating Training Sets via Weak Indirect Supervision. (arXiv:2110.03484v2 [cs.LG] UPDATED)
    Creating labeled training sets has become one of the major roadblocks in machine learning. To address this, recent Weak Supervision (WS) frameworks synthesize training labels from multiple potentially noisy supervision sources. However, existing frameworks are restricted to supervision sources that share the same output space as the target task. To extend the scope of usable sources, we formulate Weak Indirect Supervision (WIS), a new research problem for automatically synthesizing training labels based on indirect supervision sources that have different output label spaces. To overcome the challenge of mismatched output spaces, we develop a probabilistic modeling approach, PLRM, which uses user-provided label relations to model and leverage indirect supervision sources. Moreover, we provide a theoretically-principled test of the distinguishability of PLRM for unseen labels, along with a generalization bound. On both image and text classification tasks as well as an industrial advertising application, we demonstrate the advantages of PLRM by outperforming baselines by a margin of 2%-9%.
    Differentially private stochastic expectation propagation (DP-SEP). (arXiv:2111.13219v3 [cs.LG] UPDATED)
    We are interested in privatizing an approximate posterior inference algorithm called Expectation Propagation (EP). EP approximates the posterior by iteratively refining approximations to the local likelihoods, and is known to provide better posterior uncertainties than those by variational inference (VI). However, EP needs a large memory to maintain all local approximates associated with each datapoint in the training data. To overcome this challenge, stochastic expectation propagation (SEP) considers a single unique local factor that captures the average effect of each likelihood term to the posterior and refines it in a way analogous to EP. In terms of privacy, SEP is more tractable than EP because at each refining step of a factor, the remaining factors are fixed and do not depend on other datapoints as in EP, which makes the sensitivity analysis straightforward. We provide a theoretical analysis of the privacy-accuracy trade-off in the posterior estimates under our method, called differentially private stochastic expectation propagation (DP-SEP). Furthermore, we demonstrate the performance of our DP-SEP algorithm evaluated on both synthetic and real-world datasets in terms of the quality of posterior estimates at different levels of guaranteed privacy.
    A Framework for Learning to Request Rich and Contextually Useful Information from Humans. (arXiv:2110.08258v2 [cs.LG] UPDATED)
    When deployed, AI agents will encounter problems that are beyond their autonomous problem-solving capabilities. Leveraging human assistance can help agents overcome their inherent limitations and robustly cope with unfamiliar situations. We present a general interactive framework that enables an agent to determine and request contextually useful information from an assistant, and to incorporate rich forms of responses into its decision-making process. We demonstrate the practicality of our framework on a simulated human-assisted navigation problem. Aided with an assistance-requesting policy learned by our method, a navigation agent achieves up to a 7x improvement in success rate on tasks that take place in previously unseen environments, compared to fully autonomous behavior. We show that the agent can take advantage of different types of information depending on the context, and analyze the benefits and challenges of learning the assistance-requesting policy when the assistant can recursively decompose tasks into subtasks.
    Detecting Corrupted Labels Without Training a Model to Predict. (arXiv:2110.06283v2 [cs.LG] UPDATED)
    Label noise in real-world datasets encodes wrong correlation patterns and impairs the generalization of deep neural networks (DNNs). It is critical to find efficient ways to detect the corrupted patterns. Current methods primarily focus on designing robust training techniques to prevent DNNs from memorizing corrupted patterns. These approaches often require customized training processes and may overfit to corrupted patterns, leading to a performance drop in detection. In this paper, from a more data-centric perspective, we propose a training-free solution to detect corrupted labels. Intuitively, "closer" instances are more likely to share the same clean label. Based on the neighborhood information, we propose two methods: the first one uses "local voting" via checking the noisy label consensuses of nearby features. The second one is a ranking-based approach that scores each instance and filters out a guaranteed number of instances that are likely to be corrupted. We theoretically analyze how the quality of features affects the local voting and provide guidelines for tuning neighborhood size. We also prove the worst-case error bound for the ranking-based method. Experiments with both synthetic and real-world label noise demonstrate our training-free solutions are consistently and significantly improving over most of the training-based baselines. Code is available at github.com/UCSC-REAL/SimiFeat.
    Federated Unlearning via Class-Discriminative Pruning. (arXiv:2110.11794v3 [cs.CV] UPDATED)
    We explore the problem of selectively forgetting categories from trained CNN classification models in the federated learning (FL). Given that the data used for training cannot be accessed globally in FL, our insights probe deep into the internal influence of each channel. Through the visualization of feature maps activated by different channels, we observe that different channels have a varying contribution to different categories in image classification. Inspired by this, we propose a method for scrubbing the model clean of information about particular categories. The method does not require retraining from scratch, nor global access to the data used for training. Instead, we introduce the concept of Term Frequency Inverse Document Frequency (TF-IDF) to quantize the class discrimination of channels. Channels with high TF-IDF scores have more discrimination on the target categories and thus need to be pruned to unlearn. The channel pruning is followed by a fine-tuning process to recover the performance of the pruned model. Evaluated on CIFAR10 dataset, our method accelerates the speed of unlearning by 8.9x for the ResNet model, and 7.9x for the VGG model under no degradation in accuracy, compared to retraining from scratch. For CIFAR100 dataset, the speedups are 9.9x and 8.4x, respectively. We envision this work as a complementary block for FL towards compliance with legal and ethical criteria.
    XFL: eXtreme Function Labeling. (arXiv:2107.13404v2 [cs.CR] UPDATED)
    Reverse engineers would benefit from identifiers like function names, but these are usually unavailable in binaries. Training a machine learning model to predict function names automatically is promising but fundamentally hard due to the enormous number of classes. In this paper, we introduce eXtreme Function Labeling (XFL), an extreme multi-label learning approach to selecting appropriate labels for binary functions. XFL splits function names into tokens, treating each as an informative label akin to the problem of tagging texts in natural language. To capture the semantics of binary code, we introduce DEXTER, a novel function embedding that combines static analysis-based features with local context from the call graph and global context from the entire binary. We demonstrate that XFL outperforms state-of-the-art approaches to function labeling on a dataset of over 10,000 binaries from the Debian project, achieving a precision of 82.5%. We also study combinations of XFL with different published embeddings for binary functions and show that DEXTER consistently improves over the state of the art in information gain. As a result, we are able to show that binary function labeling is best phrased in terms of multi-label learning, and that binary function embeddings benefit from moving beyond just learning from syntax.
    Interpretable Neural Networks with Frank-Wolfe: Sparse Relevance Maps and Relevance Orderings. (arXiv:2110.08105v2 [cs.LG] UPDATED)
    We study the effects of constrained optimization formulations and Frank-Wolfe algorithms for obtaining interpretable neural network predictions. Reformulating the Rate-Distortion Explanations (RDE) method for relevance attribution as a constrained optimization problem provides precise control over the sparsity of relevance maps. This enables a novel multi-rate as well as a relevance-ordering variant of RDE that both empirically outperform standard RDE and other baseline methods in a well-established comparison test. We showcase several deterministic and stochastic variants of the Frank-Wolfe algorithm and their effectiveness for RDE.
    Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent. (arXiv:2110.11442v2 [math.OC] UPDATED)
    We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\kappa$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $\tilde{O} \left(\exp \left( \frac{-T}{\kappa} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowing $\sigma^2$. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower-bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer. However, its rate is slowed down proportional to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $\tilde{O} \left(\exp \left( \frac{-T}{\sqrt{\kappa}} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowledge of $\sigma^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. We empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
    Hermite Polynomial Features for Private Data Generation. (arXiv:2106.05042v2 [cs.LG] UPDATED)
    Kernel mean embedding is a useful tool to represent and compare probability measures. Despite its usefulness, kernel mean embedding considers infinite-dimensional features, which are challenging to handle in the context of differentially private data generation. A recent work proposes to approximate the kernel mean embedding of data distribution using finite-dimensional random features, which yields analytically tractable sensitivity. However, the number of required random features is excessively high, often ten thousand to a hundred thousand, which worsens the privacy-accuracy trade-off. To improve the trade-off, we propose to replace random features with Hermite polynomial features. Unlike the random features, the Hermite polynomial features are ordered, where the features at the low orders contain more information on the distribution than those at the high orders. Hence, a relatively low order of Hermite polynomial features can more accurately approximate the mean embedding of the data distribution compared to a significantly higher number of random features. As demonstrated on several tabular and image datasets, the use of Hermite polynomial features is better suited for private data generation than the use of random features.
    Understanding approximate and unrolled dictionary learning for pattern recovery. (arXiv:2106.06338v2 [cs.LG] UPDATED)
    Dictionary learning consists of finding a sparse representation from noisy data and is a common way to encode data-driven prior knowledge on signals. Alternating minimization (AM) is standard for the underlying optimization, where gradient descent steps alternate with sparse coding procedures. The major drawback of this method is its prohibitive computational cost, making it unpractical on large real-world data sets. This work studies an approximate formulation of dictionary learning based on unrolling and compares it to alternating minimization to find the best trade-off between speed and precision. We analyze the asymptotic behavior and convergence rate of gradients estimates in both methods. We show that unrolling performs better on the support of the inner problem solution and during the first iterations. Finally, we apply unrolling on pattern learning in magnetoencephalography (MEG) with the help of a stochastic algorithm and compare the performance to a state-of-the-art method.
    Offline Meta-Reinforcement Learning with Online Self-Supervision. (arXiv:2107.03974v3 [cs.LG] UPDATED)
    Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adapting, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels to bridge this distribution shift. By not requiring reward labels for online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.
    Safe Model-based Off-policy Reinforcement Learning for Eco-Driving in Connected and Automated Hybrid Electric Vehicles. (arXiv:2105.11640v2 [cs.LG] UPDATED)
    Connected and Automated Hybrid Electric Vehicles have the potential to reduce fuel consumption and travel time in real-world driving conditions. The eco-driving problem seeks to design optimal speed and power usage profiles based upon look-ahead information from connectivity and advanced mapping features. Recently, Deep Reinforcement Learning (DRL) has been applied to the eco-driving problem. While the previous studies synthesize simulators and model-free DRL to reduce online computation, this work proposes a Safe Off-policy Model-Based Reinforcement Learning algorithm for the eco-driving problem. The advantages over the existing literature are three-fold. First, the combination of off-policy learning and the use of a physics-based model improves the sample efficiency. Second, the training does not require any extrinsic rewarding mechanism for constraint satisfaction. Third, the feasibility of trajectory is guaranteed by using a safe set approximated by deep generative models. The performance of the proposed method is benchmarked against a baseline controller representing human drivers, a previously designed model-free DRL strategy, and the wait-and-see optimal solution. In simulation, the proposed algorithm leads to a policy with a higher average speed and a better fuel economy compared to the model-free agent. Compared to the baseline controller, the learned strategy reduces the fuel consumption by more than 21\% while keeping the average speed comparable.
    Increasing-Margin Adversarial (IMA) Training to Improve Adversarial Robustness of Neural Networks. (arXiv:2005.09147v6 [cs.CV] UPDATED)
    Convolutional neural network (CNN) has surpassed traditional methods for medical image classification. However, CNN is vulnerable to adversarial attacks which may lead to disastrous consequences in medical applications. Although adversarial noises are usually generated by attack algorithms, white-noise-induced adversarial samples can exist, and therefore the threats are real. In this study, we propose a novel training method, named IMA, to improve the robust-ness of CNN against adversarial noises. During training, the IMA method increases the margins of training samples in the input space, i.e., moving CNN decision boundaries far away from the training samples to improve robustness. The IMA method is evaluated on publicly available datasets under strong 100-PGD white-box adversarial attacks, and the results show that the proposed method significantly improved CNN classification and segmentation accuracy on noisy data while keeping a high accuracy on clean data. We hope our approach may facilitate the development of robust applications in medical field.
    Kronecker-factored Quasi-Newton Methods for Deep Learning. (arXiv:2102.06737v2 [cs.LG] UPDATED)
    Second-order methods have the capability of accelerating optimization by using much richer curvature information than first-order methods. However, most are impractical for deep learning, where the number of training parameters is huge. In Goldfarb et al. (2020), practical quasi-Newton methods were proposed that approximate the Hessian of a multilayer perceptron (MLP) model by a layer-wise block diagonal matrix where each layer's block is further approximated by a Kronecker product corresponding to the structure of the Hessian restricted to that layer. Here, we extend these methods to enable them to be applied to convolutional neural networks (CNNs), by analyzing the Kronecker-factored structure of the Hessian matrix of convolutional layers. Several improvements to the methods in Goldfarb et al. (2020) are also proposed that can be applied to both MLPs and CNNs. These new methods have memory requirements comparable to first-order methods and much less per-iteration time complexity than those in Goldfarb et al. (2020). Moreover, convergence results are proved for a variant under relatively mild conditions. Finally, we compared the performance of our new methods against several state-of-the-art (SOTA) methods on MLP autoencoder and CNN problems, and found that they outperformed the first-order SOTA methods and performed comparably to the second-order SOTA methods.
    Adversarial Examples for Good: Adversarial Examples Guided Imbalanced Learning. (arXiv:2201.12356v1 [cs.LG])
    Adversarial examples are inputs for machine learning models that have been designed by attackers to cause the model to make mistakes. In this paper, we demonstrate that adversarial examples can also be utilized for good to improve the performance of imbalanced learning. We provide a new perspective on how to deal with imbalanced data: adjust the biased decision boundary by training with Guiding Adversarial Examples (GAEs). Our method can effectively increase the accuracy of minority classes while sacrificing little accuracy on majority classes. We empirically show, on several benchmark datasets, our proposed method is comparable to the state-of-the-art method. To our best knowledge, we are the first to deal with imbalanced learning with adversarial examples.
    AI-SARAH: Adaptive and Implicit Stochastic Recursive Gradient Methods. (arXiv:2102.09700v2 [cs.LG] UPDATED)
    We present an adaptive stochastic variance reduced method with an implicit approach for adaptivity. As a variant of SARAH, our method employs the stochastic recursive gradient yet adjusts step-size based on local geometry. We provide convergence guarantees for finite-sum minimization problems and show a faster convergence than SARAH can be achieved if local geometry permits. Furthermore, we propose a practical, fully adaptive variant, which does not require any knowledge of local geometry and any effort of tuning the hyper-parameters. This algorithm implicitly computes step-size and efficiently estimates local Lipschitz smoothness of stochastic functions. The numerical experiments demonstrate the algorithm's strong performance compared to its classical counterparts and other state-of-the-art first-order methods.
    MobILE: Model-Based Imitation Learning From Observation Alone. (arXiv:2102.10769v3 [cs.LG] UPDATED)
    This paper studies Imitation Learning from Observations alone (ILFO) where the learner is presented with expert demonstrations that consist only of states visited by an expert (without access to actions taken by the expert). We present a provably efficient model-based framework MobILE to solve the ILFO problem. MobILE involves carefully trading off strategic exploration against imitation - this is achieved by integrating the idea of optimism in the face of uncertainty into the distribution matching imitation learning (IL) framework. We provide a unified analysis for MobILE, and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well studied notions of structural complexity. We also show that the ILFO problem is strictly harder than the standard IL problem by presenting an exponential sample complexity separation between IL and ILFO. We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE. Code for implementing the MobILE framework is available at https://github.com/rahulkidambi/MobILE-NeurIPS2021.
    A Joint Learning and Communications Framework for Federated Learning over Wireless Networks. (arXiv:1909.07972v4 [cs.NI] UPDATED)
    In this paper, the problem of training federated learning (FL) algorithms over a realistic wireless network is studied. In particular, in the considered model, wireless users execute an FL algorithm while training their local FL models using their own data and transmitting the trained local FL models to a base station (BS) that will generate a global FL model and send it back to the users. Since all training parameters are transmitted over wireless links, the quality of the training will be affected by wireless factors such as packet errors and the availability of wireless resources. Meanwhile, due to the limited wireless bandwidth, the BS must select an appropriate subset of users to execute the FL algorithm so as to build a global FL model accurately. This joint learning, wireless resource allocation, and user selection problem is formulated as an optimization problem whose goal is to minimize an FL loss function that captures the performance of the FL algorithm. To address this problem, a closed-form expression for the expected convergence rate of the FL algorithm is first derived to quantify the impact of wireless factors on FL. Then, based on the expected convergence rate of the FL algorithm, the optimal transmit power for each user is derived, under a given user selection and uplink resource block (RB) allocation scheme. Finally, the user selection and uplink RB allocation is optimized so as to minimize the FL loss function. Simulation results show that the proposed joint federated learning and communication framework can reduce the FL loss function value by up to 10% and 16%, respectively, compared to: 1) An optimal user selection algorithm with random resource allocation and 2) a standard FL algorithm with random user selection and resource allocation.
    Developing a Machine-Learning Algorithm to Diagnose Age-Related Macular Degeneration. (arXiv:2201.12384v1 [eess.IV])
    Today, more than 12 million people over the age of 40 suffer from ocular diseases. Most commonly, older patients are susceptible to age related macular degeneration, an eye disease that causes blurring of the central vision due to the deterioration of the retina. The former can only be detected through complex and expensive imaging software, markedly a visual field test; this leaves a significant population with untreated eye disease and holds them at risk for complete vision loss. The use of machine learning algorithms has been proposed for treating eye disease. However, the development of these models is limited by a lack of understanding regarding appropriate model and training parameters to maximize model performance. In our study, we address these points by generating 6 models, each with a learning rate of 1 * 10^n where n is 0, -1, -2, ... -6, and calculated a f1 score for each of the models. Our analysis shows that sample imbalance is a key challenge in training of machine learning models and can result in deceptive improvements in training cost which does not translate to true improvements in model predictive performance. Considering the wide ranging impact of the disease and its adverse effects, we developed a machine learning algorithm to treat the same. We trained our model on varying eye disease datasets consisting of over 5000 patients, and the pictures of their infected eyes. In the future, we hope this model is used extensively, especially in areas that are under-resourced, to better diagnose eye disease and improve well being for humanity.
    Flashlight: Enabling Innovation in Tools for Machine Learning. (arXiv:2201.12465v1 [cs.LG])
    As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together.
    Towards the PAC Learnability of Nash Equilibrium. (arXiv:2108.07472v4 [cs.GT] UPDATED)
    Nash equilibrium (NE) is one of the most important solution concepts in game theory and has broad applications in machine learning research. While tremendous empirical success has been reported on learning approximate NE, whether NE is Probably Approximately Correct (PAC) learnable has not been addressed yet. In this paper, we make the first effort to study the PAC learnability of NE in normal-form games. We prove that NE is agnostic PAC learnable if the hypothesis class for Nash predictors has bounded covering numbers. Theoretically, we derive such a result by constructing a self-supervised PAC learning framework based on a commonly-used objective for Nash approximation. Empirically, we provide numerical evidence showing that a Nash predictor trained under our PAC learning framework, even with the most straightforward neural architecture, can efficiently learn the NE for games sampled from the same unknown distribution. Our results justify the feasibility of approximating NE through purely data-driven approaches, which benefits both game theorists and machine learning practitioners.
    Explaining Graph-level Predictions with Communication Structure-Aware Cooperative Games. (arXiv:2201.12380v1 [cs.LG])
    Explaining predictions made by machine learning models is important and have attracted an increased interest. The Shapley value from cooperative game theory has been proposed as a prime approach to compute feature importances towards predictions, especially for images, text, tabular data, and recently graph neural networks (GNNs) on graphs. In this work, we revisit the appropriateness of the Shapley value for graph explanation, where the task is to identify the most important subgraph and constituent nodes for graph-level predictions. We purport that the Shapley value is a no-ideal choice for graph data because it is by definition not structure-aware. We propose a Graph Structure-aware eXplanation (GStarX) method to leverage the critical graph structure information to improve the explanation. Specifically, we propose a scoring function based on a new structure-aware value from the cooperative game theory called the HN value. When used to score node importance, the HN value utilizes graph structures to attribute cooperation surplus between neighbor nodes, resembling message passing in GNNs, so that node importance scores reflect not only the node feature importance, but also the structural roles. We demonstrate that GstarX produces qualitatively more intuitive explanations, and quantitatively improves over strong baselines on chemical graph property prediction and text graph sentiment classification.
    Hyperbolic Neural Networks for Molecular Generation. (arXiv:2201.12825v1 [cs.LG])
    With the recent advance of deep learning, neural networks have been extensively used for the task of molecular generation. Many deep generators extract atomic relations from molecular graphs and ignore hierarchical information at both atom and molecule levels. In order to extract such hierarchical information, we propose a novel hyperbolic generative model. Our model contains three parts: first, a fully hyperbolic junction-tree encoder-decoder that embeds the hierarchical information of the molecules in the latent hyperbolic space; second, a latent generative adversarial network for generating the latent embeddings; third, a molecular generator that inherits the decoders from the first part and the latent generator from the second part. We evaluate our model on the ZINC dataset using the MOSES benchmarking platform and achieve competitive results, especially in metrics about structural similarity.
    Universality of Winning Tickets: A Renormalization Group Perspective. (arXiv:2110.03210v2 [cs.LG] UPDATED)
    Foundational work on the Lottery Ticket Hypothesis has suggested an exciting corollary: winning tickets found in the context of one task can be transferred to similar tasks, possibly even across different architectures. This has become of broad interest, but methods to study this universality are lacking. We make use of renormalization group theory, a powerful tool from theoretical physics, to address this need. We find that iterative magnitude pruning, the principal algorithm used for discovering winning tickets, is a renormalization group scheme, and can be viewed as inducing a flow in parameter space. We demonstrate that ResNet-50 models with transferable winning tickets have flows with common properties, as would be expected from the theory. Similar observations are made for BERT models, with evidence that their flows are near fixed points. Additionally, we leverage our framework to study winning tickets transferred across ResNet architectures, observing that smaller models have flows with more uniform properties than larger models, complicating transfer between them.
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v5 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose directed flow imbalance measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing GNN methods, and can naturally incorporate node features, unlike existing spectral methods. Experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 methods from the literature, for a wide range of noise and sparsity levels and graph structures and topologies, and even outperforms supervised methods.
    Adversarial Robustness in Deep Learning: Attacks on Fragile Neurons. (arXiv:2201.12347v1 [cs.LG])
    We identify fragile and robust neurons of deep learning architectures using nodal dropouts of the first convolutional layer. Using an adversarial targeting algorithm, we correlate these neurons with the distribution of adversarial attacks on the network. Adversarial robustness of neural networks has gained significant attention in recent times and highlights intrinsic weaknesses of deep learning networks against carefully constructed distortion applied to input images. In this paper, we evaluate the robustness of state-of-the-art image classification models trained on the MNIST and CIFAR10 datasets against the fast gradient sign method attack, a simple yet effective method of deceiving neural networks. Our method identifies the specific neurons of a network that are most affected by the adversarial attack being applied. We, therefore, propose to make fragile neurons more robust against these attacks by compressing features within robust neurons and amplifying the fragile neurons proportionally.
    Gradient Assisted Learning. (arXiv:2106.01425v3 [cs.LG] UPDATED)
    In decentralized settings, collaborations between different organizations, such as financial institutions, medical centers, and retail markets, are crucial to providing improved service and performance. However, the underlying organizations may have little interest in sharing their private data, proprietary models, and objective functions. These privacy requirements have created new challenges for organizational collaboration. In this work, we propose Gradient Assisted Learning (GAL), a new method for multiple organizations to assist each other in supervised learning tasks without sharing data, models, and objective functions. In this framework, all participants collaboratively optimize the aggregate of local loss functions, and each participant autonomously builds its own model by iteratively fitting the gradients of the objective function. Experimental studies demonstrate that Gradient Assisted Learning can achieve performance close to centralized learning when all data, models, and objective functions are fully disclosed.
    How Attentive are Graph Attention Networks?. (arXiv:2105.14491v3 [cs.LG] UPDATED)
    Graph Attention Networks (GATs) are one of the most popular GNN architectures and are considered as the state-of-the-art architecture for representation learning with graphs. In GAT, every node attends to its neighbors given its own representation as the query. However, in this paper we show that GAT computes a very limited kind of attention: the ranking of the attention scores is unconditioned on the query node. We formally define this restricted kind of attention as static attention and distinguish it from a strictly more expressive dynamic attention. Because GATs use a static attention mechanism, there are simple graph problems that GAT cannot express: in a controlled problem, we show that static attention hinders GAT from even fitting the training data. To remove this limitation, we introduce a simple fix by modifying the order of operations and propose GATv2: a dynamic graph attention variant that is strictly more expressive than GAT. We perform an extensive evaluation and show that GATv2 outperforms GAT across 11 OGB and other benchmarks while we match their parametric costs. Our code is available at https://github.com/tech-srl/how_attentive_are_gats . GATv2 is available as part of the PyTorch Geometric library, the Deep Graph Library, and the TensorFlow GNN library.
    Quantum kernels for real-world predictions based on electronic health records. (arXiv:2112.06211v2 [quant-ph] UPDATED)
    In recent years, research on near-term quantum machine learning has explored how classical machine learning algorithms endowed with access to quantum kernels (similarity measures) can outperform their purely classical counterparts. Although theoretical work has shown provable advantage on synthetic data sets, no work done to date has studied empirically whether quantum advantage is attainable and with what kind of data set. In this paper, we report the first systematic investigation of empirical quantum advantage (EQA) in healthcare and life sciences and propose an end-to-end framework to study EQA. We selected electronic health records (EHRs) data subsets and created a configuration space of 5-20 features and 200-300 training samples. For each configuration coordinate, we trained classical support vector machine (SVM) models based on radial basis function (RBF) kernels and quantum models with custom kernels using an IBM quantum computer, making this one of the largest quantum machine learning experiments to date. We empirically identified regimes where quantum kernels could provide advantage on a particular data set and introduced a terrain ruggedness index, a metric to help quantitatively estimate how the accuracy of a given model will perform as a function of the number of features and sample size. The generalizable framework introduced here represents a key step towards a priori identification of data sets where quantum advantage could exist.
    Physics-informed neural networks to learn cardiac fiber orientation from multiple electroanatomical maps. (arXiv:2201.12362v1 [eess.IV])
    We propose FiberNet, a method to estimate in-vivo the cardiac fiber architecture of the human atria from multiple catheter recordings of the electrical activation. Cardiac fibers play a central rolein the electro-mechanical function of the heart, yet they aredifficult to determine in-vivo, and hence rarely truly patient-specificin existing cardiac models.FiberNet learns the fibers arrangement by solvingan inverse problem with physics-informed neural networks. The inverse problem amounts to identifyingthe conduction velocity tensor of a cardiac propagation modelfrom a set of sparse activation maps. The use of multiple mapsenables the simultaneous identification of all the componentsof the conduction velocity tensor, including the local fiber angle.We extensively test FiberNet on synthetic 2-D and 3-D examples, diffusion tensor fibers, and a patient-specific case. We show that 3 maps are sufficient to accurately capture the fibers, also in thepresence of noise. With fewer maps, the role of regularization becomesprominent. Moreover, we show that the fitted model can robustlyreproduce unseen activation maps. We envision that FiberNet will help the creation of patient-specific models for personalized medicine.The full code is available at this http URL
    A deep Q-learning method for optimizing visual search strategies in backgrounds of dynamic noise. (arXiv:2201.12385v1 [cs.CV])
    Humans process visual information with varying resolution (foveated visual system) and explore images by orienting through eye movements the high-resolution fovea to points of interest. The Bayesian ideal searcher (IS) that employs complete knowledge of task-relevant information optimizes eye movement strategy and achieves the optimal search performance. The IS can be employed as an important tool to evaluate the optimality of human eye movements, and potentially provide guidance to improve human observer visual search strategies. Najemnik and Geisler (2005) derived an IS for backgrounds of spatial 1/f noise. The corresponding template responses follow Gaussian distributions and the optimal search strategy can be analytically determined. However, the computation of the IS can be intractable when considering more realistic and complex backgrounds such as medical images. Modern reinforcement learning methods, successfully applied to obtain optimal policy for a variety of tasks, do not require complete knowledge of the background generating functions and can be potentially applied to anatomical backgrounds. An important first step is to validate the optimality of the reinforcement learning method. In this study, we investigate the ability of a reinforcement learning method that employs Q-network to approximate the IS. We demonstrate that the search strategy corresponding to the Q-network is consistent with the IS search strategy. The findings show the potential of the reinforcement learning with Q-network approach to estimate optimal eye movement planning with real anatomical backgrounds.
    Hyperparameter-free deep active learning for regression problems via query synthesis. (arXiv:2201.12632v1 [cs.LG])
    In the past decade, deep active learning (DAL) has heavily focused upon classification problems, or problems that have some 'valid' data manifolds, such as natural languages or images. As a result, existing DAL methods are not applicable to a wide variety of important problems -- such as many scientific computing problems -- that involve regression on relatively unstructured input spaces. In this work we propose the first DAL query-synthesis approach for regression problems. We frame query synthesis as an inverse problem and use the recently-proposed neural-adjoint (NA) solver to efficiently find points in the continuous input domain that optimize the query-by-committee (QBC) criterion. Crucially, the resulting NA-QBC approach removes the one sensitive hyperparameter of the classical QBC active learning approach - the "pool size"- making NA-QBC effectively hyperparameter free. This is significant because DAL methods can be detrimental, even compared to random sampling, if the wrong hyperparameters are chosen. We evaluate Random, QBC and NA-QBC sampling strategies on four regression problems, including two contemporary scientific computing problems. We find that NA-QBC achieves better average performance than random sampling on every benchmark problem, while QBC can be detrimental if the wrong hyperparameters are chosen.
    Who Leads and Who Follows in Strategic Classification?. (arXiv:2106.12529v2 [cs.LG] UPDATED)
    As predictive models are deployed into the real world, they must increasingly contend with strategic behavior. A growing body of work on strategic classification treats this problem as a Stackelberg game: the decision-maker "leads" in the game by deploying a model, and the strategic agents "follow" by playing their best response to the deployed model. Importantly, in this framing, the burden of learning is placed solely on the decision-maker, while the agents' best responses are implicitly treated as instantaneous. In this work, we argue that the order of play in strategic classification is fundamentally determined by the relative frequencies at which the decision-maker and the agents adapt to each other's actions. In particular, by generalizing the standard model to allow both players to learn over time, we show that a decision-maker that makes updates faster than the agents can reverse the order of play, meaning that the agents lead and the decision-maker follows. We observe in standard learning settings that such a role reversal can be desirable for both the decision-maker and the strategic agents. Finally, we show that a decision-maker with the freedom to choose their update frequency can induce learning dynamics that converge to Stackelberg equilibria with either order of play.
    Discovering and Explaining the Representation Bottleneck of DNNs. (arXiv:2111.06236v3 [cs.LG] UPDATED)
    This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple interactions and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and human beings, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck. Furthermore, we propose a loss to encourage/penalize the learning of interactions of specific complexities, and analyze the representation capacities of interactions of different complexities.
    Detecting Electric Vehicle Battery Failure via Dynamic-VAE. (arXiv:2201.12358v1 [cs.LG])
    In this note, we describe a battery failure detection pipeline backed up by deep learning models. We first introduce a large-scale Electric vehicle (EV) battery dataset including cleaned battery-charging data from hundreds of vehicles. We then formulate battery failure detection as an outlier detection problem, and propose a new algorithm named Dynamic-VAE based on dynamic system and variational autoencoders. We validate the performance of our proposed algorithm against several baselines on our released dataset and demonstrated the effectiveness of Dynamic-VAE.
    DearFSAC: An Approach to Optimizing Unreliable Federated Learning via Deep Reinforcement Learning. (arXiv:2201.12701v1 [cs.LG])
    In federated learning (FL), model aggregation has been widely adopted for data privacy. In recent years, assigning different weights to local models has been used to alleviate the FL performance degradation caused by differences between local datasets. However, when various defects make the FL process unreliable, most existing FL approaches expose weak robustness. In this paper, we propose the DEfect-AwaRe federated soft actor-critic (DearFSAC) to dynamically assign weights to local models to improve the robustness of FL. The deep reinforcement learning algorithm soft actor-critic is adopted for near-optimal performance and stable convergence. Besides, an auto-encoder is trained to output low-dimensional embedding vectors that are further utilized to evaluate model quality. In the experiments, DearFSAC outperforms three existing approaches on four datasets for both independent and identically distributed (IID) and non-IID settings under defective scenarios.
    Hybrid Random Features. (arXiv:2110.04367v3 [cs.LG] UPDATED)
    We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021). By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF-mechanism provides strong theoretical guarantees - unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct exhaustive empirical evaluation of HRF ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating its quality in a wide spectrum of machine learning problems.
    SemiFL: Communication Efficient Semi-Supervised Federated Learning with Unlabeled Clients. (arXiv:2106.01432v3 [cs.LG] UPDATED)
    Federated Learning (FL) allows training machine learning models by using the computation and private data resources of many distributed clients such as smartphones and IoT devices. Most existing works on FL assume the clients have ground-truth labels. However, in many practical scenarios, clients do not have task-specific labels, e.g., due to a lack of expertise. This work considers a server that hosts a labeled dataset and wishes to leverage clients with unlabeled data for supervised learning. We propose a new FL framework referred to as SemiFL to address Semi-Supervised Federated Learning (SSFL). In SemiFL, clients have completely unlabeled data, while the server has a small amount of labeled data. SemiFL is communication efficient since it allows to train client-side unsupervised data for multiple local epochs. We demonstrate several strategies in SemiFL to enhance efficiency and prediction, and develop intuitions of why they work. In particular, we provide a theoretical analysis of the use of strong data augmentation for semi-supervised learning, which can be interesting in its own right. Extensive empirical evaluations demonstrate that SemiFL can significantly improve the performance of a labeled server with unlabeled clients. Moreover, SemiFL can outperform many existing SSFL methods, and perform competitively with the state-of-the-art FL and centralized SSL results. For instance, in the standard communication efficient scenarios, SemiFL gives 93% accuracy on the CIFAR10 dataset with only 4000 labeled samples at the server. Such accuracy is 2% away from the result trained from 50000 fully labeled data, and improves 30% upon existing SSFL methods.
    Communication-Efficient Consensus Mechanism for Federated Reinforcement Learning. (arXiv:2201.12718v1 [cs.LG])
    The paper considers independent reinforcement learning (IRL) for multi-agent decision-making process in the paradigm of federated learning (FL). We show that FL can clearly improve the policy performance of IRL in terms of training efficiency and stability. However, since the policy parameters are trained locally and aggregated iteratively through a central server in FL, frequent information exchange incurs a large amount of communication overheads. To reach a good balance between improving the model's convergence performance and reducing the required communication and computation overheads, this paper proposes a system utility function and develops a consensus-based optimization scheme on top of the periodic averaging method, which introduces the consensus algorithm into FL for the exchange of a model's local gradients. This paper also provides novel convergence guarantees for the developed method, and demonstrates its superior effectiveness and efficiency in improving the system utility value through theoretical analyses and numerical simulation results.
    Reinforcement learning for pursuit and evasion of microswimmers at low Reynolds number. (arXiv:2106.08609v2 [physics.flu-dyn] UPDATED)
    We consider a model of two competing microswimming agents engaged in a pursue-evasion task within a low-Reynolds-number environment. Agents can only perform simple manoeuvres and sense hydrodynamic disturbances, which provide ambiguous (partial) information about the opponent's position and motion. We frame the problem as a zero-sum game: the pursuer has to capture the evader in the shortest time, while the evader aims at deferring capture as long as possible. We show that the agents, trained via Adversarial Reinforcement Learning, are able to overcome partial-observability by discovering increasingly complex sequences of moves and countermoves that outperform known heuristic strategies and exploit the hydrodynamic environment.
    LSB: Local Self-Balancing MCMC in Discrete Spaces. (arXiv:2109.03867v2 [cs.AI] UPDATED)
    We present the Local Self-Balancing sampler (LSB), a local Markov Chain Monte Carlo (MCMC) method for sampling in purely discrete domains, which is able to autonomously adapt to the target distribution and to reduce the number of target evaluations required to converge. LSB is based on (i) a parametrization of locally balanced proposals, (ii) an objective function based on mutual information and (iii) a self-balancing learning procedure, which minimises the proposed objective to update the proposal parameters. Experiments on energy-based models and Bayesian networks show that LSB converges using a smaller number of queries to the oracle distribution compared to recent local MCMC samplers.
    Sharing Behavior in Ride-hailing Trips: A Machine Learning Inference Approach. (arXiv:2201.12696v1 [cs.LG])
    Ride-hailing is rapidly changing urban and personal transportation. Ride sharing or pooling is important to mitigate negative externalities of ride-hailing such as increased congestion and environmental impacts. However, there lacks empirical evidence on what affect trip-level sharing behavior in ride-hailing. Using a novel dataset from all ride-hailing trips in Chicago in 2019, we show that the willingness of riders to request a shared ride has monotonically decreased from 27.0% to 12.8% throughout the year, while the trip volume and mileage have remained statistically unchanged. We find that the decline in sharing preference is due to an increased per-mile costs of shared trips and shifting shorter trips to solo. Using ensemble machine learning models, we find that the travel impedance variables (trip cost, distance, and duration) collectively contribute to 95% and 91% of the predictive power in determining whether a trip is requested to share and whether it is successfully shared, respectively. Spatial and temporal attributes, sociodemographic, built environment, and transit supply variables do not entail predictive power at the trip level in presence of these travel impedance variables. This implies that pricing signals are most effective to encourage riders to share their rides. Our findings shed light on sharing behavior in ride-hailing trips and can help devise strategies that increase shared ride-hailing, especially as the demand recovers from pandemic.
    Power-law escape rate of SGD. (arXiv:2105.09557v2 [cs.LG] UPDATED)
    Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a random time change. Using this formalism, we show that the log loss barrier $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$ between a local minimum $\theta^*$ and a saddle $\theta^s$ determines the escape rate of SGD from the local minimum, contrary to the previous results borrowing from physics that the linear loss barrier $\Delta L=L(\theta^s)-L(\theta^*)$ decides the escape rate. Our escape-rate formula strongly depends on the typical magnitude $h^*$ and the number $n$ of the outlier eigenvalues of the Hessian. This result explains an empirical fact that SGD prefers flat minima with low effective dimensions, giving an insight into implicit biases of SGD.
    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. (arXiv:2107.07511v3 [cs.LG] UPDATED)
    Black-box machine learning learning methods are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Distribution-free uncertainty quantification (distribution-free UQ) is a user-friendly paradigm for creating statistically rigorous confidence intervals/sets for such predictions. Critically, the intervals/sets are valid without distributional assumptions or model assumptions, possessing explicit guarantees even with finitely many datapoints. Moreover, they adapt to the difficulty of the input; when the input example is difficult, the uncertainty intervals/sets are large, signaling that the model might be wrong. Without much work and without retraining, one can use distribution-free methods on any underlying algorithm, such as a neural network, to produce confidence sets guaranteed to contain the ground truth with a user-specified probability, such as 90%. Indeed, the methods are easy-to-understand and general, applying to many modern prediction problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is aimed at a reader interested in the practical implementation of distribution-free UQ who is not necessarily a statistician. We lead the reader through the practical theory and applications of distribution-free UQ, beginning with conformal prediction and culminating with distribution-free control of any risk, such as the false-discovery rate, false positive rate of out-of-distribution detection, and so on. We will include many explanatory illustrations, examples, and code samples in Python, with PyTorch syntax. The goal is to provide the reader a working understanding of distribution-free UQ, allowing them to put confidence intervals on their algorithms, with one self-contained document.
    Self Semi Supervised Neural Architecture Search for Semantic Segmentation. (arXiv:2201.12646v1 [cs.CV])
    In this paper, we propose a Neural Architecture Search strategy based on self supervision and semi-supervised learning for the task of semantic segmentation. Our approach builds an optimized neural network (NN) model for this task by jointly solving a jigsaw pretext task discovered with self-supervised learning over unlabeled training data, and, exploiting the structure of the unlabeled data with semi-supervised learning. The search of the architecture of the NN model is performed by dynamic routing using a gradient descent algorithm. Experiments on the Cityscapes and PASCAL VOC 2012 datasets demonstrate that the discovered neural network is more efficient than a state-of-the-art hand-crafted NN model with four times less floating operations.
    From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers. (arXiv:2107.07999v2 [cs.LG] UPDATED)
    In this paper we provide, to the best of our knowl-edge, the first comprehensive approach for in-corporating various masking mechanisms intoTransformers architectures in a scalable way. Weshow that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of thisgeneral mechanism. However by casting the prob-lem as a topological (graph-based) modulation ofunmasked attention, we obtain several results un-known before, including efficientd-dimensionalRPE-masking and graph-kernel masking. Weleverage many mathematical techniques rangingfrom spectral analysis through dynamic program-ming and random walks to new algorithms forsolving Markov processes on graphs. We providea corresponding empirical evaluation.
    Mitigating Covariate Shift in Imitation Learning via Offline Data Without Great Coverage. (arXiv:2106.03207v3 [cs.LG] UPDATED)
    This paper studies offline Imitation Learning (IL) where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage on the expert state-action traces (and with no necessity for a global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theory results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with behavior policies whose performances are less than half of that of the expert, MILO still successfully imitates with an extremely low number of expert state-action pairs while traditional offline IL method such as behavior cloning (BC) fails completely. Source code is provided at https://github.com/jdchang1/milo.
    Weisfeiler and Lehman Go Cellular: CW Networks. (arXiv:2106.12575v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are limited in their expressive power, struggle with long-range interactions and lack a principled way to model higher-order structures. These problems can be attributed to the strong coupling between the computational graph and the input graph structure. The recently proposed Message Passing Simplicial Networks naturally decouple these elements by performing message passing on the clique complex of the graph. Nevertheless, these models can be severely constrained by the rigid combinatorial structure of Simplicial Complexes (SCs). In this work, we extend recent theoretical results on SCs to regular Cell Complexes, topological objects that flexibly subsume SCs and graphs. We show that this generalisation provides a powerful set of graph "lifting" transformations, each leading to a unique hierarchical message passing procedure. The resulting methods, which we collectively call CW Networks (CWNs), are strictly more powerful than the WL test and not less powerful than the 3-WL test. In particular, we demonstrate the effectiveness of one such scheme, based on rings, when applied to molecular graph problems. The proposed architecture benefits from provably larger expressivity than commonly used GNNs, principled modelling of higher-order signals and from compressing the distances between nodes. We demonstrate that our model achieves state-of-the-art results on a variety of molecular datasets.
    Nys-Newton: Nystr\"om-Approximated Curvature for Stochastic Optimization. (arXiv:2110.08577v2 [math.OC] UPDATED)
    Second-order optimization methods are among the most widely used optimization approaches for convex optimization problems, and have recently been used to optimize non-convex optimization problems such as deep learning models. The widely used second-order optimization methods such as quasi-Newton methods generally provide curvature information by approximating the Hessian using the secant equation. However, the secant equation becomes insipid in approximating the Newton step owing to its use of the first-order derivatives. In this study, we propose an approximate Newton sketch-based stochastic optimization algorithm for large-scale empirical risk minimization. Specifically, we compute a partial column Hessian of size ($d\times m$) with $m\ll d$ randomly selected variables, then use the \emph{Nystr\"om method} to better approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step ($\Delta\boldsymbol{w}$) without computing and storing the full Hessian or its inverse. We then integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradient methods. The results of numerical experiments on both convex and non-convex functions show that the proposed approach was able to obtain a better approximation of Newton\textquotesingle s method, exhibiting performance competitive with that of state-of-the-art first-order and stochastic quasi-Newton methods. Furthermore, we provide a theoretical convergence analysis for convex functions.
    Selective Regression Under Fairness Criteria. (arXiv:2110.15403v2 [cs.LG] UPDATED)
    Selective regression allows abstention from prediction if the confidence to make an accurate prediction is not sufficient. In general, by allowing a reject option, one expects the performance of a regression model to increase at the cost of reducing coverage (i.e., by predicting on fewer samples). However, as we show, in some cases, the performance of a minority subgroup can decrease while we reduce the coverage, and thus selective regression can magnify disparities between different sensitive subgroups. Motivated by these disparities, we propose new fairness criteria for selective regression requiring the performance of every subgroup to improve with a decrease in coverage. We prove that if a feature representation satisfies the sufficiency criterion or is calibrated for mean and variance, than the proposed fairness criteria is met. Further, we introduce two approaches to mitigate the performance disparity across subgroups: (a) by regularizing an upper bound of conditional mutual information under a Gaussian assumption and (b) by regularizing a contrastive loss for conditional mean and conditional variance prediction. The effectiveness of these approaches is demonstrated on synthetic and real-world datasets.
    Cyclic Graph Attentive Match Encoder (CGAME): A Novel Neural Network For OD Estimation. (arXiv:2111.14625v2 [cs.LG] UPDATED)
    Origin-Destination Estimation plays an important role in traffic management and traffic simulation in the era of Intelligent Transportation System (ITS). Nevertheless, previous model-based methods face the under-determined challenge, thus desperate demand for additional assumptions and extra data exists. Deep learning provides an ideal data-based method for connecting inputs and outputs by probabilistic distribution transformation. While relevant researches of applying deep learning into OD estimation are limited due to the challenges lying in data transformation across representation space, especially from dynamic spatial-temporal space to heterogeneous graph in this issue. To address it, we propose Cyclic Graph Attentive Matching Encoder (C-GAME) based on a novel Graph Matcher with double-layer attention mechanism. It realizes effective information exchange and establishes coupling relationship across underlying feature space. The proposed model achieves state-of-the-art results in experiments, and offers a novel framework for inference task across spaces in prospective employments.
    Detecting Multi-Sensor Fusion Errors in Advanced Driver-Assistance Systems. (arXiv:2109.06404v2 [cs.RO] UPDATED)
    Advanced Driver-Assistance Systems (ADAS) have been thriving and widely deployed in recent years. In general, these systems receive sensor data, compute driving decisions, and output control signals to the vehicles. To smooth out the uncertainties brought by sensor inputs, they usually leverage multi-sensor fusion (MSF) to fuse the sensor inputs and produce a more reliable understanding of the surroundings. However, multi-sensor fusion cannot completely eliminate the uncertainties since it lacks the knowledge about which sensor provides the most accurate data and how to optimally integrate the data provided by the sensors. As a result, critical consequences might happen unexpectedly. In this work, we observed that the popular multi-sensor fusion methods in an industry-grade ADAS can mislead the car control and result in serious safety hazards. Misbehavior can happen regardless of the used fusion methods and the accurate data from at least one sensor. We call such errors as fusion errors and develop a novel evolutionary-based domain-specific search framework, FusED, for the efficient detection of fusion errors. We further apply causality analysis to show that the found fusion errors are indeed caused by the multi-sensor fusion method. We evaluate our framework on two widely used multi-sensor fusion methods in two driving environments. Experimental results show that FusED identifies more than 150 fusion errors. Finally, we provide several suggestions to improve the multi-sensor fusion methods we study.
    Deep learning via message passing algorithms based on belief propagation. (arXiv:2110.14583v2 [cs.LG] UPDATED)
    Message-passing algorithms based on the Belief Propagation (BP) equations constitute a well-known distributed computational scheme. It is exact on tree-like graphical models and has also proven to be effective in many problems defined on graphs with loops (from inference to optimization, from signal processing to clustering). The BP-based scheme is fundamentally different from stochastic gradient descent (SGD), on which the current success of deep networks is based. In this paper, we present and adapt to mini-batch training on GPUs a family of BP-based message-passing algorithms with a reinforcement field that biases distributions towards locally entropic solutions. These algorithms are capable of training multi-layer neural networks with discrete weights and activations with performance comparable to SGD-inspired heuristics (BinaryNet) and are naturally well-adapted to continual learning. Furthermore, using these algorithms to estimate the marginals of the weights allows us to make approximate Bayesian predictions that have higher accuracy than point-wise solutions.
    On the Convergence and Calibration of Deep Learning with Differential Privacy. (arXiv:2106.07830v4 [cs.LG] UPDATED)
    In deep learning with differential privacy (DP), the neural network achieves the privacy usually at the cost of slower convergence (and thus lower performance) than its non-private counterpart. This work gives the first convergence analysis of the DP deep learning, through the lens of training dynamics and the neural tangent kernel (NTK). Our convergence theory successfully characterizes the effects of two key components in the DP training: the per-sample clipping and the noise addition. Our analysis not only initiates a general principled framework to understand the DP deep learning with any network architecture and loss function, but also motivates a new clipping method -- the global clipping, that significantly improves the convergence, as well as preserves the same DP guarantee and computational efficiency as the existing method, which we term as local clipping. Theoretically speaking, we precisely characterize the effect of per-sample clipping on the NTK matrix and show that the noise level of DP optimizers does not affect the convergence in the gradient flow regime. In particular, the local clipping almost certainly breaks the positive semi-definiteness of NTK, which can be preserved by our global clipping. Consequently, DP gradient descent (GD) with global clipping converge monotonically to zero loss, which is often violated by the existing DP-GD. Notably, our analysis framework easily extends to other optimizers, e.g., DP-Adam. We demonstrate through numerous experiments that DP optimizers equipped with global clipping perform strongly on classification and regression tasks. In addition, our global clipping is surprisingly effective at learning calibrated classifiers, in contrast to the existing DP classifiers which are oftentimes over-confident and unreliable. Implementation-wise, the new clipping can be realized by inserting one line of code into the Pytorch Opacus library.
    Best of both worlds: local and global explanations with human-understandable concepts. (arXiv:2106.08641v2 [cs.LG] UPDATED)
    Interpretability techniques aim to provide the rationale behind a model's decision, typically by explaining either an individual prediction (local explanation, e.g. 'why is this patient diagnosed with this condition') or a class of predictions (global explanation, e.g. 'why is this set of patients diagnosed with this condition in general'). While there are many methods focused on either one, few frameworks can provide both local and global explanations in a consistent manner. In this work, we combine two powerful existing techniques, one local (Integrated Gradients, IG) and one global (Testing with Concept Activation Vectors), to provide local and global concept-based explanations. We first sanity check our idea using two synthetic datasets with a known ground truth, and further demonstrate with a benchmark natural image dataset. We test our method with various concepts, target classes, model architectures and IG parameters (e.g. baselines). We show that our method improves global explanations over vanilla TCAV when compared to ground truth, and provides useful local insights. Finally, a user study demonstrates the usefulness of the method compared to no or global explanations only. We hope our work provides a step towards building bridges between many existing local and global methods to get the best of both worlds.
    Provably Efficient Reinforcement Learning in Decentralized General-Sum Markov Games. (arXiv:2110.05682v3 [cs.LG] UPDATED)
    This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games through decentralized multi-agent reinforcement learning. Given the fundamental difficulty of calculating a Nash equilibrium (NE), we instead aim at finding a coarse correlated equilibrium (CCE), a solution concept that generalizes NE by allowing possible correlations among the agents' strategies. We propose an algorithm in which each agent independently runs optimistic V-learning (a variant of Q-learning) to efficiently explore the unknown environment, while using a stabilized online mirror descent (OMD) subroutine for policy updates. We show that the agents can find an $\epsilon$-approximate CCE in at most $\widetilde{O}( H^6S A /\epsilon^2)$ episodes, where $S$ is the number of states, $A$ is the size of the largest individual action space, and $H$ is the length of an episode. This appears to be the first sample complexity result for learning in generic general-sum Markov games. Our results rely on a novel investigation of an anytime high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which would be of independent interest. One key feature of our algorithm is that it is fully \emph{decentralized}, in the sense that each agent has access to only its local information, and is completely oblivious to the presence of others. This way, our algorithm can readily scale up to an arbitrary number of agents, without suffering from the exponential dependence on the number of agents.
    Big Data for Traffic Estimation and Prediction: A Survey of Data and Tools. (arXiv:2103.11824v2 [cs.LG] UPDATED)
    Big data has been used widely in many areas including the transportation industry. Using various data sources, traffic states can be well estimated and further predicted for improving the overall operation efficiency. Combined with this trend, this study presents an up-to-date survey of open data and big data tools used for traffic estimation and prediction. Different data types are categorized and the off-the-shelf tools are introduced. To further promote the use of big data for traffic estimation and prediction tasks, challenges and future directions are given for future studies.
    BR-NPA: A Non-Parametric High-Resolution Attention Model to improve the Interpretability of Attention. (arXiv:2106.02566v3 [cs.CV] UPDATED)
    The prevalence of employing attention mechanisms has brought along concerns on the interpretability of attention distributions. Although it provides insights about how a model is operating, utilizing attention as the explanation of model predictions is still highly dubious. The community is still seeking more interpretable strategies for better identifying local active regions that contribute the most to the final decision. To improve the interpretability of existing attention models, we propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures the task-relevant human-interpretable information. The target model is first distilled to have higher-resolution intermediate feature maps. From which, representative features are then grouped based on local pairwise feature similarity, to produce finer-grained, more precise attention maps highlighting task-relevant parts of the input. The obtained attention maps are ranked according to the activity level of the compound feature, which provides information regarding the important level of the highlighted regions. The proposed model can be easily adapted in a wide variety of modern deep models, where classification is involved. Extensive quantitative and qualitative experiments showcase more comprehensive and accurate visual explanations compared to state-of-the-art attention models and visualizations methods across multiple tasks including fine-grained image classification, few-shot classification, and person re-identification, without compromising the classification accuracy. The proposed visualization model sheds imperative light on how neural networks `pay their attention' differently in different tasks.
    Identifiable Energy-based Representations: An Application to Estimating Heterogeneous Causal Effects. (arXiv:2108.03039v3 [cs.LG] UPDATED)
    Conditional average treatment effects (CATEs) allow us to understand the effect heterogeneity across a large population of individuals. However, typical CATE learners assume all confounding variables are measured in order for the CATE to be identifiable. This requirement can be satisfied by collecting many variables, at the expense of increased sample complexity for estimating CATEs. To combat this, we propose an energy-based model (EBM) that learns a low-dimensional representation of the variables by employing a noise contrastive loss function. With our EBM we introduce a preprocessing step that alleviates the dimensionality curse for any existing learner developed for estimating CATEs. We prove that our EBM keeps the representations partially identifiable up to some universal constant, as well as having universal approximation capability. These properties enable the representations to converge and keep the CATE estimates consistent. Experiments demonstrate the convergence of the representations, as well as show that estimating CATEs on our representations performs better than on the variables or the representations obtained through other dimensionality reduction methods.
    UncertaintyFuseNet: Robust Uncertainty-aware Hierarchical Feature Fusion Model with Ensemble Monte Carlo Dropout for COVID-19 Detection. (arXiv:2105.08590v3 [eess.IV] UPDATED)
    The COVID-19 (Coronavirus disease 2019) pandemic has become a major global threat to human health and well-being. Thus, the development of computer-aided detection (CAD) systems that are capable to accurately distinguish COVID-19 from other diseases using chest computed tomography (CT) and X-ray data is of immediate priority. Such automatic systems are usually based on traditional machine learning or deep learning methods. Differently from most of existing studies, which used either CT scan or X-ray images in COVID-19-case classification, we present a simple but efficient deep learning feature fusion model, called UncertaintyFuseNet, which is able to classify accurately large datasets of both of these types of images. We argue that the uncertainty of the model's predictions should be taken into account in the learning process, even though most of existing studies have overlooked it. We quantify the prediction uncertainty in our feature fusion model using effective Ensemble MC Dropout (EMCD) technique. A comprehensive simulation study has been conducted to compare the results of our new model to the existing approaches, evaluating the performance of competing models in terms of Precision, Recall, F-Measure, Accuracy and ROC curves. The obtained results prove the efficiency of our model which provided the prediction accuracy of 99.08\% and 96.35\% for the considered CT scan and X-ray datasets, respectively. Moreover, our UncertaintyFuseNet model was generally robust to noise and performed well with previously unseen data. The source code of our implementation is freely available at: https://github.com/moloud1987/UncertaintyFuseNet-for-COVID-19-Classification.
    Improved Automated Machine Learning from Transfer Learning. (arXiv:2103.00241v6 [cs.LG] UPDATED)
    In this paper, we propose a neural architecture search framework based on a similarity measure between some baseline tasks and a target task. We first define the notion of the task similarity based on the log-determinant of the Fisher Information matrix. Next, we compute the task similarity from each of the baseline tasks to the target task. By utilizing the relation between a target and a set of learned baseline tasks, the search space of architectures for the target task can be significantly reduced, making the discovery of the best candidates in the set of possible architectures tractable and efficient, in terms of GPU days. This method eliminates the requirement for training the networks from scratch for a given target task as well as introducing the bias in the initialization of the search space from the human domain.
    XDO: A Double Oracle Algorithm for Extensive-Form Games. (arXiv:2103.06426v2 [cs.GT] UPDATED)
    Policy Space Response Oracles (PSRO) is a reinforcement learning (RL) algorithm for two-player zero-sum games that has been empirically shown to find approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to an approximate Nash equilibrium and can handle continuous actions, it may take an exponential number of iterations as the number of information states (infostates) grows. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm for two-player zero-sum games that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations an order of magnitude smaller than PSRO. Experiments on a modified Leduc poker game and Oshi-Zumo show that tabular XDO achieves a lower exploitability than CFR with the same amount of computation. We also find that NXDO outperforms PSRO and NFSP on a sequential multidimensional continuous-action game. NXDO is the first deep RL method that can find an approximate Nash equilibrium in high-dimensional continuous-action sequential games. Experiment code is available at https://github.com/indylab/nxdo.
    Task-Based Information Compression for Multi-Agent Communication Problems with Channel Rate Constraints. (arXiv:2005.14220v4 [cs.IT] UPDATED)
    {We investigate the communications design in a multiagent system (MAS) in which agents cooperate to maximize the averaged sum of discounted one-stage rewards of a collaborative task. Due to the limited communication rate between the agents, each agent should efficiently represent its local observation and communicate an abstract version of the observations to improve the collaborative task performance. We first show that this problem is equivalent to a form of rate-distortion problem which we call task-based information compression (TBIC). We then introduce the state-aggregation for information compression algorithm (SAIC) to solve the formulated TBIC problem. It is shown that SAIC is able to achieve near-optimal performance in terms of the achieved sum of discounted rewards. The proposed algorithm is applied to a rendezvous problem and its performance is compared with several benchmarks. Numerical experiments confirm the superiority of the proposed algorithm.
    Bootstrapping Semantic Segmentation with Regional Contrast. (arXiv:2104.04465v4 [cs.CV] UPDATED)
    We present ReCo, a contrastive learning framework designed at a regional level to assist learning in semantic segmentation. ReCo performs semi-supervised or supervised pixel-level contrastive learning on a sparse set of hard negative pixels, with minimal additional memory footprint. ReCo is easy to implement, being built on top of off-the-shelf segmentation networks, and consistently improves performance in both semi-supervised and supervised semantic segmentation methods, achieving smoother segmentation boundaries and faster convergence. The strongest effect is in semi-supervised learning with very few labels. With ReCo, we achieve high-quality semantic segmentation models, requiring only 5 examples of each semantic class. Code is available at https://github.com/lorenmt/reco.
    Structured Stochastic Gradient MCMC. (arXiv:2107.09028v3 [cs.LG] UPDATED)
    Stochastic gradient Markov Chain Monte Carlo (SGMCMC) is considered the gold standard for Bayesian inference in large-scale models, such as Bayesian neural networks. Since practitioners face speed versus accuracy tradeoffs in these models, variational inference (VI) is often the preferable option. Unfortunately, VI makes strong assumptions on both the factorization and functional form of the posterior. In this work, we propose a new non-parametric variational approximation that makes no assumptions about the approximate posterior's functional form and allows practitioners to specify the exact dependencies the algorithm should respect or break. The approach relies on a new Langevin-type algorithm that operates on a modified energy function, where parts of the latent variables are averaged over samples from earlier iterations of the Markov chain. This way, statistical dependencies can be broken in a controlled way, allowing the chain to mix faster. This scheme can be further modified in a "dropout" manner, leading to even more scalability. We test our scheme for ResNet-20 on CIFAR-10, SVHN, and FMNIST. In all cases, we find improvements in convergence speed and/or final accuracy compared to SG-MCMC and VI.
    TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. (arXiv:2005.05144v4 [eess.AS] UPDATED)
    Speech provides a natural way for human-computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level when in terms of resources and systems for speech synthesis. This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. Such dataset has 10.5 hours from a single speaker, from which a Tacotron 2 model with the RTISI-LA vocoder presented the best performance, achieving a 4.03 MOS value. The obtained results are comparable to related works covering English language and the state-of-the-art in Portuguese.
    On the Hardness of PAC-learning Stabilizer States with Noise. (arXiv:2102.05174v3 [quant-ph] UPDATED)
    We consider the problem of learning stabilizer states with noise in the Probably Approximately Correct (PAC) framework of Aaronson (2007) for learning quantum states. In the noiseless setting, an algorithm for this problem was recently given by Rocchetto (2018), but the noisy case was left open. Motivated by approaches to noise tolerance from classical learning theory, we introduce the Statistical Query (SQ) model for PAC-learning quantum states, and prove that algorithms in this model are indeed resilient to common forms of noise, including classification and depolarizing noise. We prove an exponential lower bound on learning stabilizer states in the SQ model. Even outside the SQ model, we prove that learning stabilizer states with noise is in general as hard as Learning Parity with Noise (LPN) using classical examples. Our results position the problem of learning stabilizer states as a natural quantum analogue of the classical problem of learning parities: easy in the noiseless setting, but seemingly intractable even with simple forms of noise.
    Thresholded Lasso Bandit. (arXiv:2010.11994v3 [stat.ML] UPDATED)
    In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension $d$, but where the reward function depends on a few, say $s_0\ll d$, of these features only. We present Thresholded Lasso bandit, an algorithm that (i) estimates the vector defining the reward function as well as its sparse support, i.e., significant feature elements, using the Lasso framework with thresholding, and (ii) selects an arm greedily according to this estimate projected on its support. The algorithm does not require prior knowledge of the sparsity index $s_0$ and can be parameter-free. For this simple algorithm, we establish non-asymptotic regret upper bounds scaling as $\mathcal{O}( \log d + \sqrt{T} )$ in general, and as $\mathcal{O}( \log d + \log T)$ under the so-called margin condition (a probabilistic condition on the separation of the arm rewards). The regret of previous algorithms scales as $\mathcal{O}( \log d + \sqrt{T \log (d T)})$ and $\mathcal{O}( \log T \log d)$ in the two settings, respectively. Through numerical experiments, we confirm that our algorithm outperforms existing methods.
    Unbiased Asymmetric Reinforcement Learning under Partial Observability. (arXiv:2105.11674v2 [cs.LG] UPDATED)
    In partially observable reinforcement learning, offline training gives access to latent information which is not available during online training and/or execution, such as the system state. Asymmetric actor-critic methods exploit such information by training a history-based policy via a state-based critic. However, many asymmetric methods lack theoretical foundation, and are only evaluated on limited domains. We examine the theory of asymmetric actor-critic methods which use state-based critics, and expose fundamental issues which undermine the validity of a common variant, and limit its ability to address partial observability. We propose an unbiased asymmetric actor-critic variant which is able to exploit state information while remaining theoretically sound, maintaining the validity of the policy gradient theorem, and introducing no bias and relatively low variance into the training process. An empirical evaluation performed on domains which exhibit significant partial observability confirms our analysis, demonstrating that unbiased asymmetric actor-critic converges to better policies and/or faster than symmetric and biased asymmetric baselines.
    Information Directed Reward Learning for Reinforcement Learning. (arXiv:2102.12466v3 [cs.LG] UPDATED)
    For many reinforcement learning (RL) applications, specifying a reward is difficult. This paper considers an RL setting where the agent obtains information about the reward only by querying an expert that can, for example, evaluate individual states or provide binary preferences over trajectories. From such expensive feedback, we aim to learn a model of the reward that allows standard RL algorithms to achieve high expected returns with as few expert queries as possible. To this end, we propose Information Directed Reward Learning (IDRL), which uses a Bayesian model of the reward and selects queries that maximize the information gain about the difference in return between plausibly optimal policies. In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types. Moreover, it achieves similar or better performance with significantly fewer queries by shifting the focus from reducing the reward approximation error to improving the policy induced by the reward model. We support our findings with extensive evaluations in multiple environments and with different query types.
    Causal Bias Quantification for Continuous Treatments. (arXiv:2106.09762v2 [stat.ME] UPDATED)
    We extend the definition of the marginal causal effect to the continuous treatment setting and develop a novel characterization of causal bias in the framework of structural causal models. We prove that our derived bias expression is zero if, and only if, the causal effect is identifiable via covariate adjustment. We show that under some restrictions on the structural equations, the causal bias can be estimated efficiently and allows for causal regularization of predictive probabilistic models. We demonstrate the effectiveness of our method for causal bias quantification in various settings where (not) controlling for certain covariates would introduce causal bias.
    Evaluating Disentanglement of Structured Representations. (arXiv:2101.04041v3 [cs.LG] UPDATED)
    We introduce the first metric for evaluating disentanglement at individual hierarchy levels of a structured latent representation. Applied to object-centric generative models, this offers a systematic, unified approach to evaluating (i) object separation between latent slots (ii) disentanglement of object properties inside individual slots (iii) disentanglement of intrinsic and extrinsic object properties. We theoretically show that for structured representations, our framework gives stronger guarantees of selecting a good model than previous disentanglement metrics. Experimentally, we demonstrate that viewing object compositionality as a disentanglement problem addresses several issues with prior visual metrics of object separation. As a core technical component, we present the first representation probing algorithm handling slot permutation invariance.
    Fast and accurate optimization on the orthogonal manifold without retraction. (arXiv:2102.07432v2 [stat.ML] UPDATED)
    We consider the problem of minimizing a function over the manifold of orthogonal matrices. The majority of algorithms for this problem compute a direction in the tangent space, and then use a retraction to move in that direction while staying on the manifold. Unfortunately, the numerical computation of retractions on the orthogonal manifold always involves some expensive linear algebra operation, such as matrix inversion, exponential or square-root. These operations quickly become expensive as the dimension of the matrices grows. To bypass this limitation, we propose the landing algorithm which does not use retractions. The algorithm is not constrained to stay on the manifold but its evolution is driven by a potential energy which progressively attracts it towards the manifold. One iteration of the landing algorithm only involves matrix multiplications, which makes it cheap compared to its retraction counterparts. We provide an analysis of the convergence of the algorithm, and demonstrate its promises on large-scale and deep learning problems, where it is faster and less prone to numerical errors than retraction-based methods.
    Integral Probability Metric based Regularization for Optimal Transport. (arXiv:2011.05001v3 [cs.LG] UPDATED)
    Recently it has been shown that Maximum Mean Discrepancy (MMD) based regularization for Optimal Transport (OT) can improve the sample complexity of estimation. However, the metric and other important properties of such MMD-regularized OT formulations are not well-understood. In this work, we not only bridge this gap, but further study novel Integral Probability Metric (IPM) regularized p-Wasserstein style formulations. We present generic settings under which we prove that the proposed OT formulations induce novel metrics over measures. While some of these novel metrics can be interpreted as infimal convolutions of IPMs; interestingly, some others turn out to be IPM-analogues of the so-called generalized Wasserstein metrics and the Gaussian Hellinger-Kantorovich metrics. Finally, we present finite sample based simplified formulations for estimating the squared-MMD regularized metrics as well as the corresponding barycenter. We empirically validate our theoretical findings and show improvements in downstream applications.
    Encoding Weights of Irregular Sparsity for Fixed-to-Fixed Model Compression. (arXiv:2105.01869v2 [cs.LG] UPDATED)
    Even though fine-grained pruning techniques achieve a high compression ratio, conventional sparsity representations (such as CSR) associated with irregular sparsity degrade parallelism significantly. Practical pruning methods, thus, usually lower pruning rates (by structured pruning) to improve parallelism. In this paper, we study fixed-to-fixed (lossless) encoding architecture/algorithm to support fine-grained pruning methods such that sparse neural networks can be stored in a highly regular structure. We first estimate the maximum compression ratio of encoding-based compression using entropy. Then, as an effort to push the compression ratio to the theoretical maximum (by entropy), we propose a sequential fixed-to-fixed encoding scheme. We demonstrate that our proposed compression scheme achieves almost the maximum compression ratio for the Transformer and ResNet-50 pruned by various fine-grained pruning methods.
    Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations. (arXiv:2102.06559v4 [stat.ML] UPDATED)
    We perform scalable approximate inference in continuous-depth Bayesian neural networks. In this model class, uncertainty about separate weights in each layer gives hidden units that follow a stochastic differential equation. We demonstrate gradient-based stochastic variational inference in this infinite-parameter setting, producing arbitrarily-flexible approximate posteriors. We also derive a novel gradient estimator that approaches zero variance as the approximate posterior over weights approaches the true posterior. This approach brings continuous-depth Bayesian neural nets to a competitive comparison against discrete-depth alternatives, while inheriting the memory-efficient training and tunable precision of Neural ODEs.
    Variational Inference as Iterative Projection in a Bayesian Hilbert Space. (arXiv:2005.07275v2 [cs.LG] UPDATED)
    Variational Bayesian inference is an important machine-learning tool that finds application from statistics to robotics. The goal is to find an approximate probability density function (PDF) from a chosen family that is in some sense 'closest' to the full Bayesian posterior. Closeness is typically defined through the selection of an appropriate loss functional such as the Kullback-Leibler (KL) divergence. In this paper, we explore a new formulation of variational inference by exploiting the fact that (most) PDFs are members of a Bayesian Hilbert space under careful definitions of vector addition, scalar multiplication and an inner product. We show that variational inference based on KL divergence then amounts to an iterative projection, in the Euclidean sense, of the Bayesian posterior onto a subspace corresponding to the selected approximation family. We work through the details of this general framework for the specific case of the Gaussian approximation family and show the equivalence to another Gaussian variational inference approach. We furthermore discuss the implications for systems that exhibit sparsity, which is handled naturally in Bayesian space, and give an example of a high-dimensional robotic state estimation problem that can be handled as a result. Finally, we provide some preliminary examples of how the approach could be applied to non-Gaussian inference.
    Adaptive Rational Activations to Boost Deep Reinforcement Learning. (arXiv:2102.09407v3 [cs.LG] UPDATED)
    Latest insights from biology show that intelligence not only emerges from the connections between neurons but that individual neurons shoulder more computational responsibility than previously anticipated. This perspective should be critical in the context of constantly changing distinct reinforcement learning environments, yet current approaches still primarily employ static activation functions. In this work, we motivate why rationals are suitable for adaptable activation functions and why their inclusion into neural networks is crucial. Inspired by recurrence in residual networks, we derive a condition under which rational units are closed under residual connections and formulate a naturally regularised version: the recurrent-rational. We demonstrate that equipping popular algorithms with (recurrent-)rational activations leads to consistent improvements on Atari games, especially turning simple DQN into a solid approach, competitive to DDQN and Rainbow.
    Provable tradeoffs in adversarially robust classification. (arXiv:2006.05161v5 [cs.LG] UPDATED)
    It is well known that machine learning methods can be vulnerable to adversarially-chosen perturbations of their inputs. Despite significant progress in the area, foundational open problems remain. In this paper, we address several key questions. We derive exact and approximate Bayes-optimal robust classifiers for the important setting of two- and three-class Gaussian classification problems with arbitrary imbalance, for $\ell_2$ and $\ell_\infty$ adversaries. In contrast to classical Bayes-optimal classifiers, determining the optimal decisions here cannot be made pointwise and new theoretical approaches are needed. We develop and leverage new tools, including recent breakthroughs from probability theory on robust isoperimetry, which, to our knowledge, have not yet been used in the area. Our results reveal fundamental tradeoffs between standard and robust accuracy that grow when data is imbalanced. We also show further results, including an analysis of classification calibration for convex losses in certain models, and finite sample rates for the robust risk.
    A new Sparse Auto-encoder based Framework using Grey Wolf Optimizer for Data Classification Problem. (arXiv:2201.12493v1 [cs.NE])
    One of the most important properties of deep auto-encoders (DAEs) is their capability to extract high level features from row data. Hence, especially recently, the autoencoders are preferred to be used in various classification problems such as image and voice recognition, computer security, medical data analysis, etc. Despite, its popularity and high performance, the training phase of autoencoders is still a challenging task, involving to select best parameters that let the model to approach optimal results. Different training approaches are applied to train sparse autoencoders. Previous studies and preliminary experiments reveal that those approaches may present remarkable results in same problems but also disappointing results can be obtained in other complex problems. Metaheuristic algorithms have emerged over the last two decades and are becoming an essential part of contemporary optimization techniques. Gray wolf optimization (GWO) is one of the current of those algorithms and is applied to train sparse auto-encoders for this study. This model is validated by employing several popular Gene expression databases. Results are compared with previous state-of-the art methods studied with the same data sets and also are compared with other popular metaheuristic algorithms, namely, Genetic Algorithms (GA), Particle Swarm Optimization (PSO) and Artificial Bee Colony (ABC). Results reveal that the performance of the trained model using GWO outperforms on both conventional models and models trained with most popular metaheuristic algorithms.
    Generating Adversarial Disturbances for Controller Verification. (arXiv:2012.06695v2 [cs.LG] UPDATED)
    We consider the problem of generating maximally adversarial disturbances for a given controller assuming only blackbox access to it. We propose an online learning approach to this problem that \emph{adaptively} generates disturbances based on control inputs chosen by the controller. The goal of the disturbance generator is to minimize \emph{regret} versus a benchmark disturbance-generating policy class, i.e., to maximize the cost incurred by the controller as well as possible compared to the best possible disturbance generator \emph{in hindsight} (chosen from a benchmark policy class). In the setting where the dynamics are linear and the costs are quadratic, we formulate our problem as an online trust region (OTR) problem with memory and present a new online learning algorithm (\emph{MOTR}) for this problem. We prove that this method competes with the best disturbance generator in hindsight (chosen from a rich class of benchmark policies that includes linear-dynamical disturbance generating policies). We demonstrate our approach on two simulated examples: (i) synthetically generated linear systems, and (ii) generating wind disturbances for the popular PX4 controller in the AirSim simulator. On these examples, we demonstrate that our approach outperforms several baseline approaches, including $H_{\infty}$ disturbance generation and gradient-based methods.
    Building Efficient CNNs Using Depthwise Convolutional Eigen-Filters (DeCEF). (arXiv:1910.09359v3 [cs.LG] UPDATED)
    Deep Convolutional Neural Networks (CNNs) have been widely used in various domains due to their impressive capabilities. These models are typically composed of a large number of 2D convolutional (Conv2D) layers with numerous trainable parameters. To reduce the complexity of a network, compression techniques can be applied. These methods typically rely on the analysis of trained deep learning models. However, in some applications, due to reasons such as particular data or system specifications and licensing restrictions, a pre-trained network may not be available. This would require the user to train a CNN from scratch. In this paper, we aim to find an alternative parameterization to Conv2D filters without relying on a pre-trained convolutional network. During the analysis, we observe that the effective rank of the vectorized Conv2D filters decreases with respect to the increasing depth in the network, which then leads to the implementation of the Depthwise Convolutional Eigen-Filter (DeCEF) layer. Essentially, a DeCEF layer is a low rank version of the Conv2D layer with significantly fewer trainable parameters and floating point operations (FLOPs). The way we define the effective rank is different from the previous work and it is easy to implement in any deep learning frameworks. To evaluate the effectiveness of DeCEF, experiments are conducted on the benchmark datasets CIFAR-10 and ImageNet using various network architectures. The results have shown a similar or higher accuracy and robustness using about 2/3 of the original parameters and reducing the number of FLOPs to 2/3 of the base network, which is then compared to the state-of-the-art techniques.
    Buffered Asynchronous SGD for Byzantine Learning. (arXiv:2003.00937v4 [cs.LG] UPDATED)
    Distributed learning has become a hot research topic due to its wide application in clusterbased large-scale learning, federated learning, edge computing and so on. Most traditional distributed learning methods typically assume no failure or attack. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which are impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist non-omniscient attacks without storing any instances on server. Furthermore, we also propose an improved variant of BASGD, called BASGD with momentum (BASGDm), by introducing momentum into BASGD. BASGDm can resist both non-omniscient and omniscient attacks. Compared with those methods which need to store instances on server, BASGD and BASGDm have a wider scope of application. Both BASGD and BASGDm are compatible with various aggregation rules. Moreover, both BASGD and BASGDm are proved to be convergent and be able to resist failure or attack. Empirical results show that our methods significantly outperform existing ABL baselines when there exists failure or attack on workers.
    TPC: Transformation-Specific Smoothing for Point Cloud Models. (arXiv:2201.12733v1 [cs.CV])
    Point cloud models with neural network architectures have achieved great success and have been widely used in safety-critical applications, such as Lidar-based recognition systems in autonomous vehicles. However, such models are shown vulnerable against adversarial attacks which aim to apply stealthy semantic transformations such as rotation and tapering to mislead model predictions. In this paper, we propose a transformation-specific smoothing framework TPC, which provides tight and scalable robustness guarantees for point cloud models against semantic transformation attacks. We first categorize common 3D transformations into three categories: additive (e.g., shearing), composable (e.g., rotation), and indirectly composable (e.g., tapering), and we present generic robustness certification strategies for all categories respectively. We then specify unique certification protocols for a range of specific semantic transformations and their compositions. Extensive experiments on several common 3D transformations show that TPC significantly outperforms the state of the art. For example, our framework boosts the certified accuracy against twisting transformation along z-axis (within 20$^\circ$) from 20.3$\%$ to 83.8$\%$.
    FSOCO: The Formula Student Objects in Context Dataset. (arXiv:2012.07139v4 [cs.CV] UPDATED)
    This paper presents the FSOCO dataset, a collaborative dataset for vision-based cone detection systems in Formula Student Driverless competitions. It contains human annotated ground truth labels for both bounding boxes and instance-wise segmentation masks. The data buy-in philosophy of FSOCO asks student teams to contribute to the database first before being granted access ensuring continuous growth. By providing clear labeling guidelines and tools for a sophisticated raw image selection, new annotations are guaranteed to meet the desired quality. The effectiveness of the approach is shown by comparing prediction results of a network trained on FSOCO and its unregulated predecessor. The FSOCO dataset can be found at fsoco-dataset.com.
    Variational Policy Propagation for Multi-agent Reinforcement Learning. (arXiv:2004.08883v4 [cs.LG] UPDATED)
    We propose a \emph{collaborative} multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a \emph{joint} policy through the interactions over agents. We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively. We integrate the variational inference as special differentiable layers in policy such that the actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable. We evaluate our algorithm on several large scale challenging tasks and demonstrate that it outperforms previous state-of-the-arts.
    A Safety-Critical Decision Making and Control Framework Combining Machine Learning and Rule-based Algorithms. (arXiv:2201.12819v1 [cs.AI])
    While artificial-intelligence-based methods suffer from lack of transparency, rule-based methods dominate in safety-critical systems. Yet, the latter cannot compete with the first ones in robustness to multiple requirements, for instance, simultaneously addressing safety, comfort, and efficiency. Hence, to benefit from both methods they must be joined in a single system. This paper proposes a decision making and control framework, which profits from advantages of both the rule- and machine-learning-based techniques while compensating for their disadvantages. The proposed method embodies two controllers operating in parallel, called Safety and Learned. A rule-based switching logic selects one of the actions transmitted from both controllers. The Safety controller is prioritized every time, when the Learned one does not meet the safety constraint, and also directly participates in the safe Learned controller training. Decision making and control in autonomous driving is chosen as the system case study, where an autonomous vehicle learns a multi-task policy to safely cross an unprotected intersection. Multiple requirements (i.e., safety, efficiency, and comfort) are set for vehicle operation. A numerical simulation is performed for the proposed framework validation, where its ability to satisfy the requirements and robustness to changing environment is successfully demonstrated.
    Coordinate Descent Methods for Fractional Minimization. (arXiv:2201.12691v1 [math.OC])
    We consider a class of structured fractional minimization problems, in which the numerator part of the objective is the sum of a differentiable convex function and a convex nonsmooth function, while the denominator part is a concave or convex function. This problem is difficult to solve since it is nonconvex. By exploiting the structure of the problem, we propose two Coordinate Descent (CD) methods for solving this problem. One is applied to the original fractional function, the other is based on the associated parametric problem. The proposed methods iteratively solve a one-dimensional subproblem \textit{globally}, and they are guaranteed to converge to coordinate-wise stationary points. In the case of a convex denominator, we prove that the proposed CD methods using sequential nonconvex approximation find stronger stationary points than existing methods. Under suitable conditions, CD methods with an appropriate initialization converge linearly to the optimal point (also the coordinate-wise stationary point). In the case of a concave denominator, we show that the resulting problem is quasi-convex, and any critical point is a global minimum. We prove that the algorithms converge to the global optimal solution with a sublinear convergence rate. We demonstrate the applicability of the proposed methods to some machine learning and signal processing models. Our experiments on real-world data have shown that our method significantly and consistently outperforms existing methods in terms of accuracy.
    Graph Self-Attention for learning graph representation with Transformer. (arXiv:2201.12787v1 [cs.LG])
    We propose a novel Graph Self-Attention module to enable Transformer models to learn graph representation. We aim to incorporate graph information, on the attention map and hidden representations of Transformer. To this end, we propose context-aware attention which considers the interactions between query, key and graph information. Moreover, we propose graph-embedded value to encode the graph information on the hidden representation. Our extensive experiments and ablation studies validate that our method successfully encodes graph representation on Transformer architecture. Finally, our method achieves state-of-the-art performance on multiple benchmarks of graph representation learning, such as graph classification on images and molecules to graph regression on quantum chemistry.
    Deep Contrastive Learning is Provably (almost) Principal Component Analysis. (arXiv:2201.12680v1 [cs.LG])
    We show that Contrastive Learning (CL) under a family of loss functions (including InfoNCE) has a game-theoretical formulation, where the \emph{max player} finds representation to maximize contrastiveness, and the \emph{min player} puts weights on pairs of samples with similar representation. We show that the max player who does \emph{representation learning} reduces to Principal Component Analysis for deep linear network, and almost all local minima are global, recovering optimal PCA solutions. Experiments show that the formulation yields comparable (or better) performance on CIFAR10 and STL-10 when extending beyond InfoNCE, yielding novel contrastive losses. Furthermore, we extend our theoretical analysis to 2-layer ReLU networks, showing its difference from linear ones, and proving that feature composition is preferred over picking single dominant feature under strong augmentation.
    Achieving Efficient Distributed Machine Learning Using a Novel Non-Linear Class of Aggregation Functions. (arXiv:2201.12488v1 [cs.LG])
    Distributed machine learning (DML) over time-varying networks can be an enabler for emerging decentralized ML applications such as autonomous driving and drone fleeting. However, the commonly used weighted arithmetic mean model aggregation function in existing DML systems can result in high model loss, low model accuracy, and slow convergence speed over time-varying networks. To address this issue, in this paper, we propose a novel non-linear class of model aggregation functions to achieve efficient DML over time-varying networks. Instead of taking a linear aggregation of neighboring models as most existing studies do, our mechanism uses a nonlinear aggregation, a weighted power-p mean (WPM) where p is a positive odd integer, as the aggregation function of local models from neighbors. The subsequent optimizing steps are taken using mirror descent defined by a Bregman divergence that maintains convergence to optimality. In this paper, we analyze properties of the WPM and rigorously prove convergence properties of our aggregation mechanism. Additionally, through extensive experiments, we show that when p > 1, our design significantly improves the convergence speed of the model and the scalability of DML under time-varying networks compared with arithmetic mean aggregation functions, with little additional 26computation overhead.
    Do We Need to Penalize Variance of Losses for Learning with Label Noise?. (arXiv:2201.12739v1 [cs.LG])
    Algorithms which minimize the averaged loss have been widely designed for dealing with noisy labels. Intuitively, when there is a finite training sample, penalizing the variance of losses will improve the stability and generalization of the algorithms. Interestingly, we found that the variance should be increased for the problem of learning with noisy labels. Specifically, increasing the variance will boost the memorization effects and reduce the harmfulness of incorrect labels. By exploiting the label noise transition matrix, regularizers can be easily designed to reduce the variance of losses and be plugged in many existing algorithms. Empirically, the proposed method by increasing the variance of losses significantly improves the generalization ability of baselines on both synthetic and real-world datasets.
    AutoSNN: Towards Energy-Efficient Spiking Neural Networks. (arXiv:2201.12738v1 [cs.NE])
    Spiking neural networks (SNNs) that mimic information transmission in the brain can energy-efficiently process spatio-temporal information through discrete and sparse spikes, thereby receiving considerable attention. To improve accuracy and energy efficiency of SNNs, most previous studies have focused solely on training methods, and the effect of architecture has rarely been studied. We investigate the design choices used in the previous studies in terms of the accuracy and number of spikes and figure out that they are not best-suited for SNNs. To further improve the accuracy and reduce the spikes generated by SNNs, we propose a spike-aware neural architecture search framework called AutoSNN. We define a search space consisting of architectures without undesirable design choices. To enable the spike-aware architecture search, we introduce a fitness that considers both the accuracy and number of spikes. AutoSNN successfully searches for SNN architectures that outperform hand-crafted SNNs in accuracy and energy efficiency. We thoroughly demonstrate the effectiveness of AutoSNN on various datasets including neuromorphic datasets.
    ClassSPLOM -- A Scatterplot Matrix to Visualize Separation of Multiclass Multidimensional Data. (arXiv:2201.12822v1 [cs.HC])
    In multiclass classification of multidimensional data, the user wants to build a model of the classes to predict the label of unseen data. The model is trained on the data and tested on unseen data with known labels to evaluate its quality. The results are visualized as a confusion matrix which shows how many data labels have been predicted correctly or confused with other classes. The multidimensional nature of the data prevents the direct visualization of the classes so we design ClassSPLOM to give more perceptual insights about the classification results. It uses the Scatterplot Matrix (SPLOM) metaphor to visualize a Linear Discriminant Analysis projection of the data for each pair of classes and a set of Receiving Operating Curves to evaluate their trustworthiness. We illustrate ClassSPLOM on a use case in Arabic dialects identification.
    Task-Balanced Batch Normalization for Exemplar-based Class-Incremental Learning. (arXiv:2201.12559v1 [cs.CV])
    Batch Normalization (BN) is an essential layer for training neural network models in various computer vision tasks. It has been widely used in continual learning scenarios with little discussion, but we find that BN should be carefully applied, particularly for the exemplar memory based class incremental learning (CIL). We first analyze that the empirical mean and variance obtained for normalization in a BN layer become highly biased toward the current task. To tackle its significant problems in training and test phases, we propose Task-Balanced Batch Normalization (TBBN). Given each mini-batch imbalanced between the current and previous tasks, TBBN first reshapes and repeats the batch, calculating near task-balanced mean and variance. Second, we show that when the affine transformation parameters of BN are learned from a reshaped feature map, they become less-biased toward the current task. Based on our extensive CIL experiments with CIFAR-100 and ImageNet-100 datasets, we demonstrate that our TBBN is easily applicable to most of existing exemplar-based CIL algorithms, improving their performance by decreasing the forgetting on the previous tasks.
    Tensor Recovery Based on Tensor Equivalent Minimax-Concave Penalty. (arXiv:2201.12709v1 [cs.CV])
    Tensor recovery is an important problem in computer vision and machine learning. It usually uses the convex relaxation of tensor rank and $l_{0}$ norm, i.e., the nuclear norm and $l_{1}$ norm respectively, to solve the problem. It is well known that convex approximations produce biased estimators. In order to overcome this problem, a corresponding non-convex regularizer has been proposed to solve it. Inspired by matrix equivalent Minimax-Concave Penalty (EMCP), we propose and prove theorems of tensor equivalent Minimax-Concave Penalty (TEMCP). The tensor equivalent MCP (TEMCP) as a non-convex regularizer and the equivalent weighted tensor $\gamma$ norm (EWTGN) which can represent the low-rank part are obtained. Both of them can realize weight adaptive. At the same time, we propose two corresponding adaptive models for two classical tensor recovery problems, low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA), and the optimization algorithm is based on alternating direction multiplier (ADMM). This novel iterative adaptive algorithm can produce more accurate tensor recovery effect. For the tensor completion model, multispectral image (MSI), magnetic resonance imaging (MRI) and color video (CV) data sets are considered, while for the tensor robust principal component analysis model, hyperspectral image (HSI) denoising under gaussian noise plus salt and pepper noise is considered. The proposed algorithm is superior to the state-of-arts method, and the algorithm is guaranteed to meet the reduction and convergence through experiments.
    A Deep CNN Architecture with Novel Pooling Layer Applied to Two Sudanese Arabic Sentiment Datasets. (arXiv:2201.12664v1 [cs.CL])
    Arabic sentiment analysis has become an important research field in recent years. Initially, work focused on Modern Standard Arabic (MSA), which is the most widely-used form. Since then, work has been carried out on several different dialects, including Egyptian, Levantine and Moroccan. Moreover, a number of datasets have been created to support such work. However, up until now, less work has been carried out on Sudanese Arabic, a dialect which has 32 million speakers. In this paper, two new publicly available datasets are introduced, the 2-Class Sudanese Sentiment Dataset (SudSenti2) and the 3-Class Sudanese Sentiment Dataset (SudSenti3). Furthermore, a CNN architecture, SCM, is proposed, comprising five CNN layers together with a novel pooling layer, MMA, to extract the best features. This SCM+MMA model is applied to SudSenti2 and SudSenti3 with accuracies of 92.75% and 84.39%. Next, the model is compared to other deep learning classifiers and shown to be superior on these new datasets. Finally, the proposed model is applied to the existing Saudi Sentiment Dataset and to the MSA Hotel Arabic Review Dataset with accuracies 85.55% and 90.01%.
    Implicit Regularization Towards Rank Minimization in ReLU Networks. (arXiv:2201.12760v1 [cs.LG])
    We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank minimization heuristic. However, understanding to what extent this generalizes to nonlinear networks is an open problem. In this paper, we focus on nonlinear ReLU networks, providing several new positive and negative results. On the negative side, we prove (and demonstrate empirically) that, unlike the linear case, GF on ReLU networks may no longer tend to minimize ranks, in a rather strong sense (even approximately, for "most" datasets of size 2). On the positive side, we reveal that ReLU networks of sufficient depth are provably biased towards low-rank solutions in several reasonable settings.
    Why the Rich Get Richer? On the Balancedness of Random Partition Models. (arXiv:2201.12697v1 [stat.ML])
    Random partition models are widely used in Bayesian methods for various clustering tasks, such as mixture models, topic models, and community detection problems. While the number of clusters induced by random partition models has been studied extensively, another important model property regarding the balancedness of cluster sizes has been largely neglected. We formulate a framework to define and theoretically study the balancedness of exchangeable random partition models, by analyzing how a model assigns probabilities to partitions with different levels of balancedness. We demonstrate that the "rich-get-richer" characteristic of many existing popular random partition models is an inevitable consequence of two common assumptions: product-form exchangeability and projectivity. We propose a principled way to compare the balancedness of random partition models, which gives a better understanding of what model works better and what doesn't for different applications. We also introduce the "rich-get-poorer" random partition models and illustrate their application to entity resolution tasks.
    Few-Shot Transfer Learning for Device-Free Fingerprinting Indoor Localization. (arXiv:2201.12656v1 [eess.SP])
    Device-free wireless indoor localization is an essential technology for the Internet of Things (IoT), and fingerprint-based methods are widely used. A common challenge to fingerprint-based methods is data collection and labeling. This paper proposes a few-shot transfer learning system that uses only a small amount of labeled data from the current environment and reuses a large amount of existing labeled data previously collected in other environments, thereby significantly reducing the data collection and labeling cost for localization in each new environment. The core method lies in graph neural network (GNN) based few-shot transfer learning and its modifications. Experimental results conducted on real-world environments show that the proposed system achieves comparable performance to a convolutional neural network (CNN) model, with 40 times fewer labeled data.
    FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. (arXiv:2201.12740v1 [cs.LG])
    Although Transformer-based methods have significantly improved state-of-the-art results for long-term series forecasting, they are not only computationally expensive but more importantly, are unable to capture the global view of time series (e.g. overall trend). To address these problems, we propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series while Transformers capture more detailed structures. To further enhance the performance of Transformer for long-term prediction, we exploit the fact that most time series tend to have a sparse representation in well-known basis such as Fourier transform, and develop a frequency enhanced Transformer. Besides being more effective, the proposed method, termed as Frequency Enhanced Decomposed Transformer ({\bf FEDformer}), is more efficient than standard Transformer with a linear complexity to the sequence length. Our empirical studies with six benchmark datasets show that compared with state-of-the-art methods, FEDformer can reduce prediction error by $14.8\%$ and $22.6\%$ for multivariate and univariate time series, respectively. the code will be released soon.
    Geometry- and Accuracy-Preserving Random Forest Proximities. (arXiv:2201.12682v1 [stat.ML])
    Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest which measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly match the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
    No-Regret Learning in Time-Varying Zero-Sum Games. (arXiv:2201.12736v1 [cs.LG])
    Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning. We consider a variant of this problem where the game payoff matrix changes over time, possibly in an adversarial manner. We first present three performance measures to guide the algorithmic design for this problem: 1) the well-studied individual regret, 2) an extension of duality gap, and 3) a new measure called dynamic Nash Equilibrium regret, which quantifies the cumulative difference between the player's payoff and the minimax game value. Next, we develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under all these three performance measures. These guarantees are adaptive to different non-stationarity measures of the payoff matrices and, importantly, recover the best known results when the payoff matrix is fixed. Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property, along with several novel ingredients specifically designed for the time-varying game setting. Empirical results further validate the effectiveness of our algorithm.
    Bayesian Optimization For Multi-Objective Mixed-Variable Problems. (arXiv:2201.12767v1 [cs.LG])
    Optimizing multiple, non-preferential objectives for mixed-variable, expensive black-box problems is important in many areas of engineering and science. The expensive, noisy black-box nature of these problems makes them ideal candidates for Bayesian optimization (BO). Mixed-variable and multi-objective problems, however, are a challenge due to the BO's underlying smooth Gaussian process surrogate model. Current multi-objective BO algorithms cannot deal with mixed-variable problems. We present MixMOBO, the first mixed variable multi-objective Bayesian optimization framework for such problems. Using a genetic algorithm to sample the surrogate surface, optimal Pareto-fronts for multi-objective, mixed-variable design spaces can be found efficiently while ensuring diverse solutions. The method is sufficiently flexible to incorporate many different kernels and acquisition functions, including those that were developed for mixed-variable or multi-objective problems by other authors. We also present HedgeMO, a modified Hedge strategy that uses a portfolio of acquisition functions in multi-objective problems. We present a new acquisition function SMC. We show that MixMOBO performs well against other mixed-variable algorithms on synthetic problems. We apply MixMOBO to the real-world design of an architected material and show that our optimal design, which was experimentally fabricated and validated, has a normalized strain energy density $10^4$ times greater than existing structures.
    Similarity and Generalization: From Noise to Corruption. (arXiv:2201.12803v1 [cs.LG])
    Contrastive learning aims to extract distinctive features from data by finding an embedding representation where similar samples are close to each other, and different ones are far apart. We study generalization in contrastive learning, focusing on its simplest representative: Siamese Neural Networks (SNNs). We show that Double Descent also appears in SNNs and is exacerbated by noise. We point out that SNNs can be affected by two distinct sources of noise: Pair Label Noise (PLN) and Single Label Noise (SLN). The effect of SLN is asymmetric, but it preserves similarity relations, while PLN is symmetric but breaks transitivity. We show that the dataset topology crucially affects generalization. While sparse datasets show the same performances under SLN and PLN for an equal amount of noise, SLN outperforms PLN in the overparametrized region in dense datasets. Indeed, in this regime, PLN similarity violation becomes macroscopical, corrupting the dataset to the point where complete overfitting cannot be achieved. We call this phenomenon Density-Induced Break of Similarity (DIBS). We also probe the equivalence between online optimization and offline generalization for similarity tasks. We observe that an online/offline correspondence in similarity learning can be affected by both the network architecture and label noise.
    Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity. (arXiv:2201.12667v1 [cs.DC])
    More than 70% of cloud computing is paid for but sits idle. A large fraction of these idle compute are cheap CPUs with few cores that are not utilized during the less busy hours. This paper aims to enable those CPU cycles to train heavyweight AI models. Our goal is against mainstream frameworks, which focus on leveraging expensive specialized ultra-high bandwidth interconnect to address the communication bottleneck in distributed neural network training. This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth. We build upon the adaptive sparse training framework introduced by the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we demonstrate several orders of magnitude faster model parallel training than Horovod, the main engine behind most commercial software. We show that with reduced communication, due to sparsity, we can train close to a billion parameter model on simple 4-16 core CPU nodes connected by basic low bandwidth interconnect. Moreover, the training time is at par with some of the best hardware accelerators.
    A Stochastic Bundle Method for Interpolating Networks. (arXiv:2201.12678v1 [cs.LG])
    We propose a novel method for training deep neural networks that are capable of interpolation, that is, driving the empirical loss to zero. At each iteration, our method constructs a stochastic approximation of the learning objective. The approximation, known as a bundle, is a pointwise maximum of linear functions. Our bundle contains a constant function that lower bounds the empirical loss. This enables us to compute an automatic adaptive learning rate, thereby providing an accurate solution. In addition, our bundle includes linear approximations computed at the current iterate and other linear estimates of the DNN parameters. The use of these additional approximations makes our method significantly more robust to its hyperparameters. Based on its desirable empirical properties, we term our method Bundle Optimisation for Robust and Accurate Training (BORAT). In order to operationalise BORAT, we design a novel algorithm for optimising the bundle approximation efficiently at each iteration. We establish the theoretical convergence of BORAT in both convex and non-convex settings. Using standard publicly available data sets, we provide a thorough comparison of BORAT to other single hyperparameter optimisation algorithms. Our experiments demonstrate BORAT matches the state-of-the-art generalisation performance for these methods and is the most robust.
    Co-Regularized Adversarial Learning for Multi-Domain Text Classification. (arXiv:2201.12796v1 [cs.LG])
    Multi-domain text classification (MDTC) aims to leverage all available resources from multiple domains to learn a predictive model that can generalize well on these domains. Recently, many MDTC methods adopt adversarial learning, shared-private paradigm, and entropy minimization to yield state-of-the-art results. However, these approaches face three issues: (1) Minimizing domain divergence can not fully guarantee the success of domain alignment; (2) Aligning marginal feature distributions can not fully guarantee the discriminability of the learned features; (3) Standard entropy minimization may make the predictions on unlabeled data over-confident, deteriorating the discriminability of the learned features. In order to address the above issues, we propose a co-regularized adversarial learning (CRAL) mechanism for MDTC. This approach constructs two diverse shared latent spaces, performs domain alignment in each of them, and punishes the disagreements of these two alignments with respect to the predictions on unlabeled data. Moreover, virtual adversarial training (VAT) with entropy minimization is incorporated to impose consistency regularization to the CRAL method. Experiments show that our model outperforms state-of-the-art methods on two MDTC benchmarks.
    Training Thinner and Deeper Neural Networks: Jumpstart Regularization. (arXiv:2201.12795v1 [cs.LG])
    Neural networks are more expressive when they have multiple layers. In turn, conventional training methods are only successful if the depth does not lead to numerical issues such as exploding or vanishing gradients, which occur less frequently when the layers are sufficiently wide. However, increasing width to attain greater depth entails the use of heavier computational resources and leads to overparameterized models. These subsequent issues have been partially addressed by model compression methods such as quantization and pruning, some of which relying on normalization-based regularization of the loss function to make the effect of most parameters negligible. In this work, we propose instead to use regularization for preventing neurons from dying or becoming linear, a technique which we denote as jumpstart regularization. In comparison to conventional training, we obtain neural networks that are thinner, deeper, and - most importantly - more parameter-efficient.
    Faster Convergence of Local SGD for Over-Parameterized Models. (arXiv:2201.12719v1 [cs.LG])
    Modern machine learning architectures are often highly expressive. They are usually over-parameterized and can interpolate the data by driving the empirical loss close to zero. We analyze the convergence of Local SGD (or FedAvg) for such over-parameterized models in the heterogeneous data setting and improve upon the existing literature by establishing the following convergence rates. We show an error bound of $\O(\exp(-T))$ for strongly-convex loss functions, where $T$ is the total number of iterations. For general convex loss functions, we establish an error bound of $\O(1/T)$ under a mild data similarity assumption and an error bound of $\O(K/T)$ otherwise, where $K$ is the number of local steps. We also extend our results for non-convex loss functions by proving an error bound of $\O(K/T)$. Before our work, the best-known convergence rate for strongly-convex loss functions was $\O(\exp(-T/K))$, and none existed for general convex or non-convex loss functions under the overparameterized setting. We complete our results by providing problem instances in which such convergence rates are tight to a constant factor under a reasonably small stepsize scheme. Finally, we validate our theoretical results using numerical experiments on real and synthetic data.
    Image Classification using Graph Neural Network and Multiscale Wavelet Superpixels. (arXiv:2201.12633v1 [cs.CV])
    Prior studies using graph neural networks (GNNs) for image classification have focused on graphs generated from a regular grid of pixels or similar-sized superpixels. In the latter, a single target number of superpixels is defined for an entire dataset irrespective of differences across images and their intrinsic multiscale structure. On the contrary, this study investigates image classification using graphs generated from an image-specific number of multiscale superpixels. We propose WaveMesh, a new wavelet-based superpixeling algorithm, where the number and sizes of superpixels in an image are systematically computed based on its content. WaveMesh superpixel graphs are structurally different from similar-sized superpixel graphs. We use SplineCNN, a state-of-the-art network for image graph classification, to compare WaveMesh and similar-sized superpixels. Using SplineCNN, we perform extensive experiments on three benchmark datasets under three local-pooling settings: 1) no pooling, 2) GraclusPool, and 3) WavePool, a novel spatially heterogeneous pooling scheme tailored to WaveMesh superpixels. Our experiments demonstrate that SplineCNN learns from multiscale WaveMesh superpixels on-par with similar-sized superpixels. In all WaveMesh experiments, GraclusPool performs poorer than no pooling / WavePool, indicating that poor choice of pooling can result in inferior performance while learning from multiscale superpixels.
    Stochastic Neural Networks with Infinite Width are Deterministic. (arXiv:2201.12724v1 [cs.LG])
    This work theoretically studies stochastic neural networks, a main type of neural network in use. Specifically, we prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Two common examples that our theory applies to are neural networks with dropout and variational autoencoders. Our result helps better understand how stochasticity affects the learning of neural networks and thus design better architectures for practical problems.
    Fast Differentiable Matrix Square Root and Inverse Square Root. (arXiv:2201.12543v1 [cs.CV])
    Computing the matrix square root and its inverse in a differentiable manner is important in a variety of computer vision tasks. Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution. However, both methods are not computationally efficient enough in either the forward pass or the backward pass. In this paper, we propose two more efficient variants to compute the differentiable matrix square root and the inverse square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Pad\'e Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration. Moreover, we validate the effectiveness of our methods in several real-world applications, including de-correlated batch normalization, second-order vision transformer, global covariance pooling for large-scale and fine-grained recognition, attentive covariance pooling for video recognition, and neural style transfer. The experimental results demonstrate that our methods can also achieve competitive and even slightly better performances. The Pytorch implementation is available at \href{https://github.com/KingJamesSong/FastDifferentiableMatSqrt}{https://github.com/KingJamesSong/FastDifferentiableMatSqrt}.
    The KFIoU Loss for Rotated Object Detection. (arXiv:2201.12558v1 [cs.CV])
    Differing from the well-developed horizontal object detection area whereby the computing-friendly IoU based loss is readily adopted and well fits with the detection metrics. In contrast, rotation detectors often involve a more complicated loss based on SkewIoU which is unfriendly to gradient-based training. In this paper, we argue that one effective alternative is to devise an approximate loss who can achieve trend-level alignment with SkewIoU loss instead of the strict value-level identity. Specifically, we model the objects as Gaussian distribution and adopt Kalman filter to inherently mimic the mechanism of SkewIoU by its definition, and show its alignment with the SkewIoU at trend-level. This is in contrast to recent Gaussian modeling based rotation detectors e.g. GWD, KLD that involves a human-specified distribution distance metric which requires additional hyperparameter tuning. The resulting new loss called KFIoU is easier to implement and works better compared with exact SkewIoU, thanks to its full differentiability and ability to handle the non-overlapping cases. We further extend our technique to the 3-D case which also suffers from the same issues as 2-D detection. Extensive results on various public datasets (2-D/3-D, aerial/text/face images) with different base detectors show the effectiveness of our approach.
    Decepticons: Corrupted Transformers Breach Privacy in Federated Learning for Language Models. (arXiv:2201.12675v1 [cs.LG])
    A central tenet of Federated learning (FL), which trains models without centralizing user data, is privacy. However, previous work has shown that the gradient updates used in FL can leak user information. While the most industrial uses of FL are for text applications (e.g. keystroke prediction), nearly all attacks on FL privacy have focused on simple image classifiers. We propose a novel attack that reveals private user text by deploying malicious parameter vectors, and which succeeds even with mini-batches, multiple users, and long sequences. Unlike previous attacks on FL, the attack exploits characteristics of both the Transformer architecture and the token embedding, separately extracting tokens and positional embeddings to retrieve high-fidelity text. This work suggests that FL on text, which has historically been resistant to privacy attacks, is far more vulnerable than previously thought.
    SMGRL: A Scalable Multi-resolution Graph Representation Learning Framework. (arXiv:2201.12670v1 [cs.LG])
    Graph convolutional networks (GCNs) allow us to learn topologically-aware node embeddings, which can be useful for classification or link prediction. However, by construction, they lack positional awareness and are unable to capture long-range dependencies without adding additional layers -- which in turn leads to over-smoothing and increased time and space complexity. Further, the complex dependencies between nodes make mini-batching challenging, limiting their applicability to large graphs. This paper proposes a Scalable Multi-resolution Graph Representation Learning (SMGRL) framework that enables us to learn multi-resolution node embeddings efficiently. Our framework is model-agnostic and can be applied to any existing GCN model. We dramatically reduce training costs by training only on a reduced-dimension coarsening of the original graph, then exploit self-similarity to apply the resulting algorithm at multiple resolutions. Inference of these multi-resolution embeddings can be distributed across multiple machines to reduce computational and memory requirements further. The resulting multi-resolution embeddings can be aggregated to yield high-quality node embeddings that capture both long- and short-range dependencies between nodes. Our experiments show that this leads to improved classification accuracy, without incurring high computational costs.
    ApolloRL: a Reinforcement Learning Platform for Autonomous Driving. (arXiv:2201.12609v1 [cs.RO])
    We introduce ApolloRL, an open platform for research in reinforcement learning for autonomous driving. The platform provides a complete closed-loop pipeline with training, simulation, and evaluation components. It comes with 300 hours of real-world data in driving scenarios and popular baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) agents. We elaborate in this paper on the architecture and the environment defined in the platform. In addition, we discuss the performance of the baseline agents in the ApolloRL environment.
    Rewiring with Positional Encodings for Graph Neural Networks. (arXiv:2201.12674v1 [cs.LG])
    Several recent works use positional encodings to extend the receptive fields of graph neural network (GNN) layers equipped with attention mechanisms. These techniques, however, extend receptive fields to the complete graph, at substantial computational cost and risking a change in the inductive biases of conventional GNNs, or require complex architecture adjustments. As a conservative alternative, we use positional encodings to expand receptive fields to any r-ring. Our method augments the input graph with additional nodes/edges and uses positional encodings as node and/or edge features. Thus, it is compatible with many existing GNN architectures. We also provide examples of positional encodings that are non-invasive, i.e., there is a one-to-one map between the original and the modified graphs. Our experiments demonstrate that extending receptive fields via positional encodings and a virtual fully-connected node significantly improves GNN performance and alleviates over-squashing using small r. We obtain improvements across models, showing state-of-the-art performance even using older architectures than recent Transformer models adapted to graphs.
    AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation. (arXiv:2201.12570v1 [q-bio.BM])
    Antibodies are canonically Y-shaped multimeric proteins capable of highly specific molecular recognition. The CDRH3 region located at the tip of variable chains of an antibody dominates antigen-binding specificity. Therefore, it is a priority to design optimal antigen-specific CDRH3 regions to develop therapeutic antibodies to combat harmful pathogens. However, the combinatorial nature of CDRH3 sequence space makes it impossible to search for an optimal binding sequence exhaustively and efficiently, especially not experimentally. Here, we present AntBO: a Combinatorial Bayesian Optimisation framework enabling efficient in silico design of the CDRH3 region. Ideally, antibodies should bind to their target antigen and be free from any harmful outcomes. Therefore, we introduce the CDRH3 trust region that restricts the search to sequences with feasible developability scores. To benchmark AntBO, we use the Absolut! software suite as a black-box oracle because it can score the target specificity and affinity of designed antibodies in silico in an unconstrained fashion. The results across 188 antigens demonstrate the benefit of AntBO in designing CDRH3 regions with diverse biophysical properties. In under 200 protein designs, AntBO can suggest antibody sequences that outperform the best binding sequence drawn from 6.9 million experimentally obtained CDRH3s and a commonly used genetic algorithm baseline. Additionally, AntBO finds very-high affinity CDRH3 sequences in only 38 protein designs whilst requiring no domain knowledge. We conclude AntBO brings automated antibody design methods closer to what is practically viable for in vitro experimentation.
    Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System. (arXiv:2201.12604v1 [cs.LG])
    Humans excel at continually learning from an ever-changing environment whereas it remains a challenge for deep neural networks which exhibit catastrophic forgetting. The complementary learning system (CLS) theory suggests that the interplay between rapid instance-based learning and slow structured learning in the brain is crucial for accumulating and retaining knowledge. Here, we propose CLS-ER, a novel dual memory experience replay (ER) method which maintains short-term and long-term semantic memories that interact with the episodic memory. Our method employs an effective replay mechanism whereby new knowledge is acquired while aligning the decision boundaries with the semantic memories. CLS-ER does not utilize the task boundaries or make any assumption about the distribution of the data which makes it versatile and suited for "general continual learning". Our approach achieves state-of-the-art performance on standard benchmarks as well as more realistic general continual learning settings.
    A Priori Denoising Strategies for Sparse Identification of Nonlinear Dynamical Systems: A Comparative Study. (arXiv:2201.12683v1 [stat.ML])
    In recent years, identification of nonlinear dynamical systems from data has become increasingly popular. Sparse regression approaches, such as Sparse Identification of Nonlinear Dynamics (SINDy), fostered the development of novel governing equation identification algorithms assuming the state variables are known a priori and the governing equations lend themselves to sparse, linear expansions in a (nonlinear) basis of the state variables. In the context of the identification of governing equations of nonlinear dynamical systems, one faces the problem of identifiability of model parameters when state measurements are corrupted by noise. Measurement noise affects the stability of the recovery process yielding incorrect sparsity patterns and inaccurate estimation of coefficients of the governing equations. In this work, we investigate and compare the performance of several local and global smoothing techniques to a priori denoise the state measurements and numerically estimate the state time-derivatives to improve the accuracy and robustness of two sparse regression methods to recover governing equations: Sequentially Thresholded Least Squares (STLS) and Weighted Basis Pursuit Denoising (WBPDN) algorithms. We empirically show that, in general, global methods, which use the entire measurement data set, outperform local methods, which employ a neighboring data subset around a local point. We additionally compare Generalized Cross Validation (GCV) and Pareto curve criteria as model selection techniques to automatically estimate near optimal tuning parameters, and conclude that Pareto curves yield better results. The performance of the denoising strategies and sparse regression methods is empirically evaluated through well-known benchmark problems of nonlinear dynamical systems.
    CoordX: Accelerating Implicit Neural Representation with a Split MLP Architecture. (arXiv:2201.12425v1 [cs.CV])
    Implicit neural representations with multi-layer perceptrons (MLPs) have recently gained prominence for a wide variety of tasks such as novel view synthesis and 3D object representation and rendering. However, a significant challenge with these representations is that both training and inference with an MLP over a large number of input coordinates to learn and represent an image, video, or 3D object, require large amounts of computation and incur long processing times. In this work, we aim to accelerate inference and training of coordinate-based MLPs for implicit neural representations by proposing a new split MLP architecture, CoordX. With CoordX, the initial layers are split to learn each dimension of the input coordinates separately. The intermediate features are then fused by the last layers to generate the learned signal at the corresponding coordinate point. This significantly reduces the amount of computation required and leads to large speedups in training and inference, while achieving similar accuracy as the baseline MLP. This approach thus aims at first learning functions that are a decomposition of the original signal and then fusing them to generate the learned signal. Our proposed architecture can be generally used for many implicit neural representation tasks with no additional memory overheads. We demonstrate a speedup of up to 2.92x compared to the baseline model for image, video, and 3D shape representation and rendering tasks.
    Private Boosted Decision Trees via Smooth Re-Weighting. (arXiv:2201.12648v1 [cs.LG])
    Protecting the privacy of people whose data is used by machine learning algorithms is important. Differential Privacy is the appropriate mathematical framework for formal guarantees of privacy, and boosted decision trees are a popular machine learning technique. So we propose and test a practical algorithm for boosting decision trees that guarantees differential privacy. Privacy is enforced because our booster never puts too much weight on any one example; this ensures that each individual's data never influences a single tree "too much." Experiments show that this boosting algorithm can produce better model sparsity and accuracy than other differentially private ensemble classifiers.
    Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing. (arXiv:2201.12428v1 [cs.LG])
    This paper demonstrates the systematic use of combinatorial coverage for selecting and characterizing test and training sets for machine learning models. The presented work adapts combinatorial interaction testing, which has been successfully leveraged in identifying faults in software testing, to characterize data used in machine learning. The MNIST hand-written digits data is used to demonstrate that combinatorial coverage can be used to select test sets that stress machine learning model performance, to select training sets that lead to robust model performance, and to select data for fine-tuning models to new domains. Thus, the results posit combinatorial coverage as a holistic approach to training and testing for machine learning. In contrast to prior work which has focused on the use of coverage in regard to the internal of neural networks, this paper considers coverage over simple features derived from inputs and outputs. Thus, this paper addresses the case where the supplier of test and training sets for machine learning models does not have intellectual property rights to the models themselves. Finally, the paper addresses prior criticism of combinatorial coverage and provides a rebuttal which advocates the use of coverage metrics in machine learning applications.
    A Context-Integrated Transformer-Based Neural Network for Auction Design. (arXiv:2201.12489v1 [cs.GT])
    One of the central problems in auction design is developing an incentive-compatible mechanism that maximizes the auctioneer's expected revenue. While theoretical approaches have encountered bottlenecks in multi-item auctions, recently, there has been much progress on finding the optimal mechanism through deep learning. However, these works either focus on a fixed set of bidders and items, or restrict the auction to be symmetric. In this work, we overcome such limitations by factoring \emph{public} contextual information of bidders and items into the auction learning framework. We propose $\mathtt{CITransNet}$, a context-integrated transformer-based neural network for optimal auction design, which maintains permutation-equivariance over bids and contexts while being able to find asymmetric solutions. We show by extensive experiments that $\mathtt{CITransNet}$ can recover the known optimal solutions in single-item settings, outperform strong baselines in multi-item auctions, and generalize well to cases other than those in training.
    Towards Fast and Accurate Federated Learning with non-IID Data for Cloud-Based IoT Applications. (arXiv:2201.12515v1 [cs.LG])
    As a promising method of central model training on decentralized device data while securing user privacy, Federated Learning (FL)is becoming popular in Internet of Things (IoT) design. However, when the data collected by IoT devices are highly skewed in a non-independent and identically distributed (non-IID) manner, the accuracy of vanilla FL method cannot be guaranteed. Although there exist various solutions that try to address the bottleneck of FL with non-IID data, most of them suffer from extra intolerable communication overhead and low model accuracy. To enable fast and accurate FL, this paper proposes a novel data-based device grouping approach that can effectively reduce the disadvantages of weight divergence during the training of non-IID data. However, since our grouping method is based on the similarity of extracted feature maps from IoT devices, it may incur additional risks of privacy exposure. To solve this problem, we propose an improved version by exploiting similarity information using the Locality-Sensitive Hashing (LSH) algorithm without exposing extracted feature maps. Comprehensive experimental results on well-known benchmarks show that our approach can not only accelerate the convergence rate, but also improve the prediction accuracy for FL with non-IID data.
    Towards Safe Reinforcement Learning with a Safety Editor Policy. (arXiv:2201.12427v1 [cs.LG])
    We consider the safe reinforcement learning (RL) problem of maximizing utility while satisfying provided constraints. Since we do not assume any prior knowledge or pre-training of the safety concept, we are interested in asymptotic constraint satisfaction. A popular approach in this line of research is to combine the Lagrangian method with a model-free RL algorithm to adjust the weight of the constraint reward dynamically. It relies on a single policy to handle the conflict between utility and constraint rewards, which is often challenging. Inspired by the safety layer design (Dalal et al., 2018), we propose to separately learn a safety editor policy that transforms potentially unsafe actions output by a utility maximizer policy into safe ones. The safety editor is trained to maximize the constraint reward while minimizing a hinge loss of the utility Q values of actions before and after the edit. On 12 custom Safety Gym (Ray et al., 2019) tasks and 2 safe racing tasks with very harsh constraint thresholds, our approach demonstrates outstanding utility performance while complying with the constraints. Ablation studies reveal that our two-policy design is critical. Simply doubling the model capacity of typical single-policy approaches will not lead to comparable results. The Q hinge loss is also important in certain circumstances, and replacing it with the usual L2 distance could fail badly.
    An Indirect Rate-Distortion Characterization for Semantic Sources: General Model and the Case of Gaussian Observation. (arXiv:2201.12477v1 [cs.IT])
    A new source model, which consists of an intrinsic state part and an extrinsic observation part, is proposed and its information-theoretic characterization, namely its rate-distortion function, is defined and analyzed. Such a source model is motivated by the recent surge of interest in the semantic aspect of information: the intrinsic state corresponds to the semantic feature of the source, which in general is not observable but can only be inferred from the extrinsic observation. There are two distortion measures, one between the intrinsic state and its reproduction, and the other between the extrinsic observation and its reproduction. Under a given code rate, the tradeoff between these two distortion measures is characterized by the rate-distortion function, which is solved via the indirect rate-distortion theory and is termed as the semantic rate-distortion function of the source. As an application of the general model and its analysis, the case of Gaussian extrinsic observation is studied, assuming a linear relationship between the intrinsic state and the extrinsic observation, under a quadratic distortion structure. The semantic rate-distortion function is shown to be the solution of a convex programming programming with respect to an error covariance matrix, and a reverse water-filling type of solution is provided when the model further satisfies a diagonalizability condition.
    Investigating Why Contrastive Learning Benefits Robustness Against Label Noise. (arXiv:2201.12498v1 [cs.LG])
    Self-supervised contrastive learning has recently been shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness, by having: (i) one prominent singular value corresponding to every sub-class in the data, and remaining significantly smaller singular values; and (ii) a large alignment between the prominent singular vector and the clean labels of each sub-class. The above properties allow a linear layer trained on the representations to quickly learn the clean labels, and prevent it from overfitting the noise for a large number of training iterations. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve a superior performance initially, when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels, e.g., an average of 27.18\% and 15.58\% increase in accuracy on CIFAR-10 and CIFAR-100 with 80\% symmetric noisy labels, and 4.11\% increase in accuracy on WebVision.
    FedGCN: Convergence and Communication Tradeoffs in Federated Training of Graph Convolutional Networks. (arXiv:2201.12433v1 [cs.LG])
    Distributed methods for training models on graph datasets have recently grown in popularity, due to the size of graph datasets as well as the private nature of graphical data like social networks. However, the graphical structure of this data means that it cannot be disjointly partitioned between different learning clients, leading to either significant communication overhead between clients or a loss of information available to the training method. We introduce Federated Graph Convolutional Network (FedGCN), which uses federated learning to train GCN models with optimized convergence rate and communication cost. Compared to prior methods that require communication among clients at each iteration, FedGCN preserves the privacy of client data and only needs communication at the initial step, which greatly reduces communication cost and speeds up the convergence rate. We theoretically analyze the tradeoff between FedGCN's convergence rate and communication cost under different data distributions, introducing a general framework can be generally used for the analysis of all edge-completion-based GCN training algorithms. Experimental results demonstrate the effectiveness of our algorithm and validate our theoretical analysis.
    Certifying Model Accuracy under Distribution Shifts. (arXiv:2201.12440v1 [cs.LG])
    Certified robustness in machine learning has primarily focused on adversarial perturbations of the input with a fixed attack budget for each point in the data distribution. In this work, we present provable robustness guarantees on the accuracy of a model under bounded Wasserstein shifts of the data distribution. We show that a simple procedure that randomizes the input of the model within a transformation space is provably robust to distributional shifts under the transformation. Our framework allows the datum-specific perturbation size to vary across different points in the input distribution and is general enough to include fixed-sized perturbations as well. Our certificates produce guaranteed lower bounds on the performance of the model for any (natural or adversarial) shift of the input distribution within a Wasserstein ball around the original distribution. We apply our technique to: (i) certify robustness against natural (non-adversarial) transformations of images such as color shifts, hue shifts and changes in brightness and saturation, (ii) certify robustness against adversarial shifts of the input distribution, and (iii) show provable lower bounds (hardness results) on the performance of models trained on so-called "unlearnable" datasets that have been poisoned to interfere with model training.
    Counterfactual Plans under Distributional Ambiguity. (arXiv:2201.12487v1 [cs.LG])
    Counterfactual explanations are attracting significant attention due to the flourishing applications of machine learning models in consequential domains. A counterfactual plan consists of multiple possibilities to modify a given instance so that the model's prediction will be altered. As the predictive model can be updated subject to the future arrival of new data, a counterfactual plan may become ineffective or infeasible with respect to the future values of the model parameters. In this work, we study the counterfactual plans under model uncertainty, in which the distribution of the model parameters is partially prescribed using only the first- and second-moment information. First, we propose an uncertainty quantification tool to compute the lower and upper bounds of the probability of validity for any given counterfactual plan. We then provide corrective methods to adjust the counterfactual plan to improve the validity measure. The numerical experiments validate our bounds and demonstrate that our correction increases the robustness of the counterfactual plans in different real-world datasets.
    Any-Play: An Intrinsic Augmentation for Zero-Shot Coordination. (arXiv:2201.12436v1 [cs.AI])
    Cooperative artificial intelligence with human or superhuman proficiency in collaborative tasks stands at the frontier of machine learning research. Prior work has tended to evaluate cooperative AI performance under the restrictive paradigms of self-play (teams composed of agents trained together) and cross-play (teams of agents trained independently but using the same algorithm). Recent work has indicated that AI optimized for these narrow settings may make for undesirable collaborators in the real-world. We formalize an alternative criteria for evaluating cooperative AI, referred to as inter-algorithm cross-play, where agents are evaluated on teaming performance with all other agents within an experiment pool with no assumption of algorithmic similarities between agents. We show that existing state-of-the-art cooperative AI algorithms, such as Other-Play and Off-Belief Learning, under-perform in this paradigm. We propose the Any-Play learning augmentation -- a multi-agent extension of diversity-based intrinsic rewards for zero-shot coordination (ZSC) -- for generalizing self-play-based algorithms to the inter-algorithm cross-play setting. We apply the Any-Play learning augmentation to the Simplified Action Decoder (SAD) and demonstrate state-of-the-art performance in the collaborative card game Hanabi.
    Mixed Integer Neural Inverse Design. (arXiv:2109.12888v2 [cs.GR] UPDATED)
    In computational design and fabrication, neural networks are becoming important surrogates for bulky forward simulations. A long-standing, intertwined question is that of inverse design: how to compute a design that satisfies a desired target performance? Here, we show that the piecewise linear property, very common in everyday neural networks, allows for an inverse design formulation based on mixed-integer linear programming. Our mixed-integer inverse design uncovers globally optimal or near optimal solutions in a principled manner. Furthermore, our method significantly facilitates emerging, but challenging, combinatorial inverse design tasks, such as material selection. For problems where finding the optimal solution is not desirable or tractable, we develop an efficient yet near-optimal hybrid optimization. Eventually, our method is able to find solutions provably robust to possible fabrication perturbations among multiple designs with similar performances.
    Physics-constrained Deep Learning for Robust Inverse ECG Modeling. (arXiv:2107.12780v2 [cs.LG] UPDATED)
    The rapid developments in advanced sensing and imaging bring about a data-rich environment, facilitating the effective modeling, monitoring, and control of complex systems. For example, the body-sensor network captures multi-channel information pertinent to the electrical activity of the heart (i.e., electrocardiograms (ECG)), which enables medical scientists to monitor and detect abnormal cardiac conditions. However, the high-dimensional sensing data are generally complexly structured and realizing the full data potential depends to a great extent on advanced analytical and predictive methods. This paper presents a physics-constrained deep learning (P-DL) framework for high-dimensional inverse ECG modeling. This method integrates the physical laws of the complex system with the advanced deep learning infrastructure for effective prediction of the system dynamics. The proposed P-DL approach is implemented to solve the inverse ECG model and predict the time-varying distribution of electric potentials in the heart from the ECG data measured by the body-surface sensor network. Experimental results show that the proposed P-DL method significantly outperforms existing methods that are commonly used in current practice.
    Lower Bounds and Accelerated Algorithms for Bilevel Optimization. (arXiv:2102.03926v4 [cs.LG] UPDATED)
    Bilevel optimization has recently attracted growing interests due to its wide applications in modern machine learning problems. Although recent studies have characterized the convergence rate for several such popular algorithms, it is still unclear how much further these convergence rates can be improved. In this paper, we address this fundamental question from two perspectives. First, we provide the first-known lower complexity bounds of $\widetilde{\Omega}(\frac{1}{\sqrt{\mu_x}\mu_y})$ and $\widetilde \Omega\big(\frac{1}{\sqrt{\epsilon}}\min\{\frac{1}{\mu_y},\frac{1}{\sqrt{\epsilon^{3}}}\}\big)$ respectively for strongly-convex-strongly-convex and convex-strongly-convex bilevel optimizations. Second, we propose an accelerated bilevel optimizer named AccBiO, for which we provide the first-known complexity bounds without the gradient boundedness assumption (which was made in existing analyses) under the two aforementioned geometries. We also provide significantly tighter upper bounds than the existing complexity when the bounded gradient assumption does hold. We show that AccBiO achieves the optimal results (i.e., the upper and lower bounds match up to logarithmic factors) when the inner-level problem takes a quadratic form with a constant-level condition number. Interestingly, our lower bounds under both geometries are larger than the corresponding optimal complexities of minimax optimization, establishing that bilevel optimization is provably more challenging than minimax optimization.
    Explainable Empirical Risk Minimization. (arXiv:2009.01492v2 [cs.LG] UPDATED)
    The successful application of machine learning (ML) methods becomes increasingly dependent on their interpretability or explainability. Designing explainable ML systems is instrumental to ensuring transparency of automated decision-making that targets humans. The explainability of ML methods is also an essential ingredient for trustworthy artificial intelligence. A key challenge in ensuring explainability is its dependence on the specific human user ("explainee"). The users of machine learning methods might have vastly different background knowledge about machine learning principles. One user might have a university degree in machine learning or related fields, while another user might have never received formal training in high-school mathematics. This paper applies information-theoretic concepts to develop a novel measure for the subjective explainability of the predictions delivered by a ML method. We construct this measure via the conditional entropy of predictions, given a user signal. This user signal might be obtained from user surveys or biophysical measurements. Our main contribution is the explainable empirical risk minimization (EERM) principle of learning a hypothesis that optimally balances between the subjective explainability and risk. The EERM principle is flexible and can be combined with arbitrary machine learning models. We present several practical implementations of EERM for linear models and decision trees. Numerical experiments demonstrate the application of EERM to detecting the use of inappropriate language on social media.
    Are Pre-trained Convolutions Better than Pre-trained Transformers?. (arXiv:2105.03322v2 [cs.CL] UPDATED)
    In the era of pre-trained language models, Transformers are the de facto choice of model architectures. While recent research has shown promise in entirely convolutional, or CNN, architectures, they have not been explored using the pre-train-fine-tune paradigm. In the context of language models, are convolutional models competitive to Transformers when pre-trained? This paper investigates this research question and presents several interesting findings. Across an extensive set of experiments on 8 datasets/tasks, we find that CNN-based pre-trained models are competitive and outperform their Transformer counterpart in certain scenarios, albeit with caveats. Overall, the findings outlined in this paper suggest that conflating pre-training and architectural advances is misguided and that both advances should be considered independently. We believe our research paves the way for a healthy amount of optimism in alternative architectures.
    DeepEMD: Differentiable Earth Mover's Distance for Few-Shot Learning. (arXiv:2003.06777v4 [cs.CV] UPDATED)
    In this work, we develop methods for few-shot image classification from a new perspective of optimal matching between image regions. We employ the Earth Mover's Distance (EMD) as a metric to compute a structural distance between dense image representations to determine image relevance. The EMD generates the optimal matching flows between structural elements that have the minimum matching cost, which is used to calculate the image distance for classification. To generate the important weights of elements in the EMD formulation, we design a cross-reference mechanism, which can effectively alleviate the adverse impact caused by the cluttered background and large intra-class appearance variations. To implement k-shot classification, we propose to learn a structured fully connected layer that can directly classify dense image representations with the EMD. Based on the implicit function theorem, the EMD can be inserted as a layer into the network for end-to-end training. Our extensive experiments validate the effectiveness of our algorithm which outperforms state-of-the-art methods by a significant margin on five widely used few-shot classification benchmarks, namely, miniImageNet, tieredImageNet, Fewshot-CIFAR100 (FC100), Caltech-UCSD Birds-200-2011 (CUB), and CIFAR-FewShot (CIFAR-FS). We also demonstrate the effectiveness of our method on the image retrieval task in our experiments.
    DeepRNG: Towards Deep Reinforcement Learning-Assisted Generative Testing of Software. (arXiv:2201.12602v1 [cs.SE])
    Although machine learning (ML) has been successful in automating various software engineering needs, software testing still remains a highly challenging topic. In this paper, we aim to improve the generative testing of software by directly augmenting the random number generator (RNG) with a deep reinforcement learning (RL) agent using an efficient, automatically extractable state representation of the software under test. Using the Cosmos SDK as the testbed, we show that the proposed DeepRNG framework provides a statistically significant improvement to the testing of the highly complex software library with over 350,000 lines of code. The source code of the DeepRNG framework is publicly available online.
    Prediction of terephthalic acid (TPA) yield in aqueous hydrolysis of polyethylene terephthalate (PET). (arXiv:2201.12657v1 [cs.LG])
    Aqueous hydrolysis is used to chemically recycle polyethylene terephthalate (PET) due to the production of high-quality terephthalic acid (TPA), the PET monomer. PET hydrolysis depends on various reaction conditions including PET size, catalyst concentration, reaction temperature, etc. So, modeling PET hydrolysis by considering the effective factors can provide useful information for material scientists to specify how to design and run these reactions. It will save time, energy, and materials by optimizing the hydrolysis conditions. Machine learning algorithms enable to design models to predict output results. For the first time, 381 experimental data were gathered to model the aqueous hydrolysis of PET. Effective reaction conditions on PET hydrolysis were connected to TPA yield. The logistic regression was applied to rank the reaction conditions. Two algorithms were proposed, artificial neural network multilayer perceptron (ANN-MLP) and adaptive network-based fuzzy inference system (ANFIS). The dataset was divided into training and testing sets to train and test the models, respectively. The models predicted TPA yield sufficiently where the ANFIS model outperformed. R-squared (R2) and Root Mean Square Error (RMSE) loss functions were employed to measure the efficiency of the models and evaluate their performance.
    Approximate Bayesian Computation Based on Maxima Weighted Isolation Kernel Mapping. (arXiv:2201.12745v1 [stat.ML])
    Motivation: The branching processes model yields unevenly stochastically distributed data that consists of sparse and dense regions. The work tries to solve the problem of a precise evaluation of a parameter for this type of model. The application of the branching processes model to cancer cell evolution has many difficulties like high dimensionality and the rare appearance of a result of interest. Moreover, we would like to solve the ambitious task of obtaining the coefficients of the model reflecting the relationship of driver genes mutations and cancer hallmarks on the basis of personal data of variant allele frequencies. Results: The Approximate Bayesian computation method based on the Isolation kernel is designed. The method includes a transformation row data to a Hilbert space (mapping) and measures the similarity between simulation points and maxima weighted Isolation kernel mapping related to the observation point. Also, we designed a heuristic algorithm to find parameter estimation without gradient calculation and dimension-independent. The advantage of the proposed machine learning method is shown for multidimensional test data as well as for an example of cancer cell evolution.
    Learning to Coordinate with Humans using Action Features. (arXiv:2201.12658v1 [cs.LG])
    An unaddressed challenge in human-AI coordination is to enable AI agents to exploit the semantic relationships between the features of actions and the features of observations. Humans take advantage of these relationships in highly intuitive ways. For instance, in the absence of a shared language, we might point to the object we desire or hold up our fingers to indicate how many objects we want. To address this challenge, we investigate the effect of network architecture on the propensity of learning algorithms to exploit these semantic relationships. Across a procedurally generated coordination task, we find that attention-based architectures that jointly process a featurized representation of observations and actions have a better inductive bias for zero-shot coordination. Through fine-grained evaluation and scenario analysis, we show that the resulting policies are human-interpretable. Moreover, such agents coordinate with people without training on any human data.
    Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms. (arXiv:2201.12700v1 [cs.LG])
    Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction $\alpha < 1/2$ of tasks (users) are arbitrary and adversarial. The remaining fraction of good users share the same instance of contextual bandits with $S$ contexts and $A$ actions (items). Naturally, whether a user is good or adversarial is not known in advance. The goal is to robustly learn the policy that maximizes rewards for good users with as few user interactions as possible. Without adversarial users, established results in collaborative filtering show that $O(1/\epsilon^2)$ per-user interactions suffice to learn a good policy, precisely because information can be shared across users. This parallelization gain is fundamentally altered by the presence of adversarial users: unless there are super-polynomial number of users, we show a lower bound of $\tilde{\Omega}(\min(S,A) \cdot \alpha^2 / \epsilon^2)$ {\it per-user} interactions to learn an $\epsilon$-optimal policy for the good users. We then show we can achieve an $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$ upper-bound, by employing efficient robust mean estimators for both uni-variate and high-dimensional random variables. We also show that this can be improved depending on the distributions of contexts.
    Error Rates for Kernel Classification under Source and Capacity Conditions. (arXiv:2201.12655v1 [stat.ML])
    In this manuscript, we consider the problem of kernel classification under the Gaussian data design, and under source and capacity assumptions on the dataset. While the decay rates of the prediction error have been extensively studied under much more generic assumptions for kernel ridge regression, deriving decay rates for the classification problem has been hitherto considered a much more challenging task. In this work we leverage recent analytical results for learning curves of linear classification with generic loss function to derive the rates of decay of the misclassification (prediction) error with the sample complexity for two standard classification settings, namely margin-maximizing Support Vector Machines (SVM) and ridge classification. Using numerical and analytical arguments, we derive the error rates as a function of the source and capacity coefficients, and contrast the two methods.
    Learning Stochastic Graph Neural Networks with Constrained Variance. (arXiv:2201.12611v1 [eess.SP])
    Stochastic graph neural networks (SGNNs) are information processing architectures that learn representations from data over random graphs. SGNNs are trained with respect to the expected performance, which comes with no guarantee about deviations of particular output realizations around the optimal expectation. To overcome this issue, we propose a variance-constrained optimization problem for SGNNs, balancing the expected performance and the stochastic deviation. An alternating primal-dual learning procedure is undertaken that solves the problem by updating the SGNN parameters with gradient descent and the dual variable with gradient ascent. To characterize the explicit effect of the variance-constrained learning, we conduct a theoretical analysis on the variance of the SGNN output and identify a trade-off between the stochastic robustness and the discrimination power. We further analyze the duality gap of the variance-constrained optimization problem and the converging behavior of the primal-dual learning procedure. The former indicates the optimality loss induced by the dual transformation and the latter characterizes the limiting error of the iterative algorithm, both of which guarantee the performance of the variance-constrained learning. Through numerical simulations, we corroborate our theoretical findings and observe a strong expected performance with a controllable standard deviation.
    Robustness of Deep Recommendation Systems to Untargeted Interaction Perturbations. (arXiv:2201.12686v1 [cs.IR])
    While deep learning-based sequential recommender systems are widely used in practice, their sensitivity to untargeted training data perturbations is unknown. Untargeted perturbations aim to modify ranked recommendation lists for all users at test time, by inserting imperceptible input perturbations during training time. Existing perturbation methods are mostly targeted attacks optimized to change ranks of target items, but not suitable for untargeted scenarios. In this paper, we develop a novel framework in which user-item training interactions are perturbed in unintentional and adversarial settings. First, through comprehensive experiments on four datasets, we show that four popular recommender models are unstable against even one random perturbation. Second, we establish a cascading effect in which minor manipulations of early training interactions can cause extensive changes to the model and the generated recommendations for all users. Leveraging this effect, we propose an adversarial perturbation method CASPER which identifies and perturbs an interaction that induces the maximal cascading effect. Experimentally, we demonstrate that CASPER reduces the stability of recommendation models the most, compared to several baselines and state-of-the-art methods. Finally, we show the runtime and success of CASPER scale near-linearly with the dataset size and the number of perturbations, respectively.
    Explaining Reinforcement Learning Policies through Counterfactual Trajectories. (arXiv:2201.12462v1 [cs.LG])
    In order for humans to confidently decide where to employ RL agents for real-world tasks, a human developer must validate that the agent will perform well at test-time. Some policy interpretability methods facilitate this by capturing the policy's decision making in a set of agent rollouts. However, even the most informative trajectories of training time behavior may give little insight into the agent's behavior out of distribution. In contrast, our method conveys how the agent performs under distribution shifts by showing the agent's behavior across a wider trajectory distribution. We generate these trajectories by guiding the agent to more diverse unseen states and showing the agent's behavior there. In a user study, we demonstrate that our method enables users to score better than baseline methods on one of two agent validation tasks.
    Challenges and approaches to privacy preserving post-click conversion prediction. (arXiv:2201.12666v1 [cs.LG])
    Online advertising has typically been more personalized than offline advertising, through the use of machine learning models and real-time auctions for ad targeting. One specific task, predicting the likelihood of conversion (i.e.\ the probability a user will purchase the advertised product), is crucial to the advertising ecosystem for both targeting and pricing ads. Currently, these models are often trained by observing individual user behavior, but, increasingly, regulatory and technical constraints are requiring privacy-preserving approaches. For example, major platforms are moving to restrict tracking individual user events across multiple applications, and governments around the world have shown steadily more interest in regulating the use of personal data. Instead of receiving data about individual user behavior, advertisers may receive privacy-preserving feedback, such as the number of installs of an advertised app that resulted from a group of users. In this paper we outline the recent privacy-related changes in the online advertising ecosystem from a machine learning perspective. We provide an overview of the challenges and constraints when learning conversion models in this setting. We introduce a novel approach for training these models that makes use of post-ranking signals. We show using offline experiments on real world data that it outperforms a model relying on opt-in data alone, and significantly reduces model degradation when no individual labels are available. Finally, we discuss future directions for research in this evolving area.
    Extracting Finite Automata from RNNs Using State Merging. (arXiv:2201.12451v1 [cs.LG])
    One way to interpret the behavior of a blackbox recurrent neural network (RNN) is to extract from it a more interpretable discrete computational model, like a finite state machine, that captures its behavior. In this work, we propose a new method for extracting finite automata from RNNs inspired by the state merging paradigm from grammatical inference. We demonstrate the effectiveness of our method on the Tomita languages benchmark, where we find that it is able to extract faithful automata from RNNs trained on all languages in the benchmark. We find that extraction performance is aided by the number of data provided during the extraction process, as well as, curiously, whether the RNN model is trained for additional epochs after perfectly learning its target language. We use our method to analyze this phenomenon, finding that training beyond convergence is useful because it leads to compression of the internal state space of the RNN. This finding demonstrates how our method can be used for interpretability and analysis of trained RNN models.
    A Simple Guard for Learned Optimizers. (arXiv:2201.12426v1 [cs.LG])
    If the trend of learned components eventually outperforming their hand-crafted version continues, learned optimizers will eventually outperform hand-crafted optimizers like SGD or Adam. Even if learned optimizers (L2Os) eventually outpace hand-crafted ones in practice however, they are still not provably convergent and might fail out of distribution. These are the questions addressed here. Currently, learned optimizers frequently outperform generic hand-crafted optimizers (such as gradient descent) at the beginning of learning but they generally plateau after some time while the generic algorithms continue to make progress and often overtake the learned algorithm as Aesop's tortoise which overtakes the hare and are not. L2Os also still have a difficult time generalizing out of distribution. (Heaton et al., 2020) proposed Safeguarded L2O (GL2O) which can take a learned optimizer and safeguard it with a generic learning algorithm so that by conditionally switching between the two, the resulting algorithm is provably convergent. We propose a new class of Safeguarded L2O, called Loss-Guarded L2O (LGL2O), which is both conceptually simpler and computationally less expensive. The guarding mechanism decides solely based on the expected future loss value of both optimizers. Furthermore, we show theoretical proof of LGL2O's convergence guarantee and empirical results comparing to GL2O and other baselines showing that it combines the best of both L2O and SGD and and in practice converges much better than GL2O.
    Interconnect Parasitics and Partitioning in Fully-Analog In-Memory Computing Architectures. (arXiv:2201.12480v1 [cs.AR])
    Fully-analog in-memory computing (IMC) architectures that implement both matrix-vector multiplication and non-linear vector operations within the same memory array have shown promising performance benefits over conventional IMC systems due to the removal of energy-hungry signal conversion units. However, maintaining the computation in the analog domain for the entire deep neural network (DNN) comes with potential sensitivity to interconnect parasitics. Thus, in this paper, we investigate the effect of wire parasitic resistance and capacitance on the accuracy of DNN models deployed on fully-analog IMC architectures. Moreover, we propose a partitioning mechanism to alleviate the impact of the parasitic while keeping the computation in the analog domain through dividing large arrays into multiple partitions. The SPICE circuit simulation results for a 400 X 120 X 84 X 10 DNN model deployed on a fully-analog IMC circuit show that a 94.84% accuracy could be achieved for MNIST classification application with 16, 8, and 8 horizontal partitions, as well as 8, 8, and 1 vertical partitions for first, second, and third layers of the DNN, respectively, which is comparable to the ~97% accuracy realized by digital implementation on CPU. It is shown that accuracy benefits are achieved at the cost of higher power consumption due to the extra circuitry required for handling partitioning.
    Discovering Nonlinear PDEs from Scarce Data with Physics-encoded Learning. (arXiv:2201.12354v1 [cs.LG])
    There have been growing interests in leveraging experimental measurements to discover the underlying partial differential equations (PDEs) that govern complex physical phenomena. Although past research attempts have achieved great success in data-driven PDE discovery, the robustness of the existing methods cannot be guaranteed when dealing with low-quality measurement data. To overcome this challenge, we propose a novel physics-encoded discrete learning framework for discovering spatiotemporal PDEs from scarce and noisy data. The general idea is to (1) firstly introduce a novel deep convolutional-recurrent network, which can encode prior physics knowledge (e.g., known PDE terms, assumed PDE structure, initial/boundary conditions, etc.) while remaining flexible on representation capability, to accurately reconstruct high-fidelity data, and (2) perform sparse regression with the reconstructed data to identify the explicit form of the governing PDEs. We validate our method on three nonlinear PDE systems. The effectiveness and superiority of the proposed method over baseline models are demonstrated.
    Error-Correcting Neural Networks for Two-Dimensional Curvature Computation in the Level-Set Method. (arXiv:2201.12342v1 [math.NA])
    We present an error-neural-modeling-based strategy for approximating two-dimensional curvature in the level-set method. Our main contribution is a redesigned hybrid solver (Larios-C\'{a}rdenas and Gibou (2021)[1]) that relies on numerical schemes to enable machine-learning operations on demand. In particular, our routine features double predicting to harness curvature symmetry invariance in favor of precision and stability. As in [1], the core of this solver is a multilayer perceptron trained on circular- and sinusoidal-interface samples. Its role is to quantify the error in numerical curvature approximations and emit corrected estimates for select grid vertices along the free boundary. These corrections arise in response to preprocessed context level-set, curvature, and gradient data. To promote neural capacity, we have adopted sample negative-curvature normalization, reorientation, and reflection-based augmentation. In the same manner, our system incorporates dimensionality reduction, well-balancedness, and regularization to minimize outlying effects. Our training approach is likewise scalable across mesh sizes. For this purpose, we have introduced dimensionless parametrization and probabilistic subsampling during data production. Together, all these elements have improved the accuracy and efficiency of curvature calculations around under-resolved regions. In most experiments, our strategy has outperformed the numerical baseline at twice the number of redistancing steps while requiring only a fraction of the cost.
    How to Sense the World: Leveraging Hierarchy in Multimodal Perception for Robust Reinforcement Learning Agents. (arXiv:2110.03608v2 [cs.LG] UPDATED)
    This work addresses the problem of sensing the world: how to learn a multimodal representation of a reinforcement learning agent's environment that allows the execution of tasks under incomplete perceptual conditions. To address such problem, we argue for hierarchy in the design of representation models and contribute with a novel multimodal representation model, MUSE. The proposed model learns hierarchical representations: low-level modality-specific representations, encoded from raw observation data, and a high-level multimodal representation, encoding joint-modality information to allow robust state estimation. We employ MUSE as the sensory representation model of deep reinforcement learning agents provided with multimodal observations in Atari games. We perform a comparative study over different designs of reinforcement learning agents, showing that MUSE allows agents to perform tasks under incomplete perceptual experience with minimal performance loss. Finally, we evaluate the performance of MUSE in literature-standard multimodal scenarios with higher number and more complex modalities, showing that it outperforms state-of-the-art multimodal variational autoencoders in single and cross-modality generation.
    Semi-discrete optimization through semi-discrete optimal transport: a framework for neural architecture search. (arXiv:2006.15221v2 [math.AP] UPDATED)
    In this paper we introduce a theoretical framework for semi-discrete optimization using ideas from optimal transport. Our primary motivation is in the field of deep learning, and specifically in the task of neural architecture search. With this aim in mind, we discuss the geometric and theoretical motivation for new techniques for neural architecture search (in a companion paper we show that algorithms inspired by our framework are competitive with contemporaneous methods). We introduce a Riemannian-like metric on the space of probability measures over a semi-discrete space $\mathbb{R}^d \times \mathcal{G}$ where $\mathcal{G}$ is a finite weighted graph. With such Riemmanian structure in hand, we derive formal expressions for the gradient flow of a relative entropy functional, as well as second order dynamics for the optimization of said energy. Then, with the aim of providing a rigorous motivation for the gradient flow equations derived formally, we also consider an iterative procedure known as minimizing movement scheme (i.e., Implicit Euler scheme, or JKO scheme) and apply it to the relative entropy with respect to a suitable cost function. For some specific choices of metric and cost, we rigorously show that the minimizing movement scheme of the relative entropy functional converges to the gradient flow process provided by the formal Riemannian structure. This flow coincides with a system of reaction-diffusion equations on $\mathbb{R}^d$.
    Low-rank features based double transformation matrices learning for image classification. (arXiv:2201.12351v1 [cs.LG])
    Linear regression is a supervised method that has been widely used in classification tasks. In order to apply linear regression to classification tasks, a technique for relaxing regression targets was proposed. However, methods based on this technique ignore the pressure on a single transformation matrix due to the complex information contained in the data. A single transformation matrix in this case is too strict to provide a flexible projection, thus it is necessary to adopt relaxation on transformation matrix. This paper proposes a double transformation matrices learning method based on latent low-rank feature extraction. The core idea is to use double transformation matrices for relaxation, and jointly projecting the learned principal and salient features from two directions into the label space, which can share the pressure of a single transformation matrix. Firstly, the low-rank features are learned by the latent low rank representation (LatLRR) method which processes the original data from two directions. In this process, sparse noise is also separated, which alleviates its interference on projection learning to some extent. Then, two transformation matrices are introduced to process the two features separately, and the information useful for the classification is extracted. Finally, the two transformation matrices can be easily obtained by alternate optimization methods. Through such processing, even when a large amount of redundant information is contained in samples, our method can also obtain projection results that are easy to classify. Experiments on multiple data sets demonstrate the effectiveness of our approach for classification, especially for complex scenarios.
    Convolutional Filtering in Simplicial Complexes. (arXiv:2201.12584v1 [eess.SP])
    This paper proposes convolutional filtering for data whose structure can be modeled by a simplicial complex (SC). SCs are mathematical tools that not only capture pairwise relationships as graphs but account also for higher-order network structures. These filters are built by following the shift-and-sum principle of the convolution operation and rely on the Hodge-Laplacians to shift the signal within the simplex. But since in SCs we have also inter-simplex coupling, we use the incidence matrices to transfer the signal in adjacent simplices and build a filter bank to jointly filter signals from different levels. We prove some interesting properties for the proposed filter bank, including permutation and orientation equivariance, a computational complexity that is linear in the SC dimension, and a spectral interpretation using the simplicial Fourier transform. We illustrate the proposed approach with numerical experiments.
    Bellman Meets Hawkes: Model-Based Reinforcement Learning via Temporal Point Processes. (arXiv:2201.12569v1 [cs.LG])
    We consider a sequential decision making problem where the agent faces the environment characterized by the stochastic discrete events and seeks an optimal intervention policy such that its long-term reward is maximized. This problem exists ubiquitously in social media, finance and health informatics but is rarely investigated by the conventional research in reinforcement learning. To this end, we present a novel framework of the model-based reinforcement learning where the agent's actions and observations are asynchronous stochastic discrete events occurring in continuous-time. We model the dynamics of the environment by Hawkes process with external intervention control term and develop an algorithm to embed such process in the Bellman equation which guides the direction of the value gradient. We demonstrate the superiority of our method in both synthetic simulator and real-world problem.
    GARNET: Reduced-Rank Topology Learning for Robust and Scalable Graph Neural Networks. (arXiv:2201.12741v1 [cs.LG])
    Graph neural networks (GNNs) have been increasingly deployed in various applications that involve learning on non-Euclidean data. However, recent studies show that GNNs are vulnerable to graph adversarial attacks. Although there are several defense methods to improve GNN robustness by eliminating adversarial components, they may also impair the underlying clean graph structure that contributes to GNN training. In addition, few of those defense models can scale to large graphs due to their high computational complexity and memory usage. In this paper, we propose GARNET, a scalable spectral method to boost the adversarial robustness of GNN models. GARNET first leverages weighted spectral embedding to construct a base graph, which is not only resistant to adversarial attacks but also contains critical (clean) graph structure for GNN training. Next, GARNET further refines the base graph by pruning additional uncritical edges based on probabilistic graphical model. GARNET has been evaluated on various datasets, including a large graph with millions of nodes. Our extensive experiment results show that GARNET achieves adversarial accuracy improvement and runtime speedup over state-of-the-art GNN (defense) models by up to 13.27% and 14.7x, respectively.
    Sample Prior Guided Robust Model Learning to Suppress Noisy Labels. (arXiv:2112.01197v3 [cs.CV] UPDATED)
    Imperfect labels are ubiquitous in real-world datasets and seriously harm the model performance. Several recent effective methods for handling noisy labels have two key steps: 1) dividing samples into cleanly labeled and wrongly labeled sets by training loss, 2) using semi-supervised methods to generate pseudo-labels for samples in the wrongly labeled set. However, current methods always hurt the informative hard samples due to the similar loss distribution between the hard samples and the noisy ones. In this paper, we proposed PGDF (Prior Guided Denoising Framework), a novel framework to learn a deep model to suppress noise by generating the samples' prior knowledge, which is integrated into both dividing samples step and semi-supervised step. Our framework can save more informative hard clean samples into the cleanly labeled set. Besides, our framework also promotes the quality of pseudo-labels during the semi-supervised step by suppressing the noise in the current pseudo-labels generating scheme. To further enhance the hard samples, we reweight the samples in the cleanly labeled set during training. We evaluated our method using synthetic datasets based on CIFAR-10 and CIFAR-100, as well as on the real-world datasets WebVision and Clothing1M. The results demonstrate substantial improvements over state-of-the-art methods.  ( 2 min )
    NormVAE: Normative Modeling on Neuroimaging Data using Variational Autoencoders. (arXiv:2110.04903v2 [eess.IV] UPDATED)
    Normative modelling is an emerging method for understanding the heterogeneous biology underlying neuropsychiatric and neurodegenerative disorders at the level of the individual participant. Deep learning techniques, specially autoencoders have been implemented as normative models, where patient-level deviations are modelled as the squared difference between the actual and reconstructed input without any uncertainty estimates in the deviations. In this study, we assessed NormVAE, a novel Variational Autoencoder (VAE) based normative modelling approach which calculates subject-level deviation maps quantifying uncertainty in the deviations. First, we train NormVAE on the neuroimaging data of healthy population. Next, we assess the trained model on Alzheimer's Disease (AD) patients to estimate how each disease patient deviated from the norm and identified the brain regions associated with the deviations. Our experiments demonstrated that the NormVAE-generated patient-level deviation maps exhibit increased sensitivity to the different progression stages of AD and have a better correlation with cognition compared to baseline normative models like Gaussian Process Regression (GPR) and a standard VAE, which generates deterministic subject-level deviations without any uncertainty estimates.  ( 2 min )
    CounterNet: End-to-End Training of Counterfactual Aware Predictions. (arXiv:2109.07557v2 [cs.LG] UPDATED)
    This work presents CounterNet, a novel end-to-end learning framework which integrates Machine Learning (ML) model training and the generation of corresponding counterfactual (CF) explanations into a single end-to-end pipeline. Counterfactual explanations offer a contrastive case, i.e., they attempt to find the smallest modification to the feature values of an instance that changes the prediction of the ML model on that instance to a predefined output. Prior techniques for generating CF explanations suffer from two major limitations: (i) they generate CF explanations by solving separate time-intensive optimization problems for every single input instance (which slows their running time); and (ii) they are post-hoc methods which are designed for use with proprietary black-box ML models - as a result, their procedure for generating CF explanations is uninformed by the training procedure of the black-box ML model, which leads to misalignment of objectives between model predictions and explanations. These two limitations result in significant shortcomings in the quality of the generated CF explanations. CounterNet, on the other hand, integrates both prediction and explanation in the same framework, which enables the optimization of the CF explanation generation only once together with the predictive model. We adopt a theoretically sound block-wise coordinate descent procedure which helps in effectively training CounterNet's network architecture. Finally, we evaluate CounterNet's performance by conducting extensive experiments on multiple real-world datasets. Our results show that CounterNet generates high-quality predictions, and corresponding CF explanations (with well-balanced cost-invalidity trade-offs) for any new input instance significantly faster than existing state-of-the-art baselines.  ( 2 min )
    Measuring Complexity of Learning Schemes Using Hessian-Schatten Total Variation. (arXiv:2112.06209v2 [cs.LG] UPDATED)
    In this paper, we introduce the Hessian-Schatten total variation (HTV) -- a novel seminorm that quantifies the total "rugosity" of multivariate functions. Our motivation for defining HTV is to assess the complexity of supervised-learning schemes. We start by specifying the adequate matrix-valued Banach spaces that are equipped with suitable classes of mixed norms. We then show that the HTV is invariant to rotations, scalings, and translations. Additionally, its minimum value is achieved for linear mappings, which supports the common intuition that linear regression is the least complex learning model. Next, we present closed-form expressions of the HTV for two general classes of functions. The first one is the class of Sobolev functions with a certain degree of regularity, for which we show that the HTV coincides with the Hessian-Schatten seminorm that is sometimes used as a regularizer for image reconstruction. The second one is the class of continuous and piecewise-linear (CPWL) functions. In this case, we show that the HTV reflects the total change in slopes between linear regions that have a common facet. Hence, it can be viewed as a convex relaxation (l1-type) of the number of linear regions (l0-type) of CPWL mappings. Finally, we illustrate the use of our proposed seminorm.  ( 2 min )
    Realistic galaxy image simulation via score-based generative models. (arXiv:2111.01713v2 [astro-ph.IM] UPDATED)
    We show that a Denoising Diffusion Probabalistic Model (DDPM), a class of score-based generative model, can be used to produce realistic mock images that mimic observations of galaxies. Our method is tested with Dark Energy Spectroscopic Instrument (DESI) grz imaging of galaxies from the Photometry and Rotation curve OBservations from Extragalactic Surveys (PROBES) sample and galaxies selected from the Sloan Digital Sky Survey. Subjectively, the generated galaxies are highly realistic when compared with samples from the real dataset. We quantify the similarity by borrowing from the deep generative learning literature, using the `Fr\'echet Inception Distance' to test for subjective and morphological similarity. We also introduce the `Synthetic Galaxy Distance' metric to compare the emergent physical properties (such as total magnitude, colour and half light radius) of a ground truth parent and synthesised child dataset. We argue that the DDPM approach produces sharper and more realistic images than other generative methods such as Adversarial Networks (with the downside of more costly inference), and could be used to produce large samples of synthetic observations tailored to a specific imaging survey. We demonstrate two potential uses of the DDPM: (1) accurate in-painting of occluded data, such as satellite trails, and (2) domain transfer, where new input images can be processed to mimic the properties of the DDPM training set. Here we `DESI-fy' cartoon images as a proof of concept for domain transfer. Finally, we suggest potential applications for score-based approaches that could motivate further research on this topic within the astronomical community.  ( 2 min )
    Multiplication-Avoiding Variant of Power Iteration with Applications. (arXiv:2110.12065v2 [eess.SP] UPDATED)
    Power iteration is a fundamental algorithm in data analysis. It extracts the eigenvector corresponding to the largest eigenvalue of a given matrix. Applications include ranking algorithms, recommendation systems, principal component analysis (PCA), among many others. In this paper, we introduce multiplication-avoiding power iteration (MAPI), which replaces the standard $\ell_2$-inner products that appear at the regular power iteration (RPI) with multiplication-free vector products which are Mercer-type kernel operations related with the $\ell_1$ norm. Precisely, for an $n\times n$ matrix, MAPI requires $n$ multiplications, while RPI needs $n^2$ multiplications per iteration. Therefore, MAPI provides a significant reduction of the number of multiplication operations, which are known to be costly in terms of energy consumption. We provide applications of MAPI to PCA-based image reconstruction as well as to graph-based ranking algorithms. When compared to RPI, MAPI not only typically converges much faster, but also provides superior performance.  ( 2 min )
    Reinforcement Learning with Dynamic Convex Risk Measures. (arXiv:2112.13414v2 [cs.LG] UPDATED)
    We develop an approach for solving time-consistent risk-sensitive stochastic optimization problems using model-free reinforcement learning (RL). Specifically, we assume agents assess the risk of a sequence of random variables using dynamic convex risk measures. We employ a time-consistent dynamic programming principle to determine the value of a particular policy, and develop policy gradient update rules that aid in obtaining optimal policies. We further develop an actor-critic style algorithm using neural networks to optimize over policies. Finally, we demonstrate the performance and flexibility of our approach by applying it to three optimization problems: statistical arbitrage trading strategies, obstacle avoidance robot control, and financial hedging.  ( 2 min )
    Mind-proofing Your Phone: Navigating the Digital Minefield with GreaseTerminator. (arXiv:2112.10699v2 [cs.HC] UPDATED)
    Digital harms are widespread in the mobile ecosystem. As these devices gain ever more prominence in our daily lives, so too increases the potential for malicious attacks against individuals. The last line of defense against a range of digital harms - including digital distraction, political polarisation through hate speech, and children being exposed to damaging material - is the user interface. This work introduces GreaseTerminator to enable researchers to develop, deploy, and test interventions against these harms with end-users. We demonstrate the ease of intervention development and deployment, as well as the broad range of harms potentially covered with GreaseTerminator in five in-depth case studies.  ( 2 min )
    Explainable multiple abnormality classification of chest CT volumes with deep learning. (arXiv:2111.12215v2 [eess.IV] UPDATED)
    Understanding model predictions is critical in healthcare, to facilitate rapid verification of model correctness and to guard against use of models that exploit confounding variables. We introduce the challenging new task of explainable multiple abnormality classification in volumetric medical images, in which a model must indicate the regions used to predict each abnormality. To solve this task, we propose a multiple instance learning convolutional neural network, AxialNet, that allows identification of top slices for each abnormality. Next we incorporate HiResCAM, an attention mechanism, to identify sub-slice regions. We prove that for AxialNet, HiResCAM explanations are guaranteed to reflect the locations the model used, unlike Grad-CAM which sometimes highlights irrelevant locations. Armed with a model that produces faithful explanations, we then aim to improve the model's learning through a novel mask loss that leverages HiResCAM and 3D allowed regions to encourage the model to predict abnormalities based only on the organs in which those abnormalities appear. The 3D allowed regions are obtained automatically through a new approach, PARTITION, that combines location information extracted from radiology reports with organ segmentation maps obtained through morphological image processing. Overall, we propose the first model for explainable multi-abnormality prediction in volumetric medical images, and then use the mask loss to achieve a 33% improvement in organ localization of multiple abnormalities in the RAD-ChestCT data set of 36,316 scans, representing the state of the art. This work advances the clinical applicability of multiple abnormality modeling in chest CT volumes.  ( 2 min )
    Learning to Control Direct Current Motor for Steering in Real Time via Reinforcement Learning. (arXiv:2108.00138v2 [cs.LG] UPDATED)
    Learning to quickly control a complex dynamical system (continuous and non-linear) in the presence of disturbances and uncertainties is often desired in many industrial and robotic applications. However, techniques to accomplish this task usually rely on a mathematical system model, which is often insufficient to anticipate the effects of time-varying and interrelated sources of non-linearities. Furthermore, most model-free approaches that have been successful at this task rely on massive interactions with the system (usually in a simulation) and are trained in specialized hardware to fit a highly parameterized controller. In this work, we learn to control one such dynamical system (steering position control of a DC motor) using the sample efficient technique named Neural fitted Q. Using data collected from hardware interactions in the real world, we additionally build a simulator to experiment with a wide range of parameters and learning strategies. Using the parameters found in simulation, we successfully learn an effective control policy in one minute and 53 seconds on a simulation and in 10 minutes and 35 seconds on a physical system.  ( 2 min )
    Quantifying Epistemic Uncertainty in Deep Learning. (arXiv:2110.12122v2 [cs.LG] UPDATED)
    Uncertainty quantification is at the core of the reliability and robustness of machine learning. In this paper, we provide a theoretical framework to dissect the uncertainty, especially the epistemic component, in deep learning into procedural variability (from the training procedure) and data variability (from the training data), which is the first such attempt in the literature to our best knowledge. We then propose two approaches to estimate these uncertainties, one based on influence function and one on batching. We demonstrate how our approaches overcome the computational difficulties in applying classical statistical methods. Experimental evaluations on multiple problem settings corroborate our theory and illustrate how our framework and estimation can provide direct guidance on modeling and data collection effort to improve deep learning performance.  ( 2 min )
    Differentially Private Variable Selection via the Knockoff Filter. (arXiv:2109.05402v3 [stat.ML] UPDATED)
    The knockoff filter, recently developed by Barber and Candes, is an effective procedure to perform variable selection with a controlled false discovery rate (FDR). We propose a private version of the knockoff filter by incorporating Gaussian and Laplace mechanisms, and show that variable selection with controlled FDR can be achieved. Simulations demonstrate that our setting has reasonable statistical power.  ( 2 min )
    Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Streaming Data. (arXiv:2109.07117v2 [cs.LG] UPDATED)
    Motivated by the high-frequency data streams continuously generated, real-time learning is becoming increasingly important. These data streams should be processed sequentially with the property that the data stream may change over time. In this streaming setting, we propose techniques for minimizing convex objectives through unbiased estimates of their gradients, commonly referred to as stochastic approximation problems. Our methods rely on stochastic approximation algorithms because of their applicability and computational advantages. The reasoning includes iterate averaging that guarantees optimal statistical efficiency under classical conditions. Our non-asymptotic analysis shows accelerated convergence by selecting the learning rate according to the expected data streams. We show that the average estimate converges optimally and robustly for any data stream rate. In addition, noise reduction can be achieved by processing the data in a specific pattern, which is advantageous for large-scale machine learning problems. These theoretical results are illustrated for various data streams, showing the effectiveness of the proposed algorithms.  ( 2 min )
    A Minimax Lower Bound for Low-Rank Matrix-Variate Logistic Regression. (arXiv:2105.14673v2 [cs.LG] UPDATED)
    This paper considers the problem of matrix-variate logistic regression. It derives the fundamental error threshold on estimating low-rank coefficient matrices in the logistic regression problem by obtaining a lower bound on the minimax risk. The bound depends explicitly on the dimension and distribution of the covariates, the rank and energy of the coefficient matrix, and the number of samples. The resulting bound is proportional to the intrinsic degrees of freedom in the problem, which suggests the sample complexity of the low-rank matrix logistic regression problem can be lower than that for vectorized logistic regression. The proof techniques utilized in this work also set the stage for development of minimax lower bounds for tensor-variate logistic regression problems.  ( 2 min )
    The Dual PC Algorithm for Structure Learning. (arXiv:2112.09036v2 [stat.ML] UPDATED)
    While learning the graphical structure of Bayesian networks from observational data is key to describing and helping understand data generating processes in complex applications, the task poses considerable challenges due to its computational complexity. The directed acyclic graph (DAG) representing a Bayesian network model is generally not identifiable from observational data, and a variety of methods exist to estimate its equivalence class instead. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by testing for conditional independence (CI), starting from marginal independence relationships and progressively expanding the conditioning set. Here, we propose the dual PC algorithm, a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. Notably, the elements of the precision matrix coincide with partial correlations for Gaussian data. Our algorithm then exploits block matrix inversions on the covariance and precision matrices to simultaneously perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm, therefore, proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies indicate that the dual PC algorithm outperforms the classical PC algorithm both in terms of run time and in recovering the underlying network structure.  ( 2 min )
    Towards the Generalization of Contrastive Self-Supervised Learning. (arXiv:2111.00743v2 [cs.LG] UPDATED)
    Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for training. Contrastive learning is a popular approach for self-supervised learning and achieves promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound of the downstream classification error based on the measure. We show that the generalization ability of contrastive self-supervised learning depends on three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors can be optimized by contrastive algorithms, while the third one is priorly determined by pre-defined data augmentation. With the above theoretical findings, we further study two canonical contrastive losses, InfoNCE and cross-correlation loss, and prove that both of them are able to obtain the embedding space satisfying the aforementioned factors. Finally, we conduct various experiments on the real-world dataset, and show that our theoretical inferences on the relationship between the data augmentation and the generalization of contrastive self-supervised learning agree with the empirical observations.  ( 2 min )
    RelaySum for Decentralized Deep Learning on Heterogeneous Data. (arXiv:2110.04175v2 [cs.LG] UPDATED)
    In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum. We prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data. Our code is available at this http URL  ( 2 min )
    Revisiting Contrastive Learning through the Lens of Neighborhood Component Analysis: an Integrated Framework. (arXiv:2112.04468v2 [cs.LG] UPDATED)
    As a seminal tool in self-supervised representation learning, contrastive learning has gained unprecedented attention in recent years. In essence, contrastive learning aims to leverage pairs of positive and negative samples for representation learning, which relates to exploiting neighborhood information in a feature space. By investigating the connection between contrastive learning and neighborhood component analysis (NCA), we provide a novel stochastic nearest neighbor viewpoint of contrastive learning and subsequently propose a series of contrastive losses that outperform the existing ones. Under our proposed framework, we show a new methodology to design integrated contrastive losses that could simultaneously achieve good accuracy and robustness on downstream tasks. With the integrated framework, we achieve up to 6\% improvement on the standard accuracy and 17\% improvement on the robust accuracy.  ( 2 min )
    Continual self-training with bootstrapped remixing for speech enhancement. (arXiv:2110.10103v2 [cs.SD] UPDATED)
    We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuously self-training scheme that overcomes limitations from previous studies including assumptions for the in-domain noise distribution and having access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures. Next, we bootstrap the mixing process by generating artificial mixtures using permuted estimated clean and noise signals. Finally, the student model is trained using the permuted estimated sources as targets while we periodically update teacher's weights using the latest student model. Our experiments show that RemixIT outperforms several previous state-of-the-art self-supervised methods under multiple speech enhancement tasks. Additionally, RemixIT provides a seamless alternative for semi-supervised and unsupervised domain adaptation for speech enhancement tasks, while being general enough to be applied to any separation task and paired with any separation model.  ( 2 min )
    Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain. (arXiv:2110.04791v2 [eess.AS] UPDATED)
    The crux of single-channel speech separation is how to encode the mixture of signals into such a latent embedding space that the signals from different speakers can be precisely separated. Existing methods for speech separation either transform the speech signals into frequency domain to perform separation or seek to learn a separable embedding space by constructing a latent domain based on convolutional filters. While the latter type of methods learning an embedding space achieves substantial improvement for speech separation, we argue that the embedding space defined by only one latent domain does not suffice to provide a thoroughly separable encoding space for speech separation. In this paper, we propose the Stepwise-Refining Speech Separation Network (SRSSN), which follows a coarse-to-fine separation framework. It first learns a 1-order latent domain to define an encoding space and thereby performs a rough separation in the coarse phase. Then the proposed SRSSN learns a new latent domain along each basis function of the existing latent domain to obtain a high-order latent domain in the refining phase, which enables our model to perform a refining separation to achieve a more precise speech separation. We demonstrate the effectiveness of our SRSSN by conducting extensive experiments, including speech separation in a clean (noise-free) setting on WSJ0-2/3mix datasets as well as in noisy/reverberant settings on WHAM!/WHAMR! datasets. Furthermore, we also perform experiments of speech recognition on separated speech signals by our model to evaluate the performance of speech separation indirectly.  ( 2 min )
    Preconditioning for Scalable Gaussian Process Hyperparameter Optimization. (arXiv:2107.00243v2 [cs.LG] UPDATED)
    Gaussian process hyperparameter optimization requires linear solves with, and log-determinants of, large kernel matrices. Iterative numerical techniques are becoming popular to scale to larger datasets, relying on the conjugate gradient method (CG) for the linear solves and stochastic trace estimation for the log-determinant. This work introduces new algorithmic and theoretical insights for preconditioning these computations. While preconditioning is well understood in the context of CG, we demonstrate that it can also accelerate convergence and reduce variance of the estimates for the log-determinant and its derivative. We prove general probabilistic error bounds for the preconditioned computation of the log-determinant, log-marginal likelihood and its derivatives. Additionally, we derive specific rates for a range of kernel-preconditioner combinations, showing that up to exponential convergence can be achieved. Our theoretical results enable provably efficient optimization of kernel hyperparameters, which we validate empirically on large-scale benchmark problems. There our approach accelerates training by up to an order of magnitude.  ( 2 min )
    Automating Control of Overestimation Bias for Reinforcement Learning. (arXiv:2110.13523v2 [cs.LG] UPDATED)
    Overestimation bias control techniques are used by the majority of high-performing off-policy reinforcement learning algorithms. However, most of these techniques rely on pre-defined bias correction policies that are either not flexible enough or require environment-specific tuning of hyperparameters. In this work, we present a general data-driven approach for the automatic selection of bias control hyperparameters. We demonstrate its effectiveness on three algorithms: Truncated Quantile Critics, Weighted Delayed DDPG, and Maxmin Q-learning. The proposed technique eliminates the need for an extensive hyperparameter search. We show that it leads to a significant reduction of the actual number of interactions while preserving the performance.  ( 2 min )
    On Improving Model-Free Algorithms for Decentralized Multi-Agent Reinforcement Learning. (arXiv:2110.05707v2 [cs.LG] UPDATED)
    Multi-agent reinforcement learning (MARL) algorithms often suffer from an exponential sample complexity dependence on the number of agents, a phenomenon known as \emph{the curse of multiagents}. In this paper, we address this challenge by investigating sample-efficient model-free algorithms in \emph{decentralized} MARL, and aim to improve existing algorithms along this line. For learning (coarse) correlated equilibria in general-sum Markov games, we propose \emph{stage-based} V-learning algorithms that significantly simplify the algorithmic design and analysis of recent works, and circumvent a rather complicated no-\emph{weighted}-regret bandit subroutine. For learning Nash equilibria in Markov potential games, we propose an independent policy gradient algorithm with a decentralized momentum-based variance reduction technique. All our algorithms are decentralized in that each agent can make decisions based on only its local information. Neither communication nor centralized coordination is required during learning, leading to a natural generalization to a large number of agents. We also provide numerical simulations to corroborate our theoretical findings.  ( 2 min )
    Quantifying the Computational Capability of a Nanomagnetic Reservoir Computing Platform with Emergent Magnetization Dynamics. (arXiv:2111.14603v2 [cond-mat.mes-hall] UPDATED)
    Devices based on arrays of interconnected magnetic nano-rings with emergent magnetization dynamics have recently been proposed for use in reservoir computing applications, but for them to be computationally useful it must be possible to optimise their dynamical responses. Here, we use a phenomenological model to demonstrate that such reservoirs can be optimised for classification tasks by tuning hyperparameters that control the scaling and input rate of data into the system using rotating magnetic fields. We use task-independent metrics to assess the rings' computational capabilities at each set of these hyperparameters and show how these metrics correlate directly to performance in spoken and written digit recognition tasks. We then show that these metrics, and performance in tasks, can be further improved by expanding the reservoir's output to include multiple, concurrent measures of the ring arrays magnetic states.  ( 2 min )
    Mitigating Adversarial Attacks by Distributing Different Copies to Different Users. (arXiv:2111.15160v2 [cs.CR] UPDATED)
    Machine learning models are vulnerable to adversarial attacks. In this paper, we consider the scenario where a model is to be distributed to many users, among which a malicious user attempts to attack another user. The malicious user probes its unique copy of the model to search for adversarial samples, presenting found samples to the victim's model in order to replicate the attack. By distributing different copies of the model to different users, we can mitigate such attacks wherein adversarial samples found on one copy would not work on another copy. We propose a flexible parameter rewriting method that directly modifies the model's parameters. This method does not require training and is able to generate a large number of copies, where each copy induces different sets of adversarial samples. Experimentation studies show that our approach can significantly mitigate the attacks while retaining high accuracy.  ( 2 min )
    Reinforcement Learning in Reward-Mixing MDPs. (arXiv:2110.03743v2 [cs.LG] UPDATED)
    Learning a near optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. In this work, we consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP). There, a reward function is drawn from one of multiple possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to the agent. We study the problem of learning a near optimal policy for two reward-mixing MDPs. Unlike existing approaches that rely on strong assumptions on the dynamics, we make no assumptions and study the problem in full generality. Indeed, with no further assumptions, even for two switching reward-models, the problem requires several new ideas beyond existing algorithmic and analysis techniques for efficient exploration. We provide the first polynomial-time algorithm that finds an $\epsilon$-optimal policy after exploring $\tilde{O}(poly(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is time-horizon and $S, A$ are the number of states and actions respectively. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space.  ( 2 min )
    Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity. (arXiv:2106.15933v2 [stat.ML] UPDATED)
    The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $\sigma^2$ of the parameters at initialization $\theta_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $\gamma$ of the variance $\sigma^2=w^{-\gamma}$ as $w\to\infty$: for large variance ($\gamma1$), $\theta_0$ is close to a saddle point and far from any global minimum. While the first case corresponds to the well-studied NTK regime, the second case is less understood. This motivates the study of the case $\gamma \to +\infty$, where we conjecture a Saddle-to-Saddle dynamics: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum. We support this conjecture with a theorem for the dynamics between the first two saddles, as well as some numerical experiments.  ( 2 min )
    ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning. (arXiv:2111.10952v2 [cs.CL] UPDATED)
    Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task-families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency while pre-training.  ( 2 min )
    Networked Restless Multi-Armed Bandits for Mobile Interventions. (arXiv:2201.12408v1 [cs.LG])
    Motivated by a broad class of mobile intervention problems, we propose and study restless multi-armed bandits (RMABs) with network effects. In our model, arms are partially recharging and connected through a graph, so that pulling one arm also improves the state of neighboring arms, significantly extending the previously studied setting of fully recharging bandits with no network effects. In mobile interventions, network effects may arise due to regular population movements (such as commuting between home and work). We show that network effects in RMABs induce strong reward coupling that is not accounted for by existing solution methods. We propose a new solution approach for networked RMABs, exploiting concavity properties which arise under natural assumptions on the structure of intervention effects. We provide sufficient conditions for optimality of our approach in idealized settings and demonstrate that it empirically outperforms state-of-the art baselines in three mobile intervention domains using real-world graphs.  ( 2 min )
    FastFlows: Flow-Based Models for Molecular Graph Generation. (arXiv:2201.12419v1 [physics.chem-ph])
    We propose a framework using normalizing-flow based models, SELF-Referencing Embedded Strings, and multi-objective optimization that efficiently generates small molecules. With an initial training set of only 100 small molecules, FastFlows generates thousands of chemically valid molecules in seconds. Because of the efficient sampling, substructure filters can be applied as desired to eliminate compounds with unreasonable moieties. Using easily computable and learned metrics for druglikeness, synthetic accessibility, and synthetic complexity, we perform a multi-objective optimization to demonstrate how FastFlows functions in a high-throughput virtual screening context. Our model is significantly simpler and easier to train than autoregressive molecular generative models, and enables fast generation and identification of druglike, synthesizable molecules.  ( 2 min )
    SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models. (arXiv:2106.00553v3 [cs.LG] UPDATED)
    In recent years, implicit deep learning has emerged as a method to increase the effective depth of deep neural networks. While their training is memory-efficient, they are still significantly slower to train than their explicit counterparts. In Deep Equilibrium Models (DEQs), the training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix. In this paper, we propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer. The main idea is to use the quasi-Newton matrices from the forward pass to efficiently approximate the inverse Jacobian matrix in the direction needed for the gradient computation. We provide a theorem that motivates using our method with the original forward algorithms. In addition, by modifying these forward algorithms, we further provide theoretical guarantees that our method asymptotically estimates the true implicit gradient. We empirically study this approach and the recent Jacobian-Free method in different settings, ranging from hyperparameter optimization to large Multiscale DEQs (MDEQs) applied to CIFAR and ImageNet. Both methods reduce significantly the computational cost of the backward pass. While SHINE has a clear advantage on hyperparameter optimization problems, both methods attain similar computational performances for larger scale problems such as MDEQs at the cost of a limited performance drop compared to the original models.  ( 2 min )
    LBCF: A Large-Scale Budget-Constrained Causal Forest Algorithm. (arXiv:2201.12585v1 [cs.LG])
    Offering incentives (e.g., coupons at Amazon, discounts at Uber and video bonuses at Tiktok) to user is a common strategy used by online platforms to increase user engagement and platform revenue. Despite its proven effectiveness, these marketing incentives incur an inevitable cost and might result in a low ROI (Return on Investment) if not used properly. On the other hand, different users respond differently to these incentives, for instance, some users never buy certain products without coupons, while others do anyway. Thus, how to select the right amount of incentives (i.e. treatment) to each user under budget constraints is an important research problem with great practical implications. In this paper, we call such problem as a budget-constrained treatment selection (BTS) problem. The challenge is how to efficiently solve BTS problem on a Large-Scale dataset and achieve improved results over the existing techniques. We propose a novel tree-based treatment selection technique under budget constraints, called Large-Scale Budget-Constrained Causal Forest (LBCF) algorithm, which is also an efficient treatment selection algorithm suitable for modern distributed computing systems. A novel offline evaluation method is also proposed to overcome an intrinsic challenge in assessing solutions' performance for BTS problem in randomized control trials (RCT) data. We deploy our approach in a real-world scenario on a large-scale video platform, where the platform gives away bonuses in order to increase users' campaign engagement duration. The simulation analysis, offline and online experiments all show that our method outperforms various tree-based state-of-the-art baselines. The proposed approach is currently serving over hundreds of millions of users on the platform and achieves one of the most tremendous improvements over these months.  ( 2 min )
    Inverse-Weighted Survival Games. (arXiv:2111.08175v2 [cs.LG] UPDATED)
    Deep models trained through maximum likelihood have achieved state-of-the-art results for survival analysis. Despite this training scheme, practitioners evaluate models under other criteria, such as binary classification losses at a chosen set of time horizons, e.g. Brier score (BS) and Bernoulli log likelihood (BLL). Models trained with maximum likelihood may have poor BS or BLL since maximum likelihood does not directly optimize these criteria. Directly optimizing criteria like BS requires inverse-weighting by the censoring distribution. However, estimating the censoring model under these metrics requires inverse-weighting by the failure distribution. The objective for each model requires the other, but neither are known. To resolve this dilemma, we introduce Inverse-Weighted Survival Games. In these games, objectives for each model are built from re-weighted estimates featuring the other model, where the latter is held fixed during training. When the loss is proper, we show that the games always have the true failure and censoring distributions as a stationary point. This means models in the game do not leave the correct distributions once reached. We construct one case where this stationary point is unique. We show that these games optimize BS on simulations and then apply these principles on real world cancer and critically-ill patient data.  ( 2 min )
  • Open

    Vector Optimization with Stochastic Bandit Feedback. (arXiv:2110.12311v2 [cs.LG] UPDATED)
    We introduce vector optimization problems with stochastic bandit feedback, which extends the best arm identification problem to vector-valued rewards. We consider $K$ designs, with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by $C$. Different than prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study the setting where an evaluation of each design yields a noisy observation of the mean reward vector. Under subgaussian noise assumption, we investigate the sample complexity of the na\"ive elimination algorithm in an ($\epsilon,\delta$)-PAC setting, where the goal is to identify an ($\epsilon,\delta$)-PAC Pareto set with the minimum number of design evaluations. In order to characterize the difficulty of learning the Pareto set, we introduce the concept of ordering complexity, i.e., geometric conditions on the deviations of empirical reward vectors from their mean under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We run experiments to verify our theoretical results and illustrate how $C$ and sampling budget affect the Pareto set, returned ($\epsilon,\delta$)-PAC Pareto set and the success of identification.
    Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Streaming Data. (arXiv:2109.07117v2 [cs.LG] UPDATED)
    Motivated by the high-frequency data streams continuously generated, real-time learning is becoming increasingly important. These data streams should be processed sequentially with the property that the data stream may change over time. In this streaming setting, we propose techniques for minimizing convex objectives through unbiased estimates of their gradients, commonly referred to as stochastic approximation problems. Our methods rely on stochastic approximation algorithms because of their applicability and computational advantages. The reasoning includes iterate averaging that guarantees optimal statistical efficiency under classical conditions. Our non-asymptotic analysis shows accelerated convergence by selecting the learning rate according to the expected data streams. We show that the average estimate converges optimally and robustly for any data stream rate. In addition, noise reduction can be achieved by processing the data in a specific pattern, which is advantageous for large-scale machine learning problems. These theoretical results are illustrated for various data streams, showing the effectiveness of the proposed algorithms.
    Towards the Generalization of Contrastive Self-Supervised Learning. (arXiv:2111.00743v2 [cs.LG] UPDATED)
    Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for training. Contrastive learning is a popular approach for self-supervised learning and achieves promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(\sigma,\delta)$-measure to mathematically quantify the data augmentation, and then provide an upper bound of the downstream classification error based on the measure. We show that the generalization ability of contrastive self-supervised learning depends on three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors can be optimized by contrastive algorithms, while the third one is priorly determined by pre-defined data augmentation. With the above theoretical findings, we further study two canonical contrastive losses, InfoNCE and cross-correlation loss, and prove that both of them are able to obtain the embedding space satisfying the aforementioned factors. Finally, we conduct various experiments on the real-world dataset, and show that our theoretical inferences on the relationship between the data augmentation and the generalization of contrastive self-supervised learning agree with the empirical observations.
    Inverse-Weighted Survival Games. (arXiv:2111.08175v2 [cs.LG] UPDATED)
    Deep models trained through maximum likelihood have achieved state-of-the-art results for survival analysis. Despite this training scheme, practitioners evaluate models under other criteria, such as binary classification losses at a chosen set of time horizons, e.g. Brier score (BS) and Bernoulli log likelihood (BLL). Models trained with maximum likelihood may have poor BS or BLL since maximum likelihood does not directly optimize these criteria. Directly optimizing criteria like BS requires inverse-weighting by the censoring distribution. However, estimating the censoring model under these metrics requires inverse-weighting by the failure distribution. The objective for each model requires the other, but neither are known. To resolve this dilemma, we introduce Inverse-Weighted Survival Games. In these games, objectives for each model are built from re-weighted estimates featuring the other model, where the latter is held fixed during training. When the loss is proper, we show that the games always have the true failure and censoring distributions as a stationary point. This means models in the game do not leave the correct distributions once reached. We construct one case where this stationary point is unique. We show that these games optimize BS on simulations and then apply these principles on real world cancer and critically-ill patient data.
    Differentially private stochastic expectation propagation (DP-SEP). (arXiv:2111.13219v3 [cs.LG] UPDATED)
    We are interested in privatizing an approximate posterior inference algorithm called Expectation Propagation (EP). EP approximates the posterior by iteratively refining approximations to the local likelihoods, and is known to provide better posterior uncertainties than those by variational inference (VI). However, EP needs a large memory to maintain all local approximates associated with each datapoint in the training data. To overcome this challenge, stochastic expectation propagation (SEP) considers a single unique local factor that captures the average effect of each likelihood term to the posterior and refines it in a way analogous to EP. In terms of privacy, SEP is more tractable than EP because at each refining step of a factor, the remaining factors are fixed and do not depend on other datapoints as in EP, which makes the sensitivity analysis straightforward. We provide a theoretical analysis of the privacy-accuracy trade-off in the posterior estimates under our method, called differentially private stochastic expectation propagation (DP-SEP). Furthermore, we demonstrate the performance of our DP-SEP algorithm evaluated on both synthetic and real-world datasets in terms of the quality of posterior estimates at different levels of guaranteed privacy.
    Almost Optimal Universal Lower Bound for Learning Causal DAGs with Atomic Interventions. (arXiv:2111.05070v3 [cs.LG] UPDATED)
    A well-studied challenge that arises in the structure learning problem of causal directed acyclic graphs (DAG) is that using observational data, one can only learn the graph up to a "Markov equivalence class" (MEC). The remaining undirected edges have to be oriented using interventions, which can be very expensive to perform in applications. Thus, the problem of minimizing the number of interventions needed to fully orient the MEC has received a lot of recent attention, and is also the focus of this work. We prove two main results. The first is a new universal lower bound on the number of atomic interventions that any algorithm (whether active or passive) would need to perform in order to orient a given MEC. Our second result shows that this bound is, in fact, within a factor of two of the size of the smallest set of atomic interventions that can orient the MEC. Our lower bound is provably better than previously known lower bounds. The proof of our lower bound is based on the new notion of clique-block shared-parents (CBSP) orderings, which are topological orderings of DAGs without v-structures and satisfy certain special properties. Further, using simulations on synthetic graphs and by giving examples of special graph families, we show that our bound is often significantly better.
    Selective Regression Under Fairness Criteria. (arXiv:2110.15403v2 [cs.LG] UPDATED)
    Selective regression allows abstention from prediction if the confidence to make an accurate prediction is not sufficient. In general, by allowing a reject option, one expects the performance of a regression model to increase at the cost of reducing coverage (i.e., by predicting on fewer samples). However, as we show, in some cases, the performance of a minority subgroup can decrease while we reduce the coverage, and thus selective regression can magnify disparities between different sensitive subgroups. Motivated by these disparities, we propose new fairness criteria for selective regression requiring the performance of every subgroup to improve with a decrease in coverage. We prove that if a feature representation satisfies the sufficiency criterion or is calibrated for mean and variance, than the proposed fairness criteria is met. Further, we introduce two approaches to mitigate the performance disparity across subgroups: (a) by regularizing an upper bound of conditional mutual information under a Gaussian assumption and (b) by regularizing a contrastive loss for conditional mean and conditional variance prediction. The effectiveness of these approaches is demonstrated on synthetic and real-world datasets.
    Deep learning via message passing algorithms based on belief propagation. (arXiv:2110.14583v2 [cs.LG] UPDATED)
    Message-passing algorithms based on the Belief Propagation (BP) equations constitute a well-known distributed computational scheme. It is exact on tree-like graphical models and has also proven to be effective in many problems defined on graphs with loops (from inference to optimization, from signal processing to clustering). The BP-based scheme is fundamentally different from stochastic gradient descent (SGD), on which the current success of deep networks is based. In this paper, we present and adapt to mini-batch training on GPUs a family of BP-based message-passing algorithms with a reinforcement field that biases distributions towards locally entropic solutions. These algorithms are capable of training multi-layer neural networks with discrete weights and activations with performance comparable to SGD-inspired heuristics (BinaryNet) and are naturally well-adapted to continual learning. Furthermore, using these algorithms to estimate the marginals of the weights allows us to make approximate Bayesian predictions that have higher accuracy than point-wise solutions.
    Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. (arXiv:2111.03289v2 [stat.ML] UPDATED)
    In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(d^{1.5}\sqrt{\sum_{k}^K \sigma_k^2} + d^2)$ where $d$ is the dimension of the features, $K$ is the time horizon, and $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$ where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel elliptical potential `count' lemma. This lemma allows a novel regret analysis in conjunction with the peeling trick, which is of independent interest.
    Measuring Complexity of Learning Schemes Using Hessian-Schatten Total Variation. (arXiv:2112.06209v2 [cs.LG] UPDATED)
    In this paper, we introduce the Hessian-Schatten total variation (HTV) -- a novel seminorm that quantifies the total "rugosity" of multivariate functions. Our motivation for defining HTV is to assess the complexity of supervised-learning schemes. We start by specifying the adequate matrix-valued Banach spaces that are equipped with suitable classes of mixed norms. We then show that the HTV is invariant to rotations, scalings, and translations. Additionally, its minimum value is achieved for linear mappings, which supports the common intuition that linear regression is the least complex learning model. Next, we present closed-form expressions of the HTV for two general classes of functions. The first one is the class of Sobolev functions with a certain degree of regularity, for which we show that the HTV coincides with the Hessian-Schatten seminorm that is sometimes used as a regularizer for image reconstruction. The second one is the class of continuous and piecewise-linear (CPWL) functions. In this case, we show that the HTV reflects the total change in slopes between linear regions that have a common facet. Hence, it can be viewed as a convex relaxation (l1-type) of the number of linear regions (l0-type) of CPWL mappings. Finally, we illustrate the use of our proposed seminorm.
    Building Efficient CNNs Using Depthwise Convolutional Eigen-Filters (DeCEF). (arXiv:1910.09359v3 [cs.LG] UPDATED)
    Deep Convolutional Neural Networks (CNNs) have been widely used in various domains due to their impressive capabilities. These models are typically composed of a large number of 2D convolutional (Conv2D) layers with numerous trainable parameters. To reduce the complexity of a network, compression techniques can be applied. These methods typically rely on the analysis of trained deep learning models. However, in some applications, due to reasons such as particular data or system specifications and licensing restrictions, a pre-trained network may not be available. This would require the user to train a CNN from scratch. In this paper, we aim to find an alternative parameterization to Conv2D filters without relying on a pre-trained convolutional network. During the analysis, we observe that the effective rank of the vectorized Conv2D filters decreases with respect to the increasing depth in the network, which then leads to the implementation of the Depthwise Convolutional Eigen-Filter (DeCEF) layer. Essentially, a DeCEF layer is a low rank version of the Conv2D layer with significantly fewer trainable parameters and floating point operations (FLOPs). The way we define the effective rank is different from the previous work and it is easy to implement in any deep learning frameworks. To evaluate the effectiveness of DeCEF, experiments are conducted on the benchmark datasets CIFAR-10 and ImageNet using various network architectures. The results have shown a similar or higher accuracy and robustness using about 2/3 of the original parameters and reducing the number of FLOPs to 2/3 of the base network, which is then compared to the state-of-the-art techniques.
    On the Convergence and Calibration of Deep Learning with Differential Privacy. (arXiv:2106.07830v4 [cs.LG] UPDATED)
    In deep learning with differential privacy (DP), the neural network achieves the privacy usually at the cost of slower convergence (and thus lower performance) than its non-private counterpart. This work gives the first convergence analysis of the DP deep learning, through the lens of training dynamics and the neural tangent kernel (NTK). Our convergence theory successfully characterizes the effects of two key components in the DP training: the per-sample clipping and the noise addition. Our analysis not only initiates a general principled framework to understand the DP deep learning with any network architecture and loss function, but also motivates a new clipping method -- the global clipping, that significantly improves the convergence, as well as preserves the same DP guarantee and computational efficiency as the existing method, which we term as local clipping. Theoretically speaking, we precisely characterize the effect of per-sample clipping on the NTK matrix and show that the noise level of DP optimizers does not affect the convergence in the gradient flow regime. In particular, the local clipping almost certainly breaks the positive semi-definiteness of NTK, which can be preserved by our global clipping. Consequently, DP gradient descent (GD) with global clipping converge monotonically to zero loss, which is often violated by the existing DP-GD. Notably, our analysis framework easily extends to other optimizers, e.g., DP-Adam. We demonstrate through numerous experiments that DP optimizers equipped with global clipping perform strongly on classification and regression tasks. In addition, our global clipping is surprisingly effective at learning calibrated classifiers, in contrast to the existing DP classifiers which are oftentimes over-confident and unreliable. Implementation-wise, the new clipping can be realized by inserting one line of code into the Pytorch Opacus library.
    Provable tradeoffs in adversarially robust classification. (arXiv:2006.05161v5 [cs.LG] UPDATED)
    It is well known that machine learning methods can be vulnerable to adversarially-chosen perturbations of their inputs. Despite significant progress in the area, foundational open problems remain. In this paper, we address several key questions. We derive exact and approximate Bayes-optimal robust classifiers for the important setting of two- and three-class Gaussian classification problems with arbitrary imbalance, for $\ell_2$ and $\ell_\infty$ adversaries. In contrast to classical Bayes-optimal classifiers, determining the optimal decisions here cannot be made pointwise and new theoretical approaches are needed. We develop and leverage new tools, including recent breakthroughs from probability theory on robust isoperimetry, which, to our knowledge, have not yet been used in the area. Our results reveal fundamental tradeoffs between standard and robust accuracy that grow when data is imbalanced. We also show further results, including an analysis of classification calibration for convex losses in certain models, and finite sample rates for the robust risk.
    LSB: Local Self-Balancing MCMC in Discrete Spaces. (arXiv:2109.03867v2 [cs.AI] UPDATED)
    We present the Local Self-Balancing sampler (LSB), a local Markov Chain Monte Carlo (MCMC) method for sampling in purely discrete domains, which is able to autonomously adapt to the target distribution and to reduce the number of target evaluations required to converge. LSB is based on (i) a parametrization of locally balanced proposals, (ii) an objective function based on mutual information and (iii) a self-balancing learning procedure, which minimises the proposed objective to update the proposal parameters. Experiments on energy-based models and Bayesian networks show that LSB converges using a smaller number of queries to the oracle distribution compared to recent local MCMC samplers.
    Differentially Private Variable Selection via the Knockoff Filter. (arXiv:2109.05402v3 [stat.ML] UPDATED)
    The knockoff filter, recently developed by Barber and Candes, is an effective procedure to perform variable selection with a controlled false discovery rate (FDR). We propose a private version of the knockoff filter by incorporating Gaussian and Laplace mechanisms, and show that variable selection with controlled FDR can be achieved. Simulations demonstrate that our setting has reasonable statistical power.
    Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent. (arXiv:2110.11442v2 [math.OC] UPDATED)
    We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\kappa$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $\tilde{O} \left(\exp \left( \frac{-T}{\kappa} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowing $\sigma^2$. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower-bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer. However, its rate is slowed down proportional to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $\tilde{O} \left(\exp \left( \frac{-T}{\sqrt{\kappa}} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowledge of $\sigma^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. We empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
    Hierarchical Few-Shot Generative Models. (arXiv:2110.12279v2 [cs.LG] UPDATED)
    A few-shot generative model should be able to generate data from a distribution by only observing a limited set of examples. In few-shot learning the model is trained on data from many sets from different distributions sharing some underlying properties such as sets of characters from different alphabets or sets of images of different type objects. We extend current latent variable models for sets to a fully hierarchical approach with an attention-based point to set-level aggregation and call our approach SCHA-VAE for Set-Context-Hierarchical-Aggregation Variational Autoencoder. We explore iterative data sampling, likelihood-based model comparison, and adaptation-free out of distribution generalization. Our results show that the hierarchical formulation better captures the intrinsic variability within the sets in the small data regime. With this work we generalize deep latent variable approaches to few-shot learning, taking a step towards large-scale few-shot generation with a formulation that readily can work with current state-of-the-art deep generative models.
    Quantifying Epistemic Uncertainty in Deep Learning. (arXiv:2110.12122v2 [cs.LG] UPDATED)
    Uncertainty quantification is at the core of the reliability and robustness of machine learning. In this paper, we provide a theoretical framework to dissect the uncertainty, especially the epistemic component, in deep learning into procedural variability (from the training procedure) and data variability (from the training data), which is the first such attempt in the literature to our best knowledge. We then propose two approaches to estimate these uncertainties, one based on influence function and one on batching. We demonstrate how our approaches overcome the computational difficulties in applying classical statistical methods. Experimental evaluations on multiple problem settings corroborate our theory and illustrate how our framework and estimation can provide direct guidance on modeling and data collection effort to improve deep learning performance.
    Nys-Newton: Nystr\"om-Approximated Curvature for Stochastic Optimization. (arXiv:2110.08577v2 [math.OC] UPDATED)
    Second-order optimization methods are among the most widely used optimization approaches for convex optimization problems, and have recently been used to optimize non-convex optimization problems such as deep learning models. The widely used second-order optimization methods such as quasi-Newton methods generally provide curvature information by approximating the Hessian using the secant equation. However, the secant equation becomes insipid in approximating the Newton step owing to its use of the first-order derivatives. In this study, we propose an approximate Newton sketch-based stochastic optimization algorithm for large-scale empirical risk minimization. Specifically, we compute a partial column Hessian of size ($d\times m$) with $m\ll d$ randomly selected variables, then use the \emph{Nystr\"om method} to better approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step ($\Delta\boldsymbol{w}$) without computing and storing the full Hessian or its inverse. We then integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradient methods. The results of numerical experiments on both convex and non-convex functions show that the proposed approach was able to obtain a better approximation of Newton\textquotesingle s method, exhibiting performance competitive with that of state-of-the-art first-order and stochastic quasi-Newton methods. Furthermore, we provide a theoretical convergence analysis for convex functions.
    Deep Generative Models in Engineering Design: A Review. (arXiv:2110.10863v2 [cs.LG] UPDATED)
    Automated design synthesis has the potential to revolutionize the modern engineering design process and improve access to highly optimized and customized products across countless industries. Successfully adapting generative Machine Learning to design engineering may enable such automated design synthesis and is a research subject of great importance. We present a review and analysis of Deep Generative Machine Learning models in engineering design. Deep Generative Models (DGMs) typically leverage deep networks to learn from an input dataset and synthesize new designs. Recently, DGMs such as feedforward Neural Networks (NNs), Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and certain Deep Reinforcement Learning (DRL) frameworks have shown promising results in design applications like structural optimization, materials design, and shape synthesis. The prevalence of DGMs in engineering design has skyrocketed since 2016. Anticipating continued growth, we conduct a review of recent advances to benefit researchers interested in DGMs for design. We structure our review as an exposition of the algorithms, datasets, representation methods, and applications commonly used in the current literature. In particular, we discuss key works that have introduced new techniques and methods in DGMs, successfully applied DGMs to a design-related domain, or directly supported the development of DGMs through datasets or auxiliary methods. We further identify key challenges and limitations currently seen in DGMs across design fields, such as design creativity, handling constraints and objectives, and modeling both form and functional performance simultaneously. In our discussion, we identify possible solution pathways as key areas on which to target future work.
    A Leap among Quantum Computing and Quantum Neural Networks: A Survey. (arXiv:2107.03313v2 [quant-ph] UPDATED)
    In recent years, Quantum Computing witnessed massive improvements in terms of available resources and algorithms development. The ability to harness quantum phenomena to solve computational problems is a long-standing dream that has drawn the scientific community's interest since the late 80s. In such a context, we propose our contribution. First, we introduce basic concepts related to quantum computations, and then we explain the core functionalities of technologies that implement the Gate Model and Adiabatic Quantum Computing paradigms. Finally, we gather, compare and analyze the current state-of-the-art concerning Quantum Perceptrons and Quantum Neural Networks implementations.
    Learning more skills through optimistic exploration. (arXiv:2107.14226v2 [cs.LG] UPDATED)
    Unsupervised skill learning objectives (Gregor et al., 2016, Eysenbach et al., 2018) allow agents to learn rich repertoires of behavior in the absence of extrinsic rewards. They work by simultaneously training a policy to produce distinguishable latent-conditioned trajectories, and a discriminator to evaluate distinguishability by trying to infer latents from trajectories. The hope is for the agent to explore and master the environment by encouraging each skill (latent) to reliably reach different states. However, an inherent exploration problem lingers: when a novel state is actually encountered, the discriminator will necessarily not have seen enough training data to produce accurate and confident skill classifications, leading to low intrinsic reward for the agent and effective penalization of the sort of exploration needed to actually maximize the objective. To combat this inherent pessimism towards exploration, we derive an information gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement. Our objective directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples, thus providing an intrinsic reward more tailored to the true objective compared to pseudocount-based methods (Burda et al., 2019). We call this exploration bonus discriminator disagreement intrinsic reward, or DISDAIN. We demonstrate empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and the 57 games of the Atari Suite (from pixels). Thus, we encourage researchers to treat pessimism with DISDAIN.
    Hybrid Random Features. (arXiv:2110.04367v3 [cs.LG] UPDATED)
    We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021). By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF-mechanism provides strong theoretical guarantees - unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct exhaustive empirical evaluation of HRF ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating its quality in a wide spectrum of machine learning problems.
    Automating Control of Overestimation Bias for Reinforcement Learning. (arXiv:2110.13523v2 [cs.LG] UPDATED)
    Overestimation bias control techniques are used by the majority of high-performing off-policy reinforcement learning algorithms. However, most of these techniques rely on pre-defined bias correction policies that are either not flexible enough or require environment-specific tuning of hyperparameters. In this work, we present a general data-driven approach for the automatic selection of bias control hyperparameters. We demonstrate its effectiveness on three algorithms: Truncated Quantile Critics, Weighted Delayed DDPG, and Maxmin Q-learning. The proposed technique eliminates the need for an extensive hyperparameter search. We show that it leads to a significant reduction of the actual number of interactions while preserving the performance.
    Understanding approximate and unrolled dictionary learning for pattern recovery. (arXiv:2106.06338v2 [cs.LG] UPDATED)
    Dictionary learning consists of finding a sparse representation from noisy data and is a common way to encode data-driven prior knowledge on signals. Alternating minimization (AM) is standard for the underlying optimization, where gradient descent steps alternate with sparse coding procedures. The major drawback of this method is its prohibitive computational cost, making it unpractical on large real-world data sets. This work studies an approximate formulation of dictionary learning based on unrolling and compares it to alternating minimization to find the best trade-off between speed and precision. We analyze the asymptotic behavior and convergence rate of gradients estimates in both methods. We show that unrolling performs better on the support of the inner problem solution and during the first iterations. Finally, we apply unrolling on pattern learning in magnetoencephalography (MEG) with the help of a stochastic algorithm and compare the performance to a state-of-the-art method.
    Weisfeiler and Lehman Go Cellular: CW Networks. (arXiv:2106.12575v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are limited in their expressive power, struggle with long-range interactions and lack a principled way to model higher-order structures. These problems can be attributed to the strong coupling between the computational graph and the input graph structure. The recently proposed Message Passing Simplicial Networks naturally decouple these elements by performing message passing on the clique complex of the graph. Nevertheless, these models can be severely constrained by the rigid combinatorial structure of Simplicial Complexes (SCs). In this work, we extend recent theoretical results on SCs to regular Cell Complexes, topological objects that flexibly subsume SCs and graphs. We show that this generalisation provides a powerful set of graph "lifting" transformations, each leading to a unique hierarchical message passing procedure. The resulting methods, which we collectively call CW Networks (CWNs), are strictly more powerful than the WL test and not less powerful than the 3-WL test. In particular, we demonstrate the effectiveness of one such scheme, based on rings, when applied to molecular graph problems. The proposed architecture benefits from provably larger expressivity than commonly used GNNs, principled modelling of higher-order signals and from compressing the distances between nodes. We demonstrate that our model achieves state-of-the-art results on a variety of molecular datasets.
    Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity. (arXiv:2106.15933v2 [stat.ML] UPDATED)
    The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $\sigma^2$ of the parameters at initialization $\theta_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $\gamma$ of the variance $\sigma^2=w^{-\gamma}$ as $w\to\infty$: for large variance ($\gamma1$), $\theta_0$ is close to a saddle point and far from any global minimum. While the first case corresponds to the well-studied NTK regime, the second case is less understood. This motivates the study of the case $\gamma \to +\infty$, where we conjecture a Saddle-to-Saddle dynamics: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum. We support this conjecture with a theorem for the dynamics between the first two saddles, as well as some numerical experiments.
    Mitigating Covariate Shift in Imitation Learning via Offline Data Without Great Coverage. (arXiv:2106.03207v3 [cs.LG] UPDATED)
    This paper studies offline Imitation Learning (IL) where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage on the expert state-action traces (and with no necessity for a global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theory results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with behavior policies whose performances are less than half of that of the expert, MILO still successfully imitates with an extremely low number of expert state-action pairs while traditional offline IL method such as behavior cloning (BC) fails completely. Source code is provided at https://github.com/jdchang1/milo.
    Creating Training Sets via Weak Indirect Supervision. (arXiv:2110.03484v2 [cs.LG] UPDATED)
    Creating labeled training sets has become one of the major roadblocks in machine learning. To address this, recent Weak Supervision (WS) frameworks synthesize training labels from multiple potentially noisy supervision sources. However, existing frameworks are restricted to supervision sources that share the same output space as the target task. To extend the scope of usable sources, we formulate Weak Indirect Supervision (WIS), a new research problem for automatically synthesizing training labels based on indirect supervision sources that have different output label spaces. To overcome the challenge of mismatched output spaces, we develop a probabilistic modeling approach, PLRM, which uses user-provided label relations to model and leverage indirect supervision sources. Moreover, we provide a theoretically-principled test of the distinguishability of PLRM for unseen labels, along with a generalization bound. On both image and text classification tasks as well as an industrial advertising application, we demonstrate the advantages of PLRM by outperforming baselines by a margin of 2%-9%.
    Kronecker-factored Quasi-Newton Methods for Deep Learning. (arXiv:2102.06737v2 [cs.LG] UPDATED)
    Second-order methods have the capability of accelerating optimization by using much richer curvature information than first-order methods. However, most are impractical for deep learning, where the number of training parameters is huge. In Goldfarb et al. (2020), practical quasi-Newton methods were proposed that approximate the Hessian of a multilayer perceptron (MLP) model by a layer-wise block diagonal matrix where each layer's block is further approximated by a Kronecker product corresponding to the structure of the Hessian restricted to that layer. Here, we extend these methods to enable them to be applied to convolutional neural networks (CNNs), by analyzing the Kronecker-factored structure of the Hessian matrix of convolutional layers. Several improvements to the methods in Goldfarb et al. (2020) are also proposed that can be applied to both MLPs and CNNs. These new methods have memory requirements comparable to first-order methods and much less per-iteration time complexity than those in Goldfarb et al. (2020). Moreover, convergence results are proved for a variant under relatively mild conditions. Finally, we compared the performance of our new methods against several state-of-the-art (SOTA) methods on MLP autoencoder and CNN problems, and found that they outperformed the first-order SOTA methods and performed comparably to the second-order SOTA methods.
    RelaySum for Decentralized Deep Learning on Heterogeneous Data. (arXiv:2110.04175v2 [cs.LG] UPDATED)
    In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum. We prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data. Our code is available at this http URL
    Explainable Empirical Risk Minimization. (arXiv:2009.01492v2 [cs.LG] UPDATED)
    The successful application of machine learning (ML) methods becomes increasingly dependent on their interpretability or explainability. Designing explainable ML systems is instrumental to ensuring transparency of automated decision-making that targets humans. The explainability of ML methods is also an essential ingredient for trustworthy artificial intelligence. A key challenge in ensuring explainability is its dependence on the specific human user ("explainee"). The users of machine learning methods might have vastly different background knowledge about machine learning principles. One user might have a university degree in machine learning or related fields, while another user might have never received formal training in high-school mathematics. This paper applies information-theoretic concepts to develop a novel measure for the subjective explainability of the predictions delivered by a ML method. We construct this measure via the conditional entropy of predictions, given a user signal. This user signal might be obtained from user surveys or biophysical measurements. Our main contribution is the explainable empirical risk minimization (EERM) principle of learning a hypothesis that optimally balances between the subjective explainability and risk. The EERM principle is flexible and can be combined with arbitrary machine learning models. We present several practical implementations of EERM for linear models and decision trees. Numerical experiments demonstrate the application of EERM to detecting the use of inappropriate language on social media.
    Semi-discrete optimization through semi-discrete optimal transport: a framework for neural architecture search. (arXiv:2006.15221v2 [math.AP] UPDATED)
    In this paper we introduce a theoretical framework for semi-discrete optimization using ideas from optimal transport. Our primary motivation is in the field of deep learning, and specifically in the task of neural architecture search. With this aim in mind, we discuss the geometric and theoretical motivation for new techniques for neural architecture search (in a companion paper we show that algorithms inspired by our framework are competitive with contemporaneous methods). We introduce a Riemannian-like metric on the space of probability measures over a semi-discrete space $\mathbb{R}^d \times \mathcal{G}$ where $\mathcal{G}$ is a finite weighted graph. With such Riemmanian structure in hand, we derive formal expressions for the gradient flow of a relative entropy functional, as well as second order dynamics for the optimization of said energy. Then, with the aim of providing a rigorous motivation for the gradient flow equations derived formally, we also consider an iterative procedure known as minimizing movement scheme (i.e., Implicit Euler scheme, or JKO scheme) and apply it to the relative entropy with respect to a suitable cost function. For some specific choices of metric and cost, we rigorously show that the minimizing movement scheme of the relative entropy functional converges to the gradient flow process provided by the formal Riemannian structure. This flow coincides with a system of reaction-diffusion equations on $\mathbb{R}^d$.
    Hermite Polynomial Features for Private Data Generation. (arXiv:2106.05042v2 [cs.LG] UPDATED)
    Kernel mean embedding is a useful tool to represent and compare probability measures. Despite its usefulness, kernel mean embedding considers infinite-dimensional features, which are challenging to handle in the context of differentially private data generation. A recent work proposes to approximate the kernel mean embedding of data distribution using finite-dimensional random features, which yields analytically tractable sensitivity. However, the number of required random features is excessively high, often ten thousand to a hundred thousand, which worsens the privacy-accuracy trade-off. To improve the trade-off, we propose to replace random features with Hermite polynomial features. Unlike the random features, the Hermite polynomial features are ordered, where the features at the low orders contain more information on the distribution than those at the high orders. Hence, a relatively low order of Hermite polynomial features can more accurately approximate the mean embedding of the data distribution compared to a significantly higher number of random features. As demonstrated on several tabular and image datasets, the use of Hermite polynomial features is better suited for private data generation than the use of random features.
    SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models. (arXiv:2106.00553v3 [cs.LG] UPDATED)
    In recent years, implicit deep learning has emerged as a method to increase the effective depth of deep neural networks. While their training is memory-efficient, they are still significantly slower to train than their explicit counterparts. In Deep Equilibrium Models (DEQs), the training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix. In this paper, we propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer. The main idea is to use the quasi-Newton matrices from the forward pass to efficiently approximate the inverse Jacobian matrix in the direction needed for the gradient computation. We provide a theorem that motivates using our method with the original forward algorithms. In addition, by modifying these forward algorithms, we further provide theoretical guarantees that our method asymptotically estimates the true implicit gradient. We empirically study this approach and the recent Jacobian-Free method in different settings, ranging from hyperparameter optimization to large Multiscale DEQs (MDEQs) applied to CIFAR and ImageNet. Both methods reduce significantly the computational cost of the backward pass. While SHINE has a clear advantage on hyperparameter optimization problems, both methods attain similar computational performances for larger scale problems such as MDEQs at the cost of a limited performance drop compared to the original models.
    Thresholded Lasso Bandit. (arXiv:2010.11994v3 [stat.ML] UPDATED)
    In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension $d$, but where the reward function depends on a few, say $s_0\ll d$, of these features only. We present Thresholded Lasso bandit, an algorithm that (i) estimates the vector defining the reward function as well as its sparse support, i.e., significant feature elements, using the Lasso framework with thresholding, and (ii) selects an arm greedily according to this estimate projected on its support. The algorithm does not require prior knowledge of the sparsity index $s_0$ and can be parameter-free. For this simple algorithm, we establish non-asymptotic regret upper bounds scaling as $\mathcal{O}( \log d + \sqrt{T} )$ in general, and as $\mathcal{O}( \log d + \log T)$ under the so-called margin condition (a probabilistic condition on the separation of the arm rewards). The regret of previous algorithms scales as $\mathcal{O}( \log d + \sqrt{T \log (d T)})$ and $\mathcal{O}( \log T \log d)$ in the two settings, respectively. Through numerical experiments, we confirm that our algorithm outperforms existing methods.
    DIGRAC: Digraph Clustering Based on Flow Imbalance. (arXiv:2106.05194v5 [stat.ML] UPDATED)
    Node clustering is a powerful tool in the analysis of networks. We introduce a graph neural network framework to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss, which can be used for network clustering. Here, we propose directed flow imbalance measures, which are tightly related to directionality, to reveal clusters in the network even when there is no density difference between clusters. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. DIGRAC optimizes directed flow imbalance for clustering without requiring label supervision, unlike existing GNN methods, and can naturally incorporate node features, unlike existing spectral methods. Experimental results on synthetic data, in the form of directed stochastic block models, and real-world data at different scales, demonstrate that our method, based on flow imbalance, attains state-of-the-art results on directed graph clustering when compared against 10 methods from the literature, for a wide range of noise and sparsity levels and graph structures and topologies, and even outperforms supervised methods.
    Causal Bias Quantification for Continuous Treatments. (arXiv:2106.09762v2 [stat.ME] UPDATED)
    We extend the definition of the marginal causal effect to the continuous treatment setting and develop a novel characterization of causal bias in the framework of structural causal models. We prove that our derived bias expression is zero if, and only if, the causal effect is identifiable via covariate adjustment. We show that under some restrictions on the structural equations, the causal bias can be estimated efficiently and allows for causal regularization of predictive probabilistic models. We demonstrate the effectiveness of our method for causal bias quantification in various settings where (not) controlling for certain covariates would introduce causal bias.
    Lower Bounds and Accelerated Algorithms for Bilevel Optimization. (arXiv:2102.03926v4 [cs.LG] UPDATED)
    Bilevel optimization has recently attracted growing interests due to its wide applications in modern machine learning problems. Although recent studies have characterized the convergence rate for several such popular algorithms, it is still unclear how much further these convergence rates can be improved. In this paper, we address this fundamental question from two perspectives. First, we provide the first-known lower complexity bounds of $\widetilde{\Omega}(\frac{1}{\sqrt{\mu_x}\mu_y})$ and $\widetilde \Omega\big(\frac{1}{\sqrt{\epsilon}}\min\{\frac{1}{\mu_y},\frac{1}{\sqrt{\epsilon^{3}}}\}\big)$ respectively for strongly-convex-strongly-convex and convex-strongly-convex bilevel optimizations. Second, we propose an accelerated bilevel optimizer named AccBiO, for which we provide the first-known complexity bounds without the gradient boundedness assumption (which was made in existing analyses) under the two aforementioned geometries. We also provide significantly tighter upper bounds than the existing complexity when the bounded gradient assumption does hold. We show that AccBiO achieves the optimal results (i.e., the upper and lower bounds match up to logarithmic factors) when the inner-level problem takes a quadratic form with a constant-level condition number. Interestingly, our lower bounds under both geometries are larger than the corresponding optimal complexities of minimax optimization, establishing that bilevel optimization is provably more challenging than minimax optimization.
    Infinitely Deep Bayesian Neural Networks with Stochastic Differential Equations. (arXiv:2102.06559v4 [stat.ML] UPDATED)
    We perform scalable approximate inference in continuous-depth Bayesian neural networks. In this model class, uncertainty about separate weights in each layer gives hidden units that follow a stochastic differential equation. We demonstrate gradient-based stochastic variational inference in this infinite-parameter setting, producing arbitrarily-flexible approximate posteriors. We also derive a novel gradient estimator that approaches zero variance as the approximate posterior over weights approaches the true posterior. This approach brings continuous-depth Bayesian neural nets to a competitive comparison against discrete-depth alternatives, while inheriting the memory-efficient training and tunable precision of Neural ODEs.
    The Dual PC Algorithm for Structure Learning. (arXiv:2112.09036v2 [stat.ML] UPDATED)
    While learning the graphical structure of Bayesian networks from observational data is key to describing and helping understand data generating processes in complex applications, the task poses considerable challenges due to its computational complexity. The directed acyclic graph (DAG) representing a Bayesian network model is generally not identifiable from observational data, and a variety of methods exist to estimate its equivalence class instead. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by testing for conditional independence (CI), starting from marginal independence relationships and progressively expanding the conditioning set. Here, we propose the dual PC algorithm, a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. Notably, the elements of the precision matrix coincide with partial correlations for Gaussian data. Our algorithm then exploits block matrix inversions on the covariance and precision matrices to simultaneously perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm, therefore, proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies indicate that the dual PC algorithm outperforms the classical PC algorithm both in terms of run time and in recovering the underlying network structure.
    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. (arXiv:2107.07511v3 [cs.LG] UPDATED)
    Black-box machine learning learning methods are now routinely used in high-risk settings, like medical diagnostics, which demand uncertainty quantification to avoid consequential model failures. Distribution-free uncertainty quantification (distribution-free UQ) is a user-friendly paradigm for creating statistically rigorous confidence intervals/sets for such predictions. Critically, the intervals/sets are valid without distributional assumptions or model assumptions, possessing explicit guarantees even with finitely many datapoints. Moreover, they adapt to the difficulty of the input; when the input example is difficult, the uncertainty intervals/sets are large, signaling that the model might be wrong. Without much work and without retraining, one can use distribution-free methods on any underlying algorithm, such as a neural network, to produce confidence sets guaranteed to contain the ground truth with a user-specified probability, such as 90%. Indeed, the methods are easy-to-understand and general, applying to many modern prediction problems arising in the fields of computer vision, natural language processing, deep reinforcement learning, and so on. This hands-on introduction is aimed at a reader interested in the practical implementation of distribution-free UQ who is not necessarily a statistician. We lead the reader through the practical theory and applications of distribution-free UQ, beginning with conformal prediction and culminating with distribution-free control of any risk, such as the false-discovery rate, false positive rate of out-of-distribution detection, and so on. We will include many explanatory illustrations, examples, and code samples in Python, with PyTorch syntax. The goal is to provide the reader a working understanding of distribution-free UQ, allowing them to put confidence intervals on their algorithms, with one self-contained document.
    Structured Stochastic Gradient MCMC. (arXiv:2107.09028v3 [cs.LG] UPDATED)
    Stochastic gradient Markov Chain Monte Carlo (SGMCMC) is considered the gold standard for Bayesian inference in large-scale models, such as Bayesian neural networks. Since practitioners face speed versus accuracy tradeoffs in these models, variational inference (VI) is often the preferable option. Unfortunately, VI makes strong assumptions on both the factorization and functional form of the posterior. In this work, we propose a new non-parametric variational approximation that makes no assumptions about the approximate posterior's functional form and allows practitioners to specify the exact dependencies the algorithm should respect or break. The approach relies on a new Langevin-type algorithm that operates on a modified energy function, where parts of the latent variables are averaged over samples from earlier iterations of the Markov chain. This way, statistical dependencies can be broken in a controlled way, allowing the chain to mix faster. This scheme can be further modified in a "dropout" manner, leading to even more scalability. We test our scheme for ResNet-20 on CIFAR-10, SVHN, and FMNIST. In all cases, we find improvements in convergence speed and/or final accuracy compared to SG-MCMC and VI.
    Identifiable Energy-based Representations: An Application to Estimating Heterogeneous Causal Effects. (arXiv:2108.03039v3 [cs.LG] UPDATED)
    Conditional average treatment effects (CATEs) allow us to understand the effect heterogeneity across a large population of individuals. However, typical CATE learners assume all confounding variables are measured in order for the CATE to be identifiable. This requirement can be satisfied by collecting many variables, at the expense of increased sample complexity for estimating CATEs. To combat this, we propose an energy-based model (EBM) that learns a low-dimensional representation of the variables by employing a noise contrastive loss function. With our EBM we introduce a preprocessing step that alleviates the dimensionality curse for any existing learner developed for estimating CATEs. We prove that our EBM keeps the representations partially identifiable up to some universal constant, as well as having universal approximation capability. These properties enable the representations to converge and keep the CATE estimates consistent. Experiments demonstrate the convergence of the representations, as well as show that estimating CATEs on our representations performs better than on the variables or the representations obtained through other dimensionality reduction methods.
    Correcting diacritics and typos with ByT5 transformer model. (arXiv:2201.13242v1 [cs.CL])
    Due to the fast pace of life and online communications, the prevalence of English and the QWERTY keyboard, people tend to forgo using diacritics, make typographical errors (typos) when typing. Restoring diacritics and correcting spelling is important for proper language use and disambiguation of texts for both humans and downstream algorithms. However, both of these problems are typically addressed separately, i.e., state-of-the-art diacritics restoration methods do not tolerate other typos. In this work, we tackle both problems at once by employing newly-developed ByT5 byte-level transformer models. Our simultaneous diacritics restoration and typos correction approach demonstrates near state-of-the-art performance in 13 languages, reaching >96% of the alpha-word accuracy. We also perform diacritics restoration alone on 12 benchmark datasets with the additional one for the Lithuanian language. The experimental investigation proves that our approach is able to achieve comparable results (>98%) to previously reported despite being trained on fewer data. Our approach is also able to restore diacritics in words not seen during training with >76% accuracy. We also show the accuracies to further improve with longer training. All this shows a great real-world application potential of our suggested methods to more data, languages, and error classes.
    BEER: Fast $O(1/T)$ Rate for Decentralized Nonconvex Optimization with Communication Compression. (arXiv:2201.13320v1 [cs.LG])
    Communication efficiency has been widely recognized as the bottleneck for large-scale decentralized machine learning applications in multi-agent or federated environments. To tackle the communication bottleneck, there have been many efforts to design communication-compressed algorithms for decentralized nonconvex optimization, where the clients are only allowed to communicate a small amount of quantized information (aka bits) with their neighbors over a predefined graph topology. Despite significant efforts, the state-of-the-art algorithm in the nonconvex setting still suffers from a slower rate of convergence $O((G/T)^{2/3})$ compared with their uncompressed counterpart, where $G$ measures the data heterogeneity across different clients, and $T$ is the number of communication rounds. This paper proposes BEER, which adopts communication compression with gradient tracking, and shows it converges at a faster rate of $O(1/T)$. This significantly improves over the state-of-the-art rate, by matching the rate without compression even under arbitrary data heterogeneity. Numerical experiments are also provided to corroborate our theory and confirm the practical superiority of BEER in the data heterogeneous regime.
    A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. (arXiv:2201.13409v1 [stat.ML])
    Bilevel optimization, the problem of minimizing a value function which involves the arg-minimum of another function, appears in many areas of machine learning. In a large scale setting where the number of samples is huge, it is crucial to develop stochastic methods, which only use a few samples at a time to progress. However, computing the gradient of the value function involves solving a linear system, which makes it difficult to derive unbiased stochastic estimates. To overcome this problem we introduce a novel framework, in which the solution of the inner problem, the solution of the linear system, and the main variable evolve at the same time. These directions are written as a sum, making it straightforward to derive unbiased estimates. The simplicity of our approach allows us to develop global variance reduction algorithms, where the dynamics of all variables is subject to variance reduction. We demonstrate that SABA, an adaptation of the celebrated SAGA algorithm in our framework, has $O(\frac1T)$ convergence rate, and that it achieves linear convergence under Polyak-Lojasciewicz assumption. This is the first stochastic algorithm for bilevel optimization that verifies either of these properties. Numerical experiments validate the usefulness of our method.
    Bayesian Optimization for Distributionally Robust Chance-constrained Problem. (arXiv:2201.13112v1 [stat.ML])
    In black-box function optimization, we need to consider not only controllable design variables but also uncontrollable stochastic environment variables. In such cases, it is necessary to solve the optimization problem by taking into account the uncertainty of the environmental variables. Chance-constrained (CC) problem, the problem of maximizing the expected value under a certain level of constraint satisfaction probability, is one of the practically important problems in the presence of environmental variables. In this study, we consider distributionally robust CC (DRCC) problem and propose a novel DRCC Bayesian optimization method for the case where the distribution of the environmental variables cannot be precisely specified. We show that the proposed method can find an arbitrary accurate solution with high probability in a finite number of trials, and confirm the usefulness of the proposed method through numerical experiments.
    Fast and accurate optimization on the orthogonal manifold without retraction. (arXiv:2102.07432v2 [stat.ML] UPDATED)
    We consider the problem of minimizing a function over the manifold of orthogonal matrices. The majority of algorithms for this problem compute a direction in the tangent space, and then use a retraction to move in that direction while staying on the manifold. Unfortunately, the numerical computation of retractions on the orthogonal manifold always involves some expensive linear algebra operation, such as matrix inversion, exponential or square-root. These operations quickly become expensive as the dimension of the matrices grows. To bypass this limitation, we propose the landing algorithm which does not use retractions. The algorithm is not constrained to stay on the manifold but its evolution is driven by a potential energy which progressively attracts it towards the manifold. One iteration of the landing algorithm only involves matrix multiplications, which makes it cheap compared to its retraction counterparts. We provide an analysis of the convergence of the algorithm, and demonstrate its promises on large-scale and deep learning problems, where it is faster and less prone to numerical errors than retraction-based methods.
    Score Matched Neural Exponential Families for Likelihood-Free Inference. (arXiv:2012.10903v4 [stat.ME] UPDATED)
    Bayesian Likelihood-Free Inference (LFI) approaches allow to obtain posterior distributions for stochastic models with intractable likelihood, by relying on model simulations. In Approximate Bayesian Computation (ABC), a popular LFI method, summary statistics are used to reduce data dimensionality. ABC algorithms adaptively tailor simulations to the observation in order to sample from an approximate posterior, whose form depends on the chosen statistics. In this work, we introduce a new way to learn ABC statistics: we first generate parameter-simulation pairs from the model independently on the observation; then, we use Score Matching to train a neural conditional exponential family to approximate the likelihood. The exponential family is the largest class of distributions with fixed-size sufficient statistics; thus, we use them in ABC, which is intuitively appealing and has state-of-the-art performance. In parallel, we insert our likelihood approximation in an MCMC for doubly intractable distributions to draw posterior samples. We can repeat that for any number of observations with no additional model simulations, with performance comparable to related approaches. We validate our methods on toy models with known likelihood and a large-dimensional time-series model.
    A Minimax Lower Bound for Low-Rank Matrix-Variate Logistic Regression. (arXiv:2105.14673v2 [cs.LG] UPDATED)
    This paper considers the problem of matrix-variate logistic regression. It derives the fundamental error threshold on estimating low-rank coefficient matrices in the logistic regression problem by obtaining a lower bound on the minimax risk. The bound depends explicitly on the dimension and distribution of the covariates, the rank and energy of the coefficient matrix, and the number of samples. The resulting bound is proportional to the intrinsic degrees of freedom in the problem, which suggests the sample complexity of the low-rank matrix logistic regression problem can be lower than that for vectorized logistic regression. The proof techniques utilized in this work also set the stage for development of minimax lower bounds for tensor-variate logistic regression problems.
    MobILE: Model-Based Imitation Learning From Observation Alone. (arXiv:2102.10769v3 [cs.LG] UPDATED)
    This paper studies Imitation Learning from Observations alone (ILFO) where the learner is presented with expert demonstrations that consist only of states visited by an expert (without access to actions taken by the expert). We present a provably efficient model-based framework MobILE to solve the ILFO problem. MobILE involves carefully trading off strategic exploration against imitation - this is achieved by integrating the idea of optimism in the face of uncertainty into the distribution matching imitation learning (IL) framework. We provide a unified analysis for MobILE, and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well studied notions of structural complexity. We also show that the ILFO problem is strictly harder than the standard IL problem by presenting an exponential sample complexity separation between IL and ILFO. We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE. Code for implementing the MobILE framework is available at https://github.com/rahulkidambi/MobILE-NeurIPS2021.
    Power-law escape rate of SGD. (arXiv:2105.09557v2 [cs.LG] UPDATED)
    Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a random time change. Using this formalism, we show that the log loss barrier $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$ between a local minimum $\theta^*$ and a saddle $\theta^s$ determines the escape rate of SGD from the local minimum, contrary to the previous results borrowing from physics that the linear loss barrier $\Delta L=L(\theta^s)-L(\theta^*)$ decides the escape rate. Our escape-rate formula strongly depends on the typical magnitude $h^*$ and the number $n$ of the outlier eigenvalues of the Hessian. This result explains an empirical fact that SGD prefers flat minima with low effective dimensions, giving an insight into implicit biases of SGD.
    Generating Adversarial Disturbances for Controller Verification. (arXiv:2012.06695v2 [cs.LG] UPDATED)
    We consider the problem of generating maximally adversarial disturbances for a given controller assuming only blackbox access to it. We propose an online learning approach to this problem that \emph{adaptively} generates disturbances based on control inputs chosen by the controller. The goal of the disturbance generator is to minimize \emph{regret} versus a benchmark disturbance-generating policy class, i.e., to maximize the cost incurred by the controller as well as possible compared to the best possible disturbance generator \emph{in hindsight} (chosen from a benchmark policy class). In the setting where the dynamics are linear and the costs are quadratic, we formulate our problem as an online trust region (OTR) problem with memory and present a new online learning algorithm (\emph{MOTR}) for this problem. We prove that this method competes with the best disturbance generator in hindsight (chosen from a rich class of benchmark policies that includes linear-dynamical disturbance generating policies). We demonstrate our approach on two simulated examples: (i) synthetically generated linear systems, and (ii) generating wind disturbances for the popular PX4 controller in the AirSim simulator. On these examples, we demonstrate that our approach outperforms several baseline approaches, including $H_{\infty}$ disturbance generation and gradient-based methods.
    Connecting the Dots: Numerical Randomized Hamiltonian Monte Carlo with State-Dependent Event Rates. (arXiv:2005.01285v3 [stat.CO] UPDATED)
    Numerical Generalized Randomized Hamiltonian Monte Carlo is introduced, as a robust, easy to use and computationally fast alternative to conventional Markov chain Monte Carlo methods for continuous target distributions. A wide class of piecewise deterministic Markov processes generalizing Randomized HMC (Bou-Rabee and Sanz-Serna, 2017) by allowing for state-dependent event rates is defined. Under very mild restrictions, such processes will have the desired target distribution as an invariant distribution. Secondly, the numerical implementation of such processes, based on adaptive numerical integration of second order ordinary differential equations (ODEs) is considered. The numerical implementation yields an approximate, yet highly robust algorithm that, unlike conventional Hamiltonian Monte Carlo, enables the exploitation of the complete Hamiltonian trajectories (hence the title). The proposed algorithm may yield large speedups and improvements in stability relative to relevant benchmarks, while incurring numerical biases that are negligible relative to the overall Monte Carlo errors. Granted access to a high-quality ODE code, the proposed methodology is both easy to implement and use, even for highly challenging and high-dimensional target distributions.  ( 2 min )
    Compositional Multi-Object Reinforcement Learning with Linear Relation Networks. (arXiv:2201.13388v1 [cs.RO])
    Although reinforcement learning has seen remarkable progress over the last years, solving robust dexterous object-manipulation tasks in multi-object settings remains a challenge. In this paper, we focus on models that can learn manipulation tasks in fixed multi-object settings and extrapolate this skill zero-shot without any drop in performance when the number of objects changes. We consider the generic task of bringing a specific cube out of a set to a goal position. We find that previous approaches, which primarily leverage attention and graph neural network-based architectures, do not generalize their skills when the number of input objects changes while scaling as $K^2$. We propose an alternative plug-and-play module based on relational inductive biases to overcome these limitations. Besides exceeding performances in their training environment, we show that our approach, which scales linearly in $K$, allows agents to extrapolate and generalize zero-shot to any new object number.  ( 2 min )
    Positive-Unlabeled Learning with Uncertainty-aware Pseudo-label Selection. (arXiv:2201.13192v1 [stat.ML])
    Pseudo-labeling solutions for positive-unlabeled (PU) learning have the potential to result in higher performance compared to cost-sensitive learning but are vulnerable to incorrectly estimated pseudo-labels. In this paper, we provide a theoretical analysis of a risk estimator that combines risk on PU and pseudo-labeled data. Furthermore, we show analytically as well as experimentally that such an estimator results in lower excess risk compared to using PU data alone, provided that enough samples are pseudo-labeled with acceptable error rates. We then propose PUUPL, a novel training procedure for PU learning that leverages the epistemic uncertainty of an ensemble of deep neural networks to minimize errors in pseudo-label selection. We conclude with extensive experiments showing the effectiveness of our proposed algorithm over different datasets, modalities, and learning tasks. These show that PUUPL enables a reduction of up to 20% in test error rates even when prior and negative samples are not provided for validation, setting a new state-of-the-art for PU learning.  ( 2 min )
    A Joint Learning and Communications Framework for Federated Learning over Wireless Networks. (arXiv:1909.07972v4 [cs.NI] UPDATED)
    In this paper, the problem of training federated learning (FL) algorithms over a realistic wireless network is studied. In particular, in the considered model, wireless users execute an FL algorithm while training their local FL models using their own data and transmitting the trained local FL models to a base station (BS) that will generate a global FL model and send it back to the users. Since all training parameters are transmitted over wireless links, the quality of the training will be affected by wireless factors such as packet errors and the availability of wireless resources. Meanwhile, due to the limited wireless bandwidth, the BS must select an appropriate subset of users to execute the FL algorithm so as to build a global FL model accurately. This joint learning, wireless resource allocation, and user selection problem is formulated as an optimization problem whose goal is to minimize an FL loss function that captures the performance of the FL algorithm. To address this problem, a closed-form expression for the expected convergence rate of the FL algorithm is first derived to quantify the impact of wireless factors on FL. Then, based on the expected convergence rate of the FL algorithm, the optimal transmit power for each user is derived, under a given user selection and uplink resource block (RB) allocation scheme. Finally, the user selection and uplink RB allocation is optimized so as to minimize the FL loss function. Simulation results show that the proposed joint federated learning and communication framework can reduce the FL loss function value by up to 10% and 16%, respectively, compared to: 1) An optimal user selection algorithm with random resource allocation and 2) a standard FL algorithm with random user selection and resource allocation.  ( 3 min )
    Agnostic Learnability of Halfspaces via Logistic Loss. (arXiv:2201.13419v1 [cs.LG])
    We investigate approximation guarantees provided by logistic regression for the fundamental problem of agnostic learning of homogeneous halfspaces. Previously, for a certain broad class of "well-behaved" distributions on the examples, Diakonikolas et al. (2020) proved an $\tilde{\Omega}(\textrm{OPT})$ lower bound, while Frei et al. (2021) proved an $\tilde{O}(\sqrt{\textrm{OPT}})$ upper bound, where $\textrm{OPT}$ denotes the best zero-one/misclassification risk of a homogeneous halfspace. In this paper, we close this gap by constructing a well-behaved distribution such that the global minimizer of the logistic risk over this distribution only achieves $\Omega(\sqrt{\textrm{OPT}})$ misclassification risk, matching the upper bound in (Frei et al., 2021). On the other hand, we also show that if we impose a radial-Lipschitzness condition in addition to well-behaved-ness on the distribution, logistic regression on a ball of bounded radius reaches $\tilde{O}(\textrm{OPT})$ misclassification risk. Our techniques also show for any well-behaved distribution, regardless of radial Lipschitzness, we can overcome the $\Omega(\sqrt{\textrm{OPT}})$ lower bound for logistic loss simply at the cost of one additional convex optimization step involving the hinge loss and attain $\tilde{O}(\textrm{OPT})$ misclassification risk. This two-step convex optimization algorithm is simpler than previous methods obtaining this guarantee, all of which require solving $O(\log(1/\textrm{OPT}))$ minimization problems.  ( 2 min )
    Buffered Asynchronous SGD for Byzantine Learning. (arXiv:2003.00937v4 [cs.LG] UPDATED)
    Distributed learning has become a hot research topic due to its wide application in clusterbased large-scale learning, federated learning, edge computing and so on. Most traditional distributed learning methods typically assume no failure or attack. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which are impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist non-omniscient attacks without storing any instances on server. Furthermore, we also propose an improved variant of BASGD, called BASGD with momentum (BASGDm), by introducing momentum into BASGD. BASGDm can resist both non-omniscient and omniscient attacks. Compared with those methods which need to store instances on server, BASGD and BASGDm have a wider scope of application. Both BASGD and BASGDm are compatible with various aggregation rules. Moreover, both BASGD and BASGDm are proved to be convergent and be able to resist failure or attack. Empirical results show that our methods significantly outperform existing ABL baselines when there exists failure or attack on workers.  ( 2 min )
    Nystr\"om Kernel Mean Embeddings. (arXiv:2201.13055v1 [stat.ML])
    Kernel mean embeddings are a powerful tool to represent probability distributions over arbitrary spaces as single points in a Hilbert space. Yet, the cost of computing and storing such embeddings prohibits their direct use in large-scale settings. We propose an efficient approximation procedure based on the Nystr\"om method, which exploits a small random subset of the dataset. Our main result is an upper bound on the approximation error of this procedure. It yields sufficient conditions on the subsample size to obtain the standard $n^{-1/2}$ rate while reducing computational costs. We discuss applications of this result for the approximation of the maximum mean discrepancy and quadrature rules, and illustrate our theoretical findings with numerical experiments.  ( 2 min )
    Trajectory Balance: Improved Credit Assignment in GFlowNets. (arXiv:2201.13259v1 [cs.LG])
    Generative Flow Networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object. Prior temporal difference-like learning objectives for training GFlowNets, such as flow matching and detailed balance, are prone to inefficient credit propagation across action sequences, particularly in the case of long sequences. We propose a new learning objective for GFlowNets, trajectory balance, as a more efficient alternative to previously used objectives. We prove that any global minimizer of the trajectory balance objective can define a policy that samples exactly from the target distribution. In experiments on four distinct domains, we empirically demonstrate the benefits of the trajectory balance objective for GFlowNet convergence, diversity of generated samples, and robustness to long action sequences and large action spaces.  ( 2 min )
    Variational Policy Propagation for Multi-agent Reinforcement Learning. (arXiv:2004.08883v4 [cs.LG] UPDATED)
    We propose a \emph{collaborative} multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a \emph{joint} policy through the interactions over agents. We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively. We integrate the variational inference as special differentiable layers in policy such that the actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable. We evaluate our algorithm on several large scale challenging tasks and demonstrate that it outperforms previous state-of-the-arts.  ( 2 min )
    Memory-Efficient Backpropagation through Large Linear Layers. (arXiv:2201.13195v1 [cs.LG])
    In modern neural networks like Transformers, linear layers require significant memory to store activations during backward pass. This study proposes a memory reduction approach to perform backpropagation through linear layers. Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplications and demonstrate that they require less memory with a moderate decrease of the test accuracy. Also, we investigate the variance of the gradient estimate induced by the randomized matrix multiplication. We compare this variance with the variance coming from gradient estimation based on the batch of samples. We demonstrate the benefits of the proposed method on the fine-tuning of the pre-trained RoBERTa model on GLUE tasks.  ( 2 min )
    Fluctuations, Bias, Variance & Ensemble of Learners: Exact Asymptotics for Convex Losses in High-Dimension. (arXiv:2201.13383v1 [stat.ML])
    From the sampling of data to the initialisation of parameters, randomness is ubiquitous in modern Machine Learning practice. Understanding the statistical fluctuations engendered by the different sources of randomness in prediction is therefore key to understanding robust generalisation. In this manuscript we develop a quantitative and rigorous theory for the study of fluctuations in an ensemble of generalised linear models trained on different, but correlated, features in high-dimensions. In particular, we provide a complete description of the asymptotic joint distribution of the empirical risk minimiser for generic convex loss and regularisation in the high-dimensional limit. Our result encompasses a rich set of classification and regression tasks, such as the lazy regime of overparametrised neural networks, or equivalently the random features approximation of kernels. While allowing to study directly the mitigating effect of ensembling (or bagging) on the bias-variance decomposition of the test error, our analysis also helps disentangle the contribution of statistical fluctuations, and the singular role played by the interpolation threshold that are at the roots of the "double-descent" phenomenon.  ( 2 min )
    Unified Perspective on Probability Divergence via Maximum Likelihood Density Ratio Estimation: Bridging KL-Divergence and Integral Probability Metrics. (arXiv:2201.13127v1 [cs.LG])
    This paper provides a unified perspective for the Kullback-Leibler (KL)-divergence and the integral probability metrics (IPMs) from the perspective of maximum likelihood density-ratio estimation (DRE). Both the KL-divergence and the IPMs are widely used in various fields in applications such as generative modeling. However, a unified understanding of these concepts has still been unexplored. In this paper, we show that the KL-divergence and the IPMs can be represented as maximal likelihoods differing only by sampling schemes, and use this result to derive a unified form of the IPMs and a relaxed estimation method. To develop the estimation problem, we construct an unconstrained maximum likelihood estimator to perform DRE with a stratified sampling scheme. We further propose a novel class of probability divergences, called the Density Ratio Metrics (DRMs), that interpolates the KL-divergence and the IPMs. In addition to these findings, we also introduce some applications of the DRMs, such as DRE and generative adversarial networks. In experiments, we validate the effectiveness of our proposed methods.  ( 2 min )
    Rotting infinitely many-armed bandits. (arXiv:2201.12975v1 [cs.LG])
    We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound where $T$ is the horizon time. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.  ( 2 min )
    NAS-Bench-Suite: NAS Evaluation is (Now) Surprisingly Easy. (arXiv:2201.13396v1 [cs.LG])
    The release of tabular benchmarks, such as NAS-Bench-101 and NAS-Bench-201, has significantly lowered the computational overhead for conducting scientific research in neural architecture search (NAS). Although they have been widely adopted and used to tune real-world NAS algorithms, these benchmarks are limited to small search spaces and focus solely on image classification. Recently, several new NAS benchmarks have been introduced that cover significantly larger search spaces over a wide range of tasks, including object detection, speech recognition, and natural language processing. However, substantial differences among these NAS benchmarks have so far prevented their widespread adoption, limiting researchers to using just a few benchmarks. In this work, we present an in-depth analysis of popular NAS algorithms and performance prediction methods across 25 different combinations of search spaces and datasets, finding that many conclusions drawn from a few NAS benchmarks do not generalize to other benchmarks. To help remedy this problem, we introduce NAS-Bench-Suite, a comprehensive and extensible collection of NAS benchmarks, accessible through a unified interface, created with the aim to facilitate reproducible, generalizable, and rapid NAS research. Our code is available at https://github.com/automl/naslib.  ( 2 min )
    Continual Repeated Annealed Flow Transport Monte Carlo. (arXiv:2201.13117v1 [stat.ML])
    We propose Continual Repeated Annealed Flow Transport Monte Carlo (CRAFT), a method that combines a sequential Monte Carlo (SMC) sampler (itself a generalization of Annealed Importance Sampling) with variational inference using normalizing flows. The normalizing flows are directly trained to transport between annealing temperatures using a KL divergence for each transition. This optimization objective is itself estimated using the normalizing flow/SMC approximation. We show conceptually and using multiple empirical examples that CRAFT improves on Annealed Flow Transport Monte Carlo (Arbel et al., 2021), on which it builds and also on Markov chain Monte Carlo (MCMC) based Stochastic Normalizing Flows (Wu et al., 2020). By incorporating CRAFT within particle MCMC, we show that such learnt samplers can achieve impressively accurate results on a challenging lattice field theory example.  ( 2 min )
    GenMod: A generative modeling approach for spectral representation of PDEs with random inputs. (arXiv:2201.12973v1 [stat.ML])
    We propose a method for quantifying uncertainty in high-dimensional PDE systems with random parameters, where the number of solution evaluations is small. Parametric PDE solutions are often approximated using a spectral decomposition based on polynomial chaos expansions. For the class of systems we consider (i.e., high dimensional with limited solution evaluations) the coefficients are given by an underdetermined linear system in a regression formulation. This implies additional assumptions, such as sparsity of the coefficient vector, are needed to approximate the solution. Here, we present an approach where we assume the coefficients are close to the range of a generative model that maps from a low to a high dimensional space of coefficients. Our approach is inspired be recent work examining how generative models can be used for compressed sensing in systems with random Gaussian measurement matrices. Using results from PDE theory on coefficient decay rates, we construct an explicit generative model that predicts the polynomial chaos coefficient magnitudes. The algorithm we developed to find the coefficients, which we call GenMod, is composed of two main steps. First, we predict the coefficient signs using Orthogonal Matching Pursuit. Then, we assume the coefficients are within a sparse deviation from the range of a sign-adjusted generative model. This allows us to find the coefficients by solving a nonconvex optimization problem, over the input space of the generative model and the space of sparse vectors. We obtain theoretical recovery results for a Lipschitz continuous generative model and for a more specific generative model, based on coefficient decay rate bounds. We examine three high-dimensional problems and show that, for all three examples, the generative model approach outperforms sparsity promoting methods at small sample sizes.  ( 2 min )
    An end-to-end deep learning approach for extracting stochastic dynamical systems with $\alpha$-stable L\'evy noise. (arXiv:2201.13114v1 [stat.ML])
    Recently, extracting data-driven governing laws of dynamical systems through deep learning frameworks has gained a lot of attention in various fields. Moreover, a growing amount of research work tends to transfer deterministic dynamical systems to stochastic dynamical systems, especially those driven by non-Gaussian multiplicative noise. However, lots of log-likelihood based algorithms that work well for Gaussian cases cannot be directly extended to non-Gaussian scenarios which could have high error and low convergence issues. In this work, we overcome some of these challenges and identify stochastic dynamical systems driven by $\alpha$-stable L\'evy noise from only random pairwise data. Our innovations include: (1) designing a deep learning approach to learn both drift and diffusion terms for L\'evy induced noise with $\alpha$ across all values, (2) learning complex multiplicative noise without restrictions on small noise intensity, (3) proposing an end-to-end complete framework for stochastic systems identification under a general input data assumption, that is, $\alpha$-stable random variable. Finally, numerical experiments and comparisons with the non-local Kramers-Moyal formulas with moment generating function confirm the effectiveness of our method.  ( 2 min )
    Guided Semi-Supervised Non-negative Matrix Factorization on Legal Documents. (arXiv:2201.13324v1 [cs.LG])
    Classification and topic modeling are popular techniques in machine learning that extract information from large-scale datasets. By incorporating a priori information such as labels or important features, methods have been developed to perform classification and topic modeling tasks; however, most methods that can perform both do not allow for guidance of the topics or features. In this paper, we propose a method, namely Guided Semi-Supervised Non-negative Matrix Factorization (GSSNMF), that performs both classification and topic modeling by incorporating supervision from both pre-assigned document class labels and user-designed seed words. We test the performance of this method through its application to legal documents provided by the California Innocence Project, a nonprofit that works to free innocent convicted persons and reform the justice system. The results show that our proposed method improves both classification accuracy and topic coherence in comparison to past methods like Semi-Supervised Non-negative Matrix Factorization (SSNMF) and Guided Non-negative Matrix Factorization (Guided NMF).  ( 2 min )
    Variational Inference as Iterative Projection in a Bayesian Hilbert Space. (arXiv:2005.07275v2 [cs.LG] UPDATED)
    Variational Bayesian inference is an important machine-learning tool that finds application from statistics to robotics. The goal is to find an approximate probability density function (PDF) from a chosen family that is in some sense 'closest' to the full Bayesian posterior. Closeness is typically defined through the selection of an appropriate loss functional such as the Kullback-Leibler (KL) divergence. In this paper, we explore a new formulation of variational inference by exploiting the fact that (most) PDFs are members of a Bayesian Hilbert space under careful definitions of vector addition, scalar multiplication and an inner product. We show that variational inference based on KL divergence then amounts to an iterative projection, in the Euclidean sense, of the Bayesian posterior onto a subspace corresponding to the selected approximation family. We work through the details of this general framework for the specific case of the Gaussian approximation family and show the equivalence to another Gaussian variational inference approach. We furthermore discuss the implications for systems that exhibit sparsity, which is handled naturally in Bayesian space, and give an example of a high-dimensional robotic state estimation problem that can be handled as a result. Finally, we provide some preliminary examples of how the approach could be applied to non-Gaussian inference.  ( 2 min )
    Assessment of DeepONet for reliability analysis of stochastic nonlinear dynamical systems. (arXiv:2201.13145v1 [stat.ML])
    Time dependent reliability analysis and uncertainty quantification of structural system subjected to stochastic forcing function is a challenging endeavour as it necessitates considerable computational time. We investigate the efficacy of recently proposed DeepONet in solving time dependent reliability analysis and uncertainty quantification of systems subjected to stochastic loading. Unlike conventional machine learning and deep learning algorithms, DeepONet learns is a operator network and learns a function to function mapping and hence, is ideally suited to propagate the uncertainty from the stochastic forcing function to the output responses. We use DeepONet to build a surrogate model for the dynamical system under consideration. Multiple case studies, involving both toy and benchmark problems, have been conducted to examine the efficacy of DeepONet in time dependent reliability analysis and uncertainty quantification of linear and nonlinear dynamical systems. Results obtained indicate that the DeepONet architecture is accurate as well as efficient. Moreover, DeepONet posses zero shot learning capabilities and hence, a trained model easily generalizes to unseen and new environment with no further training.  ( 2 min )
    Deletion Robust Submodular Maximization over Matroids. (arXiv:2201.13128v1 [cs.DS])
    Maximizing a monotone submodular function is a fundamental task in machine learning. In this paper, we study the deletion robust version of the problem under the classic matroids constraint. Here the goal is to extract a small size summary of the dataset that contains a high value independent set even after an adversary deleted some elements. We present constant-factor approximation algorithms, whose space complexity depends on the rank $k$ of the matroid and the number $d$ of deleted elements. In the centralized setting we present a $(3.582+O(\varepsilon))$-approximation algorithm with summary size $O(k + \frac{d \log k}{\varepsilon^2})$. In the streaming setting we provide a $(5.582+O(\varepsilon))$-approximation algorithm with summary size and memory $O(k + \frac{d \log k}{\varepsilon^2})$. We complement our theoretical results with an in-depth experimental analysis showing the effectiveness of our algorithms on real-world datasets.  ( 2 min )
    Robust supervised learning with coordinate gradient descent. (arXiv:2201.13372v1 [stat.ML])
    This paper considers the problem of supervised learning with linear methods when both features and labels can be corrupted, either in the form of heavy tailed data and/or corrupted rows. We introduce a combination of coordinate gradient descent as a learning algorithm together with robust estimators of the partial derivatives. This leads to robust statistical learning methods that have a numerical complexity nearly identical to non-robust ones based on empirical risk minimization. The main idea is simple: while robust learning with gradient descent requires the computational cost of robustly estimating the whole gradient to update all parameters, a parameter can be updated immediately using a robust estimator of a single partial derivative in coordinate gradient descent. We prove upper bounds on the generalization error of the algorithms derived from this idea, that control both the optimization and statistical errors with and without a strong convexity assumption of the risk. Finally, we propose an efficient implementation of this approach in a new python library called linlearn, and demonstrate through extensive numerical experiments that our approach introduces a new interesting compromise between robustness, statistical performance and numerical efficiency for this problem.  ( 2 min )
    A Regret Minimization Approach to Multi-Agent Contro. (arXiv:2201.13288v1 [math.OC])
    We study the problem of multi-agent control of a dynamical system with known dynamics and adversarial disturbances. Our study focuses on optimal control without centralized precomputed policies, but rather with adaptive control policies for the different agents that are only equipped with a stabilizing controller. We give a reduction from any (standard) regret minimizing control method to a distributed algorithm. The reduction guarantees that the resulting distributed algorithm has low regret relative to the optimal precomputed joint policy. Our methodology involves generalizing online convex optimization to a multi-agent setting and applying recent tools from nonstochastic control derived for a single agent. We empirically evaluate our method on a model of an overactuated aircraft. We show that the distributed method is robust to failure and to adversarial perturbations in the dynamics.  ( 2 min )
    FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. (arXiv:2201.12740v1 [cs.LG])
    Although Transformer-based methods have significantly improved state-of-the-art results for long-term series forecasting, they are not only computationally expensive but more importantly, are unable to capture the global view of time series (e.g. overall trend). To address these problems, we propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series while Transformers capture more detailed structures. To further enhance the performance of Transformer for long-term prediction, we exploit the fact that most time series tend to have a sparse representation in well-known basis such as Fourier transform, and develop a frequency enhanced Transformer. Besides being more effective, the proposed method, termed as Frequency Enhanced Decomposed Transformer ({\bf FEDformer}), is more efficient than standard Transformer with a linear complexity to the sequence length. Our empirical studies with six benchmark datasets show that compared with state-of-the-art methods, FEDformer can reduce prediction error by $14.8\%$ and $22.6\%$ for multivariate and univariate time series, respectively. the code will be released soon.  ( 2 min )
    Stochastic Neural Networks with Infinite Width are Deterministic. (arXiv:2201.12724v1 [cs.LG])
    This work theoretically studies stochastic neural networks, a main type of neural network in use. Specifically, we prove that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero. Two common examples that our theory applies to are neural networks with dropout and variational autoencoders. Our result helps better understand how stochasticity affects the learning of neural networks and thus design better architectures for practical problems.  ( 2 min )
    Fair Wrapping for Black-box Predictions. (arXiv:2201.12947v1 [stat.ML])
    We introduce a new family of techniques to post-process ("wrap") a black-box classifier in order to reduce its bias. Our technique builds on the recent analysis of improper loss functions whose optimisation can correct any twist in prediction, unfairness being treated as a twist. In the post-processing, we learn a wrapper function which we define as an {\alpha}-tree, which modifies the prediction. We provide two generic boosting algorithms to learn {\alpha}-trees. We show that our modification has appealing properties in terms of composition of{\alpha}-trees, generalization, interpretability, and KL divergence between modified and original predictions. We exemplify the use of our technique in three fairness notions: conditional value at risk, equality of opportunity, and statistical parity; and provide experiments on several readily available datasets.  ( 2 min )
    Cryptocurrency Valuation: An Explainable AI Approach. (arXiv:2201.12893v1 [econ.GN])
    Currently, there are no convincing proxies for the fundamentals of cryptocurrency assets. We propose a new market-to-fundamental ratio, the price-to-utility (PU) ratio, utilizing unique blockchain accounting methods. We then proxy various fundamental-to-market ratios by Bitcoin historical data and find they have little predictive power for short-term bitcoin returns. However, PU ratio effectively predicts long-term bitcoin returns. We verify PU ratio valuation by unsupervised and supervised machine learning. The valuation method informs investment returns and predicts bull markets effectively. Finally, we present an automated trading strategy advised by the PU ratio that outperforms the conventional buy-and-hold and market-timing strategies. We distribute the trading algorithms as open-source software via Python Package Index for future research.  ( 2 min )
    Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms. (arXiv:2201.12700v1 [cs.LG])
    Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction $\alpha < 1/2$ of tasks (users) are arbitrary and adversarial. The remaining fraction of good users share the same instance of contextual bandits with $S$ contexts and $A$ actions (items). Naturally, whether a user is good or adversarial is not known in advance. The goal is to robustly learn the policy that maximizes rewards for good users with as few user interactions as possible. Without adversarial users, established results in collaborative filtering show that $O(1/\epsilon^2)$ per-user interactions suffice to learn a good policy, precisely because information can be shared across users. This parallelization gain is fundamentally altered by the presence of adversarial users: unless there are super-polynomial number of users, we show a lower bound of $\tilde{\Omega}(\min(S,A) \cdot \alpha^2 / \epsilon^2)$ {\it per-user} interactions to learn an $\epsilon$-optimal policy for the good users. We then show we can achieve an $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$ upper-bound, by employing efficient robust mean estimators for both uni-variate and high-dimensional random variables. We also show that this can be improved depending on the distributions of contexts.  ( 2 min )
    Deep Non-Crossing Quantiles through the Partial Derivative. (arXiv:2201.12848v1 [cs.LG])
    Quantile Regression (QR) provides a way to approximate a single conditional quantile. To have a more informative description of the conditional distribution, QR can be merged with deep learning techniques to simultaneously estimate multiple quantiles. However, the minimisation of the QR-loss function does not guarantee non-crossing quantiles, which affects the validity of such predictions and introduces a critical issue in certain scenarios. In this article, we propose a generic deep learning algorithm for predicting an arbitrary number of quantiles that ensures the quantile monotonicity constraint up to the machine precision and maintains its modelling performance with respect to alternative models. The presented method is evaluated over several real-world datasets obtaining state-of-the-art results as well as showing that it scales to large-size data sets.  ( 2 min )
    Approximate Bayesian Computation Based on Maxima Weighted Isolation Kernel Mapping. (arXiv:2201.12745v1 [stat.ML])
    Motivation: The branching processes model yields unevenly stochastically distributed data that consists of sparse and dense regions. The work tries to solve the problem of a precise evaluation of a parameter for this type of model. The application of the branching processes model to cancer cell evolution has many difficulties like high dimensionality and the rare appearance of a result of interest. Moreover, we would like to solve the ambitious task of obtaining the coefficients of the model reflecting the relationship of driver genes mutations and cancer hallmarks on the basis of personal data of variant allele frequencies. Results: The Approximate Bayesian computation method based on the Isolation kernel is designed. The method includes a transformation row data to a Hilbert space (mapping) and measures the similarity between simulation points and maxima weighted Isolation kernel mapping related to the observation point. Also, we designed a heuristic algorithm to find parameter estimation without gradient calculation and dimension-independent. The advantage of the proposed machine learning method is shown for multidimensional test data as well as for an example of cancer cell evolution.  ( 2 min )
    Why the Rich Get Richer? On the Balancedness of Random Partition Models. (arXiv:2201.12697v1 [stat.ML])
    Random partition models are widely used in Bayesian methods for various clustering tasks, such as mixture models, topic models, and community detection problems. While the number of clusters induced by random partition models has been studied extensively, another important model property regarding the balancedness of cluster sizes has been largely neglected. We formulate a framework to define and theoretically study the balancedness of exchangeable random partition models, by analyzing how a model assigns probabilities to partitions with different levels of balancedness. We demonstrate that the "rich-get-richer" characteristic of many existing popular random partition models is an inevitable consequence of two common assumptions: product-form exchangeability and projectivity. We propose a principled way to compare the balancedness of random partition models, which gives a better understanding of what model works better and what doesn't for different applications. We also introduce the "rich-get-poorer" random partition models and illustrate their application to entity resolution tasks.  ( 2 min )
    Provable Domain Generalization via Invariant-Feature Subspace Recovery. (arXiv:2201.12919v1 [cs.LG])
    Domain generalization asks for models trained on a set of training environments to perform well on unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) has been proposed for domain generalization. However, Rosenfeld et al. (2021) shows that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with less than $d_s+1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this paper, we propose to achieve domain generalization with Invariant-feature Subspace Recovery (ISR). Our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s+1$ training environments under the data model of Rosenfeld et al. (2021). Our second algorithm, ISR-Cov, further reduces the required number of training environments to $O(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Empirically, our ISRs can obtain superior performance compared with IRM on synthetic benchmarks. In addition, on three real-world image and text datasets, we show that ISR-Mean can be used as a simple yet effective post-processing method to increase the worst-case accuracy of trained models against spurious correlations and group shifts.  ( 2 min )
    Error Rates for Kernel Classification under Source and Capacity Conditions. (arXiv:2201.12655v1 [stat.ML])
    In this manuscript, we consider the problem of kernel classification under the Gaussian data design, and under source and capacity assumptions on the dataset. While the decay rates of the prediction error have been extensively studied under much more generic assumptions for kernel ridge regression, deriving decay rates for the classification problem has been hitherto considered a much more challenging task. In this work we leverage recent analytical results for learning curves of linear classification with generic loss function to derive the rates of decay of the misclassification (prediction) error with the sample complexity for two standard classification settings, namely margin-maximizing Support Vector Machines (SVM) and ridge classification. Using numerical and analytical arguments, we derive the error rates as a function of the source and capacity coefficients, and contrast the two methods.  ( 2 min )
    COIN++: Data Agnostic Neural Compression. (arXiv:2201.12904v1 [cs.LG])
    Neural compression algorithms are typically based on autoencoders that require specialized encoder and decoder architectures for different data modalities. In this paper, we propose COIN++, a neural compression framework that seamlessly handles a wide range of data modalities. Our approach is based on converting data to implicit neural representations, i.e. neural functions that map coordinates (such as pixel locations) to features (such as RGB values). Then, instead of storing the weights of the implicit neural representation directly, we store modulations applied to a meta-learned base network as a compressed code for the data. We further quantize and entropy code these modulations, leading to large compression gains while reducing encoding time by two orders of magnitude compared to baselines. We empirically demonstrate the effectiveness of our method by compressing various data modalities, from images to medical and climate data.  ( 2 min )
    Scaling Gaussian Process Optimization by Evaluating a Few Unique Candidates Multiple Times. (arXiv:2201.12909v1 [stat.ML])
    Computing a Gaussian process (GP) posterior has a computational cost cubical in the number of historical points. A reformulation of the same GP posterior highlights that this complexity mainly depends on how many \emph{unique} historical points are considered. This can have important implication in active learning settings, where the set of historical points is constructed sequentially by the learner. We show that sequential black-box optimization based on GPs (GP-Opt) can be made efficient by sticking to a candidate solution for multiple evaluation steps and switch only when necessary. Limiting the number of switches also limits the number of unique points in the history of the GP. Thus, the efficient GP reformulation can be used to exactly and cheaply compute the posteriors required to run the GP-Opt algorithms. This approach is especially useful in real-world applications of GP-Opt with high switch costs (e.g. switching chemicals in wet labs, data/model loading in hyperparameter optimization). As examples of this meta-approach, we modify two well-established GP-Opt algorithms, GP-UCB and GP-EI, to switch candidates as infrequently as possible adapting rules from batched GP-Opt. These versions preserve all the theoretical no-regret guarantees while improving practical aspects of the algorithms such as runtime, memory complexity, and the ability of batching candidates and evaluating them in parallel.  ( 2 min )
    Riemannian block SPD coupling manifold and its application to optimal transport. (arXiv:2201.12933v1 [math.FA])
    Optimal transport (OT) has seen its popularity in various fields of applications. We start by observing that the OT problem can be viewed as an instance of a general symmetric positive definite (SPD) matrix-valued OT problem, where the cost, the marginals, and the coupling are represented as block matrices and each component block is a SPD matrix. The summation of row blocks and column blocks in the coupling matrix are constrained by the given block-SPD marginals. We endow the set of such block-coupling matrices with a novel Riemannian manifold structure. This allows to exploit the versatile Riemannian optimization framework to solve generic SPD matrix-valued OT problems. We illustrate the usefulness of the proposed approach in several applications.  ( 2 min )
    Geometry- and Accuracy-Preserving Random Forest Proximities. (arXiv:2201.12682v1 [stat.ML])
    Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest which measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly match the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.  ( 2 min )
    A Priori Denoising Strategies for Sparse Identification of Nonlinear Dynamical Systems: A Comparative Study. (arXiv:2201.12683v1 [stat.ML])
    In recent years, identification of nonlinear dynamical systems from data has become increasingly popular. Sparse regression approaches, such as Sparse Identification of Nonlinear Dynamics (SINDy), fostered the development of novel governing equation identification algorithms assuming the state variables are known a priori and the governing equations lend themselves to sparse, linear expansions in a (nonlinear) basis of the state variables. In the context of the identification of governing equations of nonlinear dynamical systems, one faces the problem of identifiability of model parameters when state measurements are corrupted by noise. Measurement noise affects the stability of the recovery process yielding incorrect sparsity patterns and inaccurate estimation of coefficients of the governing equations. In this work, we investigate and compare the performance of several local and global smoothing techniques to a priori denoise the state measurements and numerically estimate the state time-derivatives to improve the accuracy and robustness of two sparse regression methods to recover governing equations: Sequentially Thresholded Least Squares (STLS) and Weighted Basis Pursuit Denoising (WBPDN) algorithms. We empirically show that, in general, global methods, which use the entire measurement data set, outperform local methods, which employ a neighboring data subset around a local point. We additionally compare Generalized Cross Validation (GCV) and Pareto curve criteria as model selection techniques to automatically estimate near optimal tuning parameters, and conclude that Pareto curves yield better results. The performance of the denoising strategies and sparse regression methods is empirically evaluated through well-known benchmark problems of nonlinear dynamical systems.  ( 2 min )
    Hyperbolic Neural Networks for Molecular Generation. (arXiv:2201.12825v1 [cs.LG])
    With the recent advance of deep learning, neural networks have been extensively used for the task of molecular generation. Many deep generators extract atomic relations from molecular graphs and ignore hierarchical information at both atom and molecule levels. In order to extract such hierarchical information, we propose a novel hyperbolic generative model. Our model contains three parts: first, a fully hyperbolic junction-tree encoder-decoder that embeds the hierarchical information of the molecules in the latent hyperbolic space; second, a latent generative adversarial network for generating the latent embeddings; third, a molecular generator that inherits the decoders from the first part and the latent generator from the second part. We evaluate our model on the ZINC dataset using the MOSES benchmarking platform and achieve competitive results, especially in metrics about structural similarity.  ( 2 min )
    No-Regret Learning in Time-Varying Zero-Sum Games. (arXiv:2201.12736v1 [cs.LG])
    Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning. We consider a variant of this problem where the game payoff matrix changes over time, possibly in an adversarial manner. We first present three performance measures to guide the algorithmic design for this problem: 1) the well-studied individual regret, 2) an extension of duality gap, and 3) a new measure called dynamic Nash Equilibrium regret, which quantifies the cumulative difference between the player's payoff and the minimax game value. Next, we develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under all these three performance measures. These guarantees are adaptive to different non-stationarity measures of the payoff matrices and, importantly, recover the best known results when the payoff matrix is fixed. Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property, along with several novel ingredients specifically designed for the time-varying game setting. Empirical results further validate the effectiveness of our algorithm.  ( 2 min )
    Similarity and Generalization: From Noise to Corruption. (arXiv:2201.12803v1 [cs.LG])
    Contrastive learning aims to extract distinctive features from data by finding an embedding representation where similar samples are close to each other, and different ones are far apart. We study generalization in contrastive learning, focusing on its simplest representative: Siamese Neural Networks (SNNs). We show that Double Descent also appears in SNNs and is exacerbated by noise. We point out that SNNs can be affected by two distinct sources of noise: Pair Label Noise (PLN) and Single Label Noise (SLN). The effect of SLN is asymmetric, but it preserves similarity relations, while PLN is symmetric but breaks transitivity. We show that the dataset topology crucially affects generalization. While sparse datasets show the same performances under SLN and PLN for an equal amount of noise, SLN outperforms PLN in the overparametrized region in dense datasets. Indeed, in this regime, PLN similarity violation becomes macroscopical, corrupting the dataset to the point where complete overfitting cannot be achieved. We call this phenomenon Density-Induced Break of Similarity (DIBS). We also probe the equivalence between online optimization and offline generalization for similarity tasks. We observe that an online/offline correspondence in similarity learning can be affected by both the network architecture and label noise.  ( 2 min )
    Meta-Learners for Estimation of Causal Effects: Finite Sample Cross-Fit Performance. (arXiv:2201.12692v1 [econ.EM])
    Estimation of causal effects using machine learning methods has become an active research field in econometrics. In this paper, we study the finite sample performance of meta-learners for estimation of heterogeneous treatment effects under the usage of sample-splitting and cross-fitting to reduce the overfitting bias. In both synthetic and semi-synthetic simulations we find that the performance of the meta-learners in finite samples greatly depends on the estimation procedure. The results imply that sample-splitting and cross-fitting are beneficial in large samples for bias reduction and efficiency of the meta-learners, respectively, whereas full-sample estimation is preferable in small samples. Furthermore, we derive practical recommendations for application of specific meta-learners in empirical studies depending on particular data characteristics such as treatment shares and sample size.  ( 2 min )
    Implicit Regularization Towards Rank Minimization in ReLU Networks. (arXiv:2201.12760v1 [cs.LG])
    We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank minimization heuristic. However, understanding to what extent this generalizes to nonlinear networks is an open problem. In this paper, we focus on nonlinear ReLU networks, providing several new positive and negative results. On the negative side, we prove (and demonstrate empirically) that, unlike the linear case, GF on ReLU networks may no longer tend to minimize ranks, in a rather strong sense (even approximately, for "most" datasets of size 2). On the positive side, we reveal that ReLU networks of sufficient depth are provably biased towards low-rank solutions in several reasonable settings.  ( 2 min )
    ApolloRL: a Reinforcement Learning Platform for Autonomous Driving. (arXiv:2201.12609v1 [cs.RO])
    We introduce ApolloRL, an open platform for research in reinforcement learning for autonomous driving. The platform provides a complete closed-loop pipeline with training, simulation, and evaluation components. It comes with 300 hours of real-world data in driving scenarios and popular baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) agents. We elaborate in this paper on the architecture and the environment defined in the platform. In addition, we discuss the performance of the baseline agents in the ApolloRL environment.  ( 2 min )
    AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation. (arXiv:2201.12570v1 [q-bio.BM])
    Antibodies are canonically Y-shaped multimeric proteins capable of highly specific molecular recognition. The CDRH3 region located at the tip of variable chains of an antibody dominates antigen-binding specificity. Therefore, it is a priority to design optimal antigen-specific CDRH3 regions to develop therapeutic antibodies to combat harmful pathogens. However, the combinatorial nature of CDRH3 sequence space makes it impossible to search for an optimal binding sequence exhaustively and efficiently, especially not experimentally. Here, we present AntBO: a Combinatorial Bayesian Optimisation framework enabling efficient in silico design of the CDRH3 region. Ideally, antibodies should bind to their target antigen and be free from any harmful outcomes. Therefore, we introduce the CDRH3 trust region that restricts the search to sequences with feasible developability scores. To benchmark AntBO, we use the Absolut! software suite as a black-box oracle because it can score the target specificity and affinity of designed antibodies in silico in an unconstrained fashion. The results across 188 antigens demonstrate the benefit of AntBO in designing CDRH3 regions with diverse biophysical properties. In under 200 protein designs, AntBO can suggest antibody sequences that outperform the best binding sequence drawn from 6.9 million experimentally obtained CDRH3s and a commonly used genetic algorithm baseline. Additionally, AntBO finds very-high affinity CDRH3 sequences in only 38 protein designs whilst requiring no domain knowledge. We conclude AntBO brings automated antibody design methods closer to what is practically viable for in vitro experimentation.  ( 2 min )
    Robust Imitation Learning from Corrupted Demonstrations. (arXiv:2201.12594v1 [cs.LG])
    We consider offline Imitation Learning from corrupted demonstrations where a constant fraction of data can be noise or even arbitrary outliers. Classical approaches such as Behavior Cloning assumes that demonstrations are collected by an presumably optimal expert, hence may fail drastically when learning from corrupted demonstrations. We propose a novel robust algorithm by minimizing a Median-of-Means (MOM) objective which guarantees the accurate estimation of policy, even in the presence of constant fraction of outliers. Our theoretical analysis shows that our robust method in the corrupted setting enjoys nearly the same error scaling and sample complexity guarantees as the classical Behavior Cloning in the expert demonstration setting. Our experiments on continuous-control benchmarks validate that our method exhibits the predicted robustness and effectiveness, and achieves competitive results compared to existing imitation learning methods.  ( 2 min )
    Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. (arXiv:2201.12417v1 [cs.LG])
    In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.  ( 2 min )
    Any Variational Autoencoder Can Do Arbitrary Conditioning. (arXiv:2201.12414v1 [cs.LG])
    Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities $p(\mathbf{x}_u \mid \mathbf{x}_o)$ that underly some data, for all possible non-intersecting subsets $o, u \subset \{1, \dots , d\}$. However, the vast majority of density estimation only focuses on modeling the joint distribution $p(\mathbf{x})$, in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables any Variational Autoencoder (VAE) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching achieves performance that is comparable or superior to current state-of-the-art methods for a variety of tasks.  ( 2 min )
    Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing. (arXiv:2201.12428v1 [cs.LG])
    This paper demonstrates the systematic use of combinatorial coverage for selecting and characterizing test and training sets for machine learning models. The presented work adapts combinatorial interaction testing, which has been successfully leveraged in identifying faults in software testing, to characterize data used in machine learning. The MNIST hand-written digits data is used to demonstrate that combinatorial coverage can be used to select test sets that stress machine learning model performance, to select training sets that lead to robust model performance, and to select data for fine-tuning models to new domains. Thus, the results posit combinatorial coverage as a holistic approach to training and testing for machine learning. In contrast to prior work which has focused on the use of coverage in regard to the internal of neural networks, this paper considers coverage over simple features derived from inputs and outputs. Thus, this paper addresses the case where the supplier of test and training sets for machine learning models does not have intellectual property rights to the models themselves. Finally, the paper addresses prior criticism of combinatorial coverage and provides a rebuttal which advocates the use of coverage metrics in machine learning applications.  ( 2 min )

  • Open

    Agent IRIS*
    In this paper we prototype and explore how multiple agent-based models of robotic factories connected to other robotic factories (represented by their respective models) can be orchestrated using an all-purpose data platform – thereby simulating descriptive and predictive properties of a group of factories (a factory cluster). For the underlying prototype, NetLogo suite was used to do factory agent-based simulation (re-using “Robotic Factory” model [1]) while InterSystems IRIS data platform was used for NetLogo orchestration and factory/cluster end-to-end simulation. The post Agent IRIS* appeared first on Data Science Central.  ( 15 min )
    Data Literacy Education Framework – Part 2
    What do we need to do to increase the data literacy of our organization? In a world where your personal data, and the preferences and biases buried in that data, are being used to influence your behaviors, beliefs, and decisions, data literacy becomes a fundamental skill.  And it’s not just corporations that need this training. … Read More »Data Literacy Education Framework – Part 2 The post Data Literacy Education Framework – Part 2 appeared first on Data Science Central.  ( 8 min )
    Divergent thinking and true AI innovation
    Research and Markets estimated that annual global sales of information technology reached nearly $8.4 trillion in 2021. At that level, IT sales made up just less than 9% total estimated global annual gross domestic product (GDP). Global IT sales tend to grow about 6.6 percent annually. For the sake of argument, let’s assume that the… Read More »Divergent thinking and true AI innovation The post Divergent thinking and true AI innovation appeared first on Data Science Central.  ( 5 min )
  • Open

    RL, Demonstration, and Imitation Learning
    Hello, I have a RL agent who succeeded in learning the optimal policy for a certain problem. I slightly increased the complexity of the problem, and I want to train a new agent (the previously trained agent as is cannot perfectly tackle the new problem). I am thinking about introducing "some" trajectories in the new learning process using the previous agent, as it could help speeding up the learning process. I have seen some works combining imitation learning with RL, but in such works two different loss functions are used (one for IL and one for RL). I want to do the above while maintaining the same RL loss function I am currently using (PPO), would this still be considered some form of RL-IM? submitted by /u/AhmedNizam_ [link] [comments]  ( 1 min )
    "Intelligence and Unambitiousness Using Algorithmic Information Theory", Cohen et al 2021
    submitted by /u/gwern [link] [comments]
    "Bootstrapped Meta-Learning", Flennerhag et al 2021 {D}
    submitted by /u/gwern [link] [comments]
    "Here Come the Underdogs of the Robot Olympics: Can a scrappy team of robotics undergrads play to win against the best-funded labs in the world? At the Darpa Challenge, it’s anyone’s game"
    submitted by /u/gwern [link] [comments]
    "Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning (ExoRL)", Yarats et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Advantage Actor-Critic for minimizing the reward
    In this case I just have to use gradient descent instead of instead, which means that I need to subtract instead of add, for the weight update. Is that correct or am I overlooking something? submitted by /u/luzariuSsuckSs [link] [comments]  ( 1 min )
    Gym/MuJoCo swimming worm + policy gradient
    A short write-up with videos for solving the Gym/MuJoCo swimming worm with a policy gradient method. submitted by /u/ExploitExplore [link] [comments]
    Help in modeling states/observations
    Hello, I have a problem in Multi-Agent RL, where agents need to navigate the environment searching for an object. I use PPO with actor/critic networks being Convolutional Nets. An agent observes it's own location, the search history, and other agents' locations. These observations are in the form of grid maps (the search area is represented as a grid map, as shown below). The actor network for an agent takes these 3 maps and produces an action. Own Location, Other Agents' Locations, Search history This works fine, but the learning takes long. Am I introducing an overhead for representing agents' locations using Maps instead of just feeding x,y coordinates to a Feed forward network (FFN)? The reason I use maps is that I believe it would be more scalable with increasing number of agents, i.e. with more agents you get fil more pixels in the map, rather than increasing the number of inputs (as would be the case with using FFN). Any insights on how to better represent such observations? submitted by /u/AhmedNizam_ [link] [comments]  ( 1 min )
    "Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error", Fujimoto et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Can Wikipedia Help Offline Reinforcement Learning?", Reid et al 2022 (text-pretrained Decision Transformers, but not CLIP/iGPT, more sample-efficient)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    [D] Validation metrics improving despite validation loss increasing
    Hey ML Reddit! So just recently I've been fine-tuning CLIP on medical text-image pairs using contrastive learning. In general this is working, but I came upon a weird phenomenon. After only a couple of epochs the validation loss stagnates and then starts increasing. This would expected if the model is actually overfitting but all my other metrics are actually improving (see the zero-shot accuracy and the retrieval@1). Does anyone have an idea how this can be happening? https://preview.redd.it/p8x714wzraf81.png?width=1127&format=png&auto=webp&s=d729168702ee1e61ceefabf575013883af9fcd07 submitted by /u/Antonenanenas [link] [comments]  ( 1 min )
    [D] Are there any papers that discuss techniques for modifying image-based deep learning models for fast inference on edge devices?
    I'm looking for papers that discuss general techniques for "scaling down" existing image-based deep learning models (across any task), so that they can be used for fast inference on mobile devices. Are there any such papers you've come across? submitted by /u/rickracklove [link] [comments]  ( 1 min )
    [D] HuggingFace ecosystem vs. Pytorch Lightning for big research NLP project with many collaborators.
    Greetings Reddit ML, TL;DR: HuggningFace and Pytorch Lghtning have overlapping functionality in some regard and I am asking the community here about the following: What is the best framework to go with when structuring code for collaboration and ease of use in a large research project? What are your experiences with either? Success/Failures, lessons learned? The nature of the research project is quite cutting edge and already uses HuggingFace to utilise pretrained models + the "accelerate" package by HuggingFace for distributed training. Ie. pretrained models of HuggingFace WILL still be used even if we decide to move to Pytorch Lightning structure ofmodules, distributed training, trainer, etc. Beside HuggingFace models, the code is written in Pytorch. For the past few weeks I hav…  ( 2 min )
    [R] Benchmarks for Physics-Informed Machine Learning
    There's been an invasion of PIML papers recently. Unlike for CV, NLP or RL*, there seem to be no agreed benchmarks for PIML, and people literally republish stuff that had been already presented before (sometimes with worse results, or altering the validation procedure slightly in order to cheat). Especially the way the computational cost is estimated, is ridiculous: some approaches, which require thousands of hours of computation on expensive HPC clusters in order to generate a training set, are peddled as faster and better than traditional solvers, when they actually took much longer to train than a traditional solver, and produce much worse results on geometries and boundary conditions not in the training set. Are there some benchmarks apart from the overused Burgers equation (sometimes even with a viscosity term added, to make things simpler for the crappy PIML method which cannot handle shocks)? Cc u/chrisrackauckas PS a serious benchmark should not only specify a (system of) PDE, but also the initial and boundary conditions, a set of agreed-upon values of the PDE parameters (e.g., the viscosity in the case of the viscous Burgers equation), etc. Also, benchmarks for inverse problems would be needed too. *RL would probably merit a separate discussion, because even if there are some reference tasks/environments used as benchmarks, the question of how to fairly compare different RL methods and how to take into account result stochasticity is delicate. But you got my point submitted by /u/Best-Neat-9439 [link] [comments]  ( 2 min )
    Which machine learning model for predictive maintenance? (combination of time series and regression) [P]
    Goal and type of data I want to predict the thickness of the contact wire of railways. The contact wire wears down every time a train passes due to the friction between the pantograph and the wire. Besides the traffic on the track, also other factors play a role in the wear rate. Every year, a measurement train measures the thickness of the contact wire and several other parameters for the catenary. This measurement is done for every 25 centimetres of the rail track. This means that for every 25 centimetres multiple parameters are measured over several years. Difficulties with regression and time series alone I could train this model in such a way that it learns what the thickness will be for a certain combination of feature values (regression). So, for the same combination of values,…  ( 6 min )
    [D] Current State of JAX vs Pytorch?
    Any thoughts about this? How do their inner workings differ and what should I consider before learning one? I read that the documentation on JAX Is lacking in other reddit comparison posts, but they are already a year old, is this still the case? And is JAX still growing or does it look as if Google might abandon it? submitted by /u/future_gcp_poweruser [link] [comments]  ( 5 min )
    [R] Variational Neural Cellular Automata
    submitted by /u/hardmaru [link] [comments]
    [P] How to figure out if there is something wrong with a dataset?
    Hello, I have been working on binary classification task for a while and I feel like I tried every model, but neither of them is performing well. I have a very small (100 rows) tabular medical dataset with approximately 15 features. I don't have a neuroscience background so I don't know what most of the columns mean. Is there a way to see if I am wasting time on this dataset? If the correlation of all independent variables to the target variable is less than 0.5, would that mean this data is not sufficient for building such a model for prediction? Could you recommend me some tests to check if the dataset is too noisy? Any insight into this issue would be greatly appreciated. Thank you submitted by /u/chazy07 [link] [comments]  ( 1 min )
    [R] Can Wikipedia Help Offline Reinforcement Learning?
    submitted by /u/hardmaru [link] [comments]
    [R] Diffusion models are autoencoders
    Hello /r/ML, long time no see! Diffusion models are actually autoencoders in disguise, and this is a connection that I feel is generally underappreciated. I rarely seem to get around to writing blog posts (tweets are easier...), but I thought this was worth expanding on. So I did. Let me know what you think :) https://benanne.github.io/2022/01/31/diffusion.html submitted by /u/benanne [link] [comments]  ( 1 min )
    [D] *Orthogonal* Random Features (Or Naive Linear Models)
    Linear models are some of the most useful in all of statistics. Unfortunately, as the number of features grows large, confidence in your parameters drops quickly. When the number of parameters is greater than your number of datapoints, the model loses all statistical validity. It is instructive to take a closer look at the closed-form solution for linear regression: w = np.linalg.inv(X.T @ X) @ X' @ y (Note that X is an n-by-d matrix and y is an n-by-1 matrix; w is the vector we're solving for and is d-by-1) When X' X is diagonal our features are linearly independent. It also means computing w becomes extremely easy, taking only O(nd + d^2) time, rather than O(nd^2 + d^3) time (and avoiding an annoying matrix inverse). For one particular feature, the coefficient is just: x = X[:,i] …  ( 2 min )
  • Open

    The downside of machine learning in health care
    Assistant Professor Marzyeh Ghassemi explores how hidden biases in medical data could compromise artificial intelligence approaches.  ( 6 min )
    2021-22 Takeda Fellows: Leaning on AI to advance medicine for humans
    New fellows are working on electronic health record algorithms, remote sensing data related to environmental health, and neural networks for the development of antibiotics.  ( 6 min )
    Artificial intelligence system rapidly predicts how two proteins will attach
    The machine-learning model could help scientists speed the development of new medicines.  ( 6 min )
  • Open

    What do you think will be the best use of AI in 2022 in short?
    Would be interesting to hear some great responses :) - kbf_ submitted by /u/kbf_ [link] [comments]
    Data Is More Valuable When It Can Be Shared
    submitted by /u/Repeat-or [link] [comments]  ( 1 min )
    The most important AI Conferences 2022 (link with registration fees in comments)
    submitted by /u/amaigmbh [link] [comments]  ( 1 min )
    AI Will Augment Our Understanding of the Universe and Space
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Cornell And NTT Researchers Introduces Deep Physical Neural Networks To Train Physical Systems To Perform Machine Learning Computations Using Backpropagation
    Deep-learning models have become commonplace in all fields of research and engineering. However, their energy requirements are limiting their ability to scale. Synergistic hardware has contributed to the widespread use of deep neural networks (DNNs). DNN ‘accelerators,’ generally based on direct mathematical isomorphism between the hardware physics and the mathematical processes in DNNs, have been inspired by the developing DNN energy problem. One of the methods to achieve effective machine learning is to apply trained mathematical transformations by designing hardware for tight, operation-by-operation mathematical isomorphism. A more efficient way would be to directly train the physical changes of the hardware to perform specific computations. Continue Reading Paper: https://www.nature.com/articles/s41586-021-04223-6.pdf https://preview.redd.it/g1elygj999f81.png?width=1394&format=png&auto=webp&s=f30155419189c550e16adee2640d07ef064c93e2 submitted by /u/ai-lover [link] [comments]  ( 1 min )
    The EU and US are starting to align on AI regulation
    submitted by /u/mr_j_b [link] [comments]
    THE FUTURE OF AI: Machine Learning and Knowledge Graphs
    submitted by /u/mr_j_b [link] [comments]
    Impact of using AI to generate primers for detection of SARS-CoV-2 Omicron variant
    submitted by /u/mr_j_b [link] [comments]
    Giving an AI control of nuclear weapons: What could possibly go wrong? - Bulletin of the Atomic Scientists
    submitted by /u/mr_j_b [link] [comments]
    Alpine: Renault’s AI input can help push up F1 grid
    submitted by /u/mr_j_b [link] [comments]
    AI Applications Add Speed and Control to Industrial Production
    submitted by /u/mr_j_b [link] [comments]
    Confused about recursion in python-depth first search-:
    # Python dictionary to act as an adjacency list graph = { '7' : ['19','21', '14'], '19': ['1', '12', '31'], '21': [], '14': ['23', '6'], '1' : [], '12': [], '31': [], '23': [], '6' : [] } visited = [] # List of visited nodes of graph. def dfs(visited, graph, node): if node not in visited: visited.append(node) for neighbor in graph[node]: dfs(visited, graph, neighbor) print(node) # Driver Code print("Following is the Depth-First Search") dfs(visited, graph, '7') print("visited=",visited) what I don't understand is how do once this program reaches to print(1) what happens next, it doesn't make any sense to me. idk if I am stupid or what to not realize sth very trivial or idk. I will try to explain what is my problem, step by step. steps-: 1) dfs(visited,graph,7) 2) 7 not in visited. visited=7 dfs(19) 3) 19 not in visited. visited=7,19 dfs(1) 4) 1 not in visited visited=7,19,1 1 has no neighbours. print(1) imo the code should stop now. Because there is no function call no nth. But in fact the code goes to for neighbour in graph(node): dfs(visited,graph,neighbour) and starts with dfs(12). I don't understand this....How does it happen? how can it go to for loop just like that?(source-:https://cscircles.cemc.uwaterloo.ca/visualize#mode=display) even if it doesn't go to for loop, I can't make sense where it really goes. Can you please guide me about this issue? submitted by /u/PrisonSaintGermain [link] [comments]  ( 1 min )
    Chrome Browser Extension to get Research Paper Details
    Link: https://chrome.google.com/webstore/detail/paperslabmlai/agkeoekbbkpokfbpmmminnombikadbhb Identifies research papers mentioned in websites you visit, and the extension shows you the following details about research papers, Two-line summary. Availability of source code, videos, Reddit and Hacker News discussions. Popularity on Twitter. Conferences. Also, we released the source code of the extension, Github: https://github.com/labmlai/chrome-extension We love to hear your feedback and suggestions. Thank you all, and I appreciate the support. submitted by /u/hnipun [link] [comments]  ( 1 min )
    How do I get into AI?
    This question has been raised multiple times on Reddit and is pretty easily googleable. However, due to the rate at which the market changes and the fact that my circumstances are bit unique, I think my question is valid. I'm a 27 y.o. without any Bc. degree but with a 9 years of experience as a software developer. I'm currently working for one of the "big N" companies as a SSE and I'm happy with my career (so far). I specialise in backend technologies. I've been thinking about the direction in which modern technologies are going and my personal conclusion (newsflash!) is that ML and AI-related jobs are going to be more and more in demand. I want to be prepared for that and be competitive, when the time comes. The specific question which I have is, do you guys think I need to go back and finish my Uni? (since a lot of AI-related stuff is very academical and is being researched). EDIT: the question isn't about whether or not I can learn doing AI. What I'm asking is, is the industry set up in a way, which makes it different for non-academics to get into? submitted by /u/nikagam [link] [comments]  ( 1 min )
    What are all websites with news about upcoming or recent A.I./ML apps?
    submitted by /u/Niu_Davinci [link] [comments]
    What happens once dfs reaches leaf node in recursive depth first search-uninformed search-artificial intelligence
    submitted by /u/PrisonSaintGermain [link] [comments]
    Which aspect of organizational cyber security is mostly ignored by even big global organizations?
    View Poll submitted by /u/futureanalytica [link] [comments]
    A New AI Research Propose ‘UniFormer’ (Unified transFormer) to Unify Convolution and Self-Attention for Visual Recognition
    For visual recognition, representation learning is a crucial research area. Essentially, researchers are confronted with two separate issues in visual data, such as photographs and movies. On the one hand, there is a great deal of local redundancy; for example, visual material in a particular region (space, time, or space-time) is often comparable. Inefficient computation is frequently introduced by such localization. Global reliance, on the other hand, is complex; for example, objectives in various regions have dynamic relationships. Such long-distance communication frequently results in inefficient learning. To address these issues, academics have proposed many effective visual recognition models. Convolution Neural Networks (CNNs) and Vision Transformers (ViTs) are two popular backbones, with convolution and self-attention as fundamental processes in these two architectures. Regrettably, each of these operations focuses on one of the aforementioned issues while disregarding the other. Continue Reading Paper: https://arxiv.org/pdf/2201.09450v1.pdf Github: https://github.com/sense-x/uniformer submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Artificial Nightmares : Sombra's Day of The Dead || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
  • Open

    Advancing AI trustworthiness: Updates on responsible AI research
    Inflated expectations around the capabilities of AI technologies may lead people to believe that computers can’t be wrong. The truth is AI failures are not a matter of if but when. AI is a human endeavor that combines information about people and the physical world into mathematical constructs. Such technologies typically rely on statistical methods, with the possibility for errors throughout an AI system’s lifespan. As AI systems become more widely used across domains, especially in high-stakes scenarios where people’s safety and wellbeing can be affected, a critical question must be addressed: how trustworthy are AI systems, and how much and when should people trust AI?  The post Advancing AI trustworthiness: Updates on responsible AI research appeared first on Microsoft Research.  ( 14 min )
  • Open

    Applying Differential Privacy to Large Scale Image Classification
    Posted by Alexey Kurakin, Software Engineer and Roxana Geambasu, Visiting Faculty Researcher, Google Research Machine learning (ML) models are becoming increasingly valuable for improved performance across a variety of consumer products, from recommendations to automatic image classification. However, despite aggregating large amounts of data, in theory it is possible for models to encode characteristics of individual entries from the training set. For example, experiments in controlled settings have shown that language models trained using email datasets may sometimes encode sensitive information included in the training data and may have the potential to reveal the presence of a particular user’s data in the training set. As such, it is important to prevent the encoding of such charac…  ( 7 min )
  • Open

    Associahedron
    The previous post looked at ways of associating a product of four things. This post will look at associating more things, and visualizing the associations. The number of ways to associate a product of n+1 things is the nth Catalan number, defined by The previous post looked at products of 4 things, and there are […] Associahedron first appeared on John D. Cook.  ( 2 min )
    Non-associative multiplication
    There are five ways to parenthesize a product of four things: ((ab)c)d (ab)(cd) (a(b(cd)) (a(bc))d (a((bc)d) In a context where multiplication is not associative, the five products above are not necessarily the same. Maybe all five are different. This post will give two examples where the products above are all different: octonions and matrix multiplication. […] Non-associative multiplication first appeared on John D. Cook.  ( 4 min )
  • Open

    Naive Bayes Python Implementation and Understanding
    Implementation of Naive Bayes Classifier in Python  ( 5 min )
    Barriers to AI adoption: challenges faced and ways to overcome
    The most frequent bottlenecks of AI adoption and best practices to manage adoption risks based on recent researches and our experience.  ( 3 min )
  • Open

    Support for New NVIDIA RTX 3080 Ti, 3070 Ti Studio Laptops Now Available in February Studio Driver
    Support for the new GeForce RTX 3080 Ti and 3070 Ti Laptop GPUs is available today in the February Studio driver. Updated monthly, NVIDIA Studio drivers support NVIDIA tools and optimize the most popular creative apps, delivering added performance, reliability and speed to creative workflows. Creatives will also benefit from the February Studio driver with Read article > The post Support for New NVIDIA RTX 3080 Ti, 3070 Ti Studio Laptops Now Available in February Studio Driver appeared first on The Official NVIDIA Blog.  ( 4 min )
  • Open

    Error-correction learning
    Hi all, i´m rather new to the application of ANNs but i´m facing an issue in my daily work where i have the feeling that ANNs could help to solve it. Assuming i have a set of data, generated from simulation, as a function f = f(x,y) with the independent parameters x & y. From real testing i know that for certain values of x & y the output f(x,y) doesn´t match with the simulated data. Could some kind of backpropagation approach used to "correct" the data set f(x,y)? The main issue that i see here is that the amount of data from testing only covers a small portion of the 2D-Map f(x,y). submitted by /u/LordCatG [link] [comments]  ( 1 min )
    Padding a Neural Network
    I’m currently working with some data between -.5 and .5 predicting a single variable using a simple neural network running relu and elu. Unfortunately I have a high variability in the number of input variables in each sample. My current thought process is to pad the data, as losing data would be awful, but what value should I use to pad it. Should I use 0 or a value outside of my data range like -1? Thanks in advance for the help! submitted by /u/Tvdybgggh [link] [comments]  ( 1 min )
  • Open

    The Effects of Spectral Dimensionality Reduction on Hyperspectral Pixel Classification: A Case Study. (arXiv:2104.00788v2 [cs.CV] UPDATED)
    This paper presents a systematic study of the effects of hyperspectral pixel dimensionality reduction on the pixel classification task. We use five dimensionality reduction methods -- PCA, KPCA, ICA, AE, and DAE -- to compress 301-dimensional hyperspectral pixels. Compressed pixels are subsequently used to perform pixel classifications. Pixel classification accuracies together with compression method, compression rates, and reconstruction errors provide a new lens to study the suitability of a compression method for the task of pixel classification. We use three high-resolution hyperspectral image datasets, representing three common landscape types (i.e. urban, transitional suburban, and forests) collected by the Remote Sensing and Spatial Ecosystem Modeling laboratory of the University of Toronto. We found that PCA, KPCA, and ICA post greater signal reconstruction capability; however, when compression rates are more than 90\% these methods show lower classification scores. AE and DAE methods post better classification accuracy at 95\% compression rate, however their performance drops as compression rate approaches 97\%. Our results suggest that both the compression method and the compression rate are important considerations when designing a hyperspectral pixel classification pipeline.  ( 2 min )
    Model-Advantage and Value-Aware Models for Model-Based Reinforcement Learning: Bridging the Gap in Theory and Practice. (arXiv:2106.14080v2 [cs.LG] UPDATED)
    This work shows that value-aware model learning, known for its numerous theoretical benefits, is also practically viable for solving challenging continuous control tasks in prevalent model-based reinforcement learning algorithms. First, we derive a novel value-aware model learning objective by bounding the model-advantage i.e. model performance difference, between two MDPs or models given a fixed policy, achieving superior performance to prior value-aware objectives in most continuous control environments. Second, we identify the issue of stale value estimates in naively substituting value-aware objectives in place of maximum-likelihood in dyna-style model-based RL algorithms. Our proposed remedy to this issue bridges the long-standing gap in theory and practice of value-aware model learning by enabling successful deployment of all value-aware objectives in solving several continuous control robotic manipulation and locomotion tasks. Our results are obtained with minimal modifications to two popular and open-source model-based RL algorithms -- SLBO and MBPO, without tuning any existing hyper-parameters, while also demonstrating better performance of value-aware objectives than these baseline in some environments.  ( 2 min )
    Heterogeneous Treatment Effect Estimation using machine learning for Healthcare application: tutorial and benchmark. (arXiv:2109.12769v2 [cs.LG] UPDATED)
    Developing new drugs for target diseases is a time-consuming and expensive task, drug repurposing has become a popular topic in the drug development field. As much health claim data become available, many studies have been conducted on the data. The real-world data is noisy, sparse, and has many confounding factors. In addition, many studies have shown that drugs effects are heterogeneous among the population. Lots of advanced machine learning models about estimating heterogeneous treatment effects (HTE) have emerged in recent years, and have been applied to in econometrics and machine learning communities. These studies acknowledge medicine and drug development as the main application area, but there has been limited translational research from the HTE methodology to drug development. We aim to introduce the HTE methodology to the healthcare area and provide feasibility consideration when translating the methodology with benchmark experiments on healthcare administrative claim data. Also, we want to use benchmark experiments to show how to interpret and evaluate the model when it is applied to healthcare research. By introducing the recent HTE techniques to a broad readership in biomedical informatics communities, we expect to promote the wide adoption of causal inference using machine learning. We also expect to provide the feasibility of HTE for personalized drug effectiveness.  ( 2 min )
    Distributed TD(0) with Almost No Communication. (arXiv:2104.07855v2 [cs.LG] UPDATED)
    We provide a new non-asymptotic analysis of distributed TD(0) with linear function approximation. Our approach relies on "one-shot averaging," where $N$ agents run local copies of TD(0) and average the outcomes only once at the very end. We consider two models: one in which the agents interact with an environment they can observe and whose transitions depends on all of their actions (which we call the global state model), and one in which each agent can run a local copy of an identical Markov Decision Process, which we call the local state model. In the global state model, we show that the convergence rate of our distributed one-shot averaging method matches the known convergence rate of TD(0). By contrast, the best convergence rate in the previous literature showed a rate which, according to the worst-case bounds given, could underperform the non-distributed version by $O(N^3)$ in terms of the number of agents $N$. In the local state model, we demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). As far as we are aware, this is the first result rigorously showing benefits from parallelism for temporal difference methods.  ( 2 min )
    Deep Neural Networks for Rank-Consistent Ordinal Regression Based On Conditional Probabilities. (arXiv:2111.08851v2 [cs.LG] UPDATED)
    In recent times, deep neural networks achieved outstanding predictive performance on various classification and pattern recognition tasks. However, many real-world prediction problems have ordinal response variables, and this ordering information is ignored by conventional classification losses such as the multi-category cross-entropy. Ordinal regression methods for deep neural networks address this. One such method is the CORAL method, which is based on an earlier binary label extension framework and achieves rank consistency among its output layer tasks by imposing a weight-sharing constraint. However, while earlier experiments showed that CORAL's rank consistency is beneficial for performance, the weight-sharing constraint could severely restrict the expressiveness of a deep neural network. In this paper, we propose an alternative method for rank-consistent ordinal regression that does not require a weight-sharing constraint in a neural network's fully connected output layer. We achieve this rank consistency by a novel training scheme using conditional training sets to obtain the unconditional rank probabilities through applying the chain rule for conditional probability distributions. Experiments on various datasets demonstrate the efficacy of the proposed method to utilize the ordinal target information, and the absence of the weight-sharing restriction improves the performance substantially compared to the CORAL reference approach.  ( 2 min )
    Neural forecasting at scale. (arXiv:2109.09705v4 [cs.LG] UPDATED)
    We study the problem of efficiently scaling ensemble-based deep neural networks for multi-step time series (TS) forecasting on a large set of time series. Current state-of-the-art deep ensemble models have high memory and computational requirements, hampering their use to forecast millions of TS in practical scenarios. We propose N-BEATS(P), a global parallel variant of the N-BEATS model designed to allow simultaneous training of multiple univariate TS forecasting models. Our model addresses the practical limitations of related models, reducing the training time by half and memory requirement by a factor of 5, while keeping the same level of accuracy in all TS forecasting settings. We have performed multiple experiments detailing the various ways to train our model and have obtained results that demonstrate its capacity to generalize in various forecasting conditions and setups.  ( 2 min )
    Improving Novelty Detection using the Reconstructions of Nearest Neighbours. (arXiv:2111.06150v2 [cs.LG] UPDATED)
    We show that using nearest neighbours in the latent space of autoencoders (AE) significantly improves performance of semi-supervised novelty detection in both single and multi-class contexts. Autoencoding methods detect novelty by learning to differentiate between the non-novel training class(es) and all other unseen classes. Our method harnesses a combination of the reconstructions of the nearest neighbours and the latent-neighbour distances of a given input's latent representation. We demonstrate that our nearest-latent-neighbours (NLN) algorithm is memory and time efficient, does not require significant data augmentation, nor is reliant on pre-trained networks. Furthermore, we show that the NLN-algorithm is easily applicable to multiple datasets without modification. Additionally, the proposed algorithm is agnostic to autoencoder architecture and reconstruction error method. We validate our method across several standard datasets for a variety of different autoencoding architectures such as vanilla, adversarial and variational autoencoders using either reconstruction, residual or feature consistent losses. The results show that the NLN algorithm grants up to a 17% increase in Area Under the Receiver Operating Characteristics (AUROC) curve performance for the multi-class case and 8% for single-class novelty detection.  ( 2 min )
    Enhancing Cross-Sectional Currency Strategies by Context-Aware Learning to Rank with Self-Attention. (arXiv:2105.10019v2 [q-fin.PM] UPDATED)
    The performance of a cross-sectional currency strategy depends crucially on accurately ranking instruments prior to portfolio construction. While this ranking step is traditionally performed using heuristics, or by sorting the outputs produced by pointwise regression or classification techniques, strategies using Learning to Rank algorithms have recently presented themselves as competitive and viable alternatives. Although the rankers at the core of these strategies are learned globally and improve ranking accuracy on average, they ignore the differences between the distributions of asset features over the times when the portfolio is rebalanced. This flaw renders them susceptible to producing sub-optimal rankings, possibly at important periods when accuracy is actually needed the most. For example, this might happen during critical risk-off episodes, which consequently exposes the portfolio to substantial, unwanted drawdowns. We tackle this shortcoming with an analogous idea from information retrieval: that a query's top retrieved documents or the local ranking context provide vital information about the query's own characteristics, which can then be used to refine the initial ranked list. In this work, we use a context-aware Learning-to-rank model that is based on the Transformer architecture to encode top/bottom ranked assets, learn the context and exploit this information to re-rank the initial results. Backtesting on a slate of 31 currencies, our proposed methodology increases the Sharpe ratio by around 30% and significantly enhances various performance metrics. Additionally, this approach also improves the Sharpe ratio when separately conditioning on normal and risk-off market states.  ( 2 min )
    SOLIS -- The MLOps journey from data acquisition to actionable insights. (arXiv:2112.11925v2 [cs.LG] UPDATED)
    Machine Learning operations is unarguably a very important and also one of the hottest topics in Artificial Intelligence lately. Being able to define very clear hypotheses for actual real-life problems that can be addressed by machine learning models, collecting and curating large amounts of data for model training and validation followed by model architecture search and actual optimization and finally presenting the results fits very well the scenario of Data Science experiments. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems. Automating live configuration mechanisms, on the fly adapting to live or offline data capture and consumption, serving multiple models in parallel either on edge or cloud architectures, addressing specific limitations of GPU memory or compute power, post-processing inference or prediction results and serving those either as APIs or with IoT based communication stacks in the same end-to-end pipeline are the real challenges that we try to address in this particular paper. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all above requirements while using basic cross-platform tensor framework and script language engines.  ( 2 min )
    Variational Wasserstein gradient flow. (arXiv:2112.02424v2 [cs.LG] UPDATED)
    Wasserstein gradient flow has emerged as a promising approach to solve optimization problems over the space of probability distributions. A recent trend is to use the well-known JKO scheme in combination with input convex neural networks to numerically implement the proximal step. The most challenging step, in this setup, is to evaluate functions involving density explicitly, such as entropy, in terms of samples. This paper builds on the recent works with a slight but crucial difference: we propose to utilize a variational formulation of the objective function formulated as maximization over a parametric class of functions. Theoretically, the proposed variational formulation allows the construction of gradient flows directly for empirical distributions with a well-defined and meaningful objective function. Computationally, this approach replaces the computationally expensive step in existing methods, to handle objective functions involving density, with inner loop updates that only require a small batch of samples and scale well with the dimension. The performance and scalability of the proposed method are illustrated with the aid of several numerical experiments involving high-dimensional synthetic and real datasets.  ( 2 min )
    EDEN: Communication-Efficient and Robust Distributed Mean Estimation for Federated Learning. (arXiv:2108.08842v2 [cs.LG] UPDATED)
    Distributed Mean Estimation (DME) is a central building block in federated learning, where clients send local gradients to a parameter server for averaging and updating the model. Due to communication constraints, clients often use lossy compression techniques to compress the gradients, resulting in estimation inaccuracies. DME is more challenging when clients have diverse network conditions, such as constrained communication budgets and packet losses. In such settings, DME techniques often incur a significant increase in the estimation error leading to degraded learning performance. In this work, we propose a robust DME technique named EDEN that naturally handles heterogeneous communication budgets and packet losses. We derive appealing theoretical guarantees for EDEN and evaluate it empirically. Our results demonstrate that EDEN consistently improves over state-of-the-art DME techniques.  ( 2 min )
    One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features. (arXiv:2110.03605v3 [cs.LG] UPDATED)
    It is well understood that modern deep networks are vulnerable to adversarial attacks. However, conventional attack methods fail to produce adversarial perturbations that are intelligible to humans, and they pose limited threats in the physical world. To study feature-class associations in networks and better understand their vulnerability to attacks in the real world, we develop feature-level adversarial perturbations using deep image generators and a novel optimization objective. We term these feature-fool attacks. We show that they are versatile and use them to generate targeted feature-level attacks at the ImageNet scale that are simultaneously interpretable, universal to any source image, and physically-realizable. These attacks reveal spurious, semantically-describable feature/class associations that can be exploited by novel combinations of objects. We use them to guide the design of "copy/paste" adversaries in which one natural image is pasted into another to cause a targeted misclassification. Code is available at https://github.com/thestephencasper/feature_fool.  ( 2 min )
    DAAS: Differentiable Architecture and Augmentation Policy Search. (arXiv:2109.15273v2 [cs.LG] UPDATED)
    Neural architecture search (NAS) has been an active direction of automatic machine learning (Auto-ML), aiming to explore efficient network structures. The searched architecture is evaluated by training on datasets with fixed data augmentation policies. However, recent works on auto-augmentation show that the suited augmentation policies can vary over different structures. Therefore, this work considers the possible coupling between neural architectures and data augmentation and proposes an effective algorithm jointly searching for them. Specifically, 1) for the NAS task, we adopt a single-path based differentiable method with Gumbel-softmax reparameterization strategy due to its memory efficiency; 2) for the auto-augmentation task, we introduce a novel search method based on policy gradient algorithm, which can significantly reduce the computation complexity. Our approach achieves 97.91% accuracy on CIFAR-10 and 76.6% Top-1 accuracy on ImageNet dataset, showing the outstanding performance of our search algorithm.  ( 2 min )
    Hamiltonian Operator Disentanglement of Content and Motion in Image Sequences. (arXiv:2112.01641v2 [cs.CV] UPDATED)
    We introduce a deep generative model for image sequences that reliably factorise the latent space into content and motion variables. To model the diverse dynamics, we split the motion space into subspaces and introduce a unique Hamiltonian operator for each subspace. The Hamiltonian formulation provides reversible dynamics that constrain the evolution of the motion path along the low-dimensional manifold and conserves learnt invariant properties. The explicit split of the motion space decomposes the Hamiltonian into symmetry groups and gives long-term separability of the dynamics. This split also means we can learn content representations that are easy to interpret and control. We demonstrate the utility of our model by swapping the motion of two videos, generating long term sequences of various actions from a given image, unconditional sequence generation and image rotations.  ( 2 min )
    Neural Stochastic Partial Differential Equations: Resolution-Invariant Learning of Continuous Spatiotemporal Dynamics. (arXiv:2110.10249v4 [cs.LG] UPDATED)
    Stochastic partial differential equations (SPDEs) are the mathematical tool of choice for modelling spatiotemporal PDE-dynamics under the influence of randomness. Based on the notion of mild solution of an SPDE, we introduce a novel neural architecture to learn solution operators of PDEs with (possibly stochastic) forcing from partially observed data. The proposed Neural SPDE model provides an extension to two popular classes of physics-inspired architectures. On the one hand, it extends Neural CDEs and variants -- continuous-time analogues of RNNs -- in that it is capable of processing incoming sequential information arriving irregularly in time and observed at arbitrary spatial resolutions. On the other hand, it extends Neural Operators -- generalizations of neural networks to model mappings between spaces of functions -- in that it can parameterize solution operators of SPDEs depending simultaneously on the initial condition and a realization of the driving noise. By performing operations in the spectral domain, we show how a Neural SPDE can be evaluated in two ways, either by calling an ODE solver (emulating a spectral Galerkin scheme), or by solving a fixed point problem. Experiments on various semilinear SPDEs, including the stochastic Navier-Stokes equations, demonstrate how the Neural SPDE model is capable of learning complex spatiotemporal dynamics in a resolution-invariant way, with better accuracy and lighter training data requirements compared to alternative models, and up to 3 orders of magnitude faster than traditional solvers.  ( 2 min )
    Biclustering with Alternating K-Means. (arXiv:2009.04550v3 [cs.LG] UPDATED)
    Biclustering is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing block-diagonal biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.  ( 2 min )
    The Hessian Screening Rule. (arXiv:2104.13026v2 [stat.ML] UPDATED)
    Predictor screening rules, which discard predictors from the design matrix before fitting a model, have had considerable impact on the speed with which l1-regularized regression problems, such as the lasso, can be solved. Current state-of-the-art screening rules, however, have difficulties in dealing with highly-correlated predictors, often becoming too conservative. In this paper, we present a new screening rule to deal with this issue: the Hessian Screening Rule. The rule uses second-order information from the model to provide more accurate screening as well as higher-quality warm starts. The proposed rule outperforms all studied alternatives on data sets with high correlation for both l1-regularized least-squares (the lasso) and logistic regression. It also performs best overall on the real data sets that we examine.  ( 2 min )
    The Curse of Zero Task Diversity: On the Failure of Transfer Learning to Outperform MAML and their Empirical Equivalence. (arXiv:2112.13121v2 [cs.LG] UPDATED)
    Recently, it has been observed that a transfer learning solution might be all we need to solve many few-shot learning benchmarks -- thus raising important questions about when and how meta-learning algorithms should be deployed. In this paper, we seek to clarify these questions by proposing a novel metric -- the diversity coefficient -- to measure the diversity of tasks in a few-shot learning benchmark. We hypothesize that the diversity coefficient of the few-shot learning benchmark is predictive of whether meta-learning solutions will succeed or not. Using the diversity coefficient, we show that the MiniImagenet benchmark has zero diversity. This novel insight contextualizes claims that transfer learning solutions are better than meta-learned solutions. Specifically, we empirically find that a diversity coefficient of zero correlates with a high similarity between transfer learning and Model-Agnostic Meta-Learning (MAML) learned solutions in terms of meta-accuracy (at meta-test time). Therefore, we conjecture meta-learned solutions have the same meta-test performance as transfer learning when the diversity coefficient is zero. Our work provides the first test of whether diversity correlates with meta-learning success.  ( 2 min )
    RiskNet: Neural Risk Assessment in Networks of Unreliable Resources. (arXiv:2201.12263v1 [cs.NI])
    We propose a graph neural network (GNN)-based method to predict the distribution of penalties induced by outages in communication networks, where connections are protected by resources shared between working and backup paths. The GNN-based algorithm is trained only with random graphs generated with the Barab\'asi-Albert model. Even though, the obtained test results show that we can precisely model the penalties in a wide range of various existing topologies. GNNs eliminate the need to simulate complex outage scenarios for the network topologies under study. In practice, the whole design operation is limited by 4ms on modern hardware. This way, we can gain as much as over 12,000 times in the speed improvement.  ( 2 min )
    VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. (arXiv:2105.04906v3 [cs.CV] UPDATED)
    Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation. In this paper, we introduce VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually. VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks. In addition, we show that incorporating our new variance term into other methods helps stabilize the training and leads to performance improvements.  ( 2 min )
    Lossy Compression for Lossless Prediction. (arXiv:2106.10800v5 [cs.LG] UPDATED)
    Most data is automatically collected and only ever "seen" by algorithms. Yet, data compressors preserve perceptual fidelity rather than just the information needed by algorithms performing downstream tasks. In this paper, we characterize the bit-rate required to ensure high performance on all predictive tasks that are invariant under a set of transformations, such as data augmentations. Based on our theory, we design unsupervised objectives for training neural compressors. Using these objectives, we train a generic image compressor that achieves substantial rate savings (more than $1000\times$ on ImageNet) compared to JPEG on 8 datasets, without decreasing downstream classification performance.  ( 2 min )
    Differential Privacy of Dirichlet Posterior Sampling. (arXiv:2110.01984v2 [cs.CR] UPDATED)
    Besides the Laplace distribution and the Gaussian distribution, there are many more probability distributions that are not well-understood in terms of privacy-preserving property -- one of which is the Dirichlet distribution. In this work, we study the inherent privacy of releasing a single draw from a Dirichlet posterior distribution (the Dirichlet posterior sampling). As our main result, we provide a simple privacy guarantee of the Dirichlet posterior sampling with the framework of R\'enyi Differential Privacy (RDP). Consequently, the RDP guarantee allows us to derive a simpler form of the $(\varepsilon,\delta)$-differential privacy guarantee compared to those from the previous work. As an application, we use the RDP guarantee to derive a utility guarantee of the Dirichlet posterior sampling for privately releasing a normalized histogram, which is confirmed by our experimental results. Moreover, we demonstrate that the RDP guarantee can be used to track the privacy loss in Bayesian reinforcement learning.  ( 2 min )
    Predicting The Stock Trend Using News Sentiment Analysis and Technical Indicators in Spark. (arXiv:2201.12283v1 [q-fin.ST])
    Predicting the stock market trend has always been challenging since its movement is affected by many factors. Here, we approach the future trend prediction problem as a machine learning classification problem by creating tomorrow_trend feature as our label to be predicted. Different features are given to help the machine learning model predict the label of a given day; whether it is an uptrend or downtrend, those features are technical indicators generated from the stock's price history. In addition, as financial news plays a vital role in changing the investor's behavior, the overall sentiment score on a given day is created from all news released on that day and added to the model as another feature. Three different machine learning models are tested in Spark (big-data computing platform), Logistic Regression, Random Forest, and Gradient Boosting Machine. Random Forest was the best performing model with a 63.58% test accuracy.  ( 2 min )
    Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows. (arXiv:2110.04227v3 [cs.LG] UPDATED)
    We study approximation of probability measures supported on $n$-dimensional manifolds embedded in R^m by injective flows -- neural networks composed of invertible flows and injective layers. We show that in general, injective flows between R^n and R^m universally approximate measures supported on images of extendable embeddings, which are a subset of standard embeddings: when the embedding dimension m is small, topological obstructions may preclude certain manifolds as admissible targets. When the embedding dimension is sufficiently large, m \geq 3n+1, we use an argument from algebraic topology known as the clean trick to prove that the topological obstructions vanish and injective flows universally approximate any differentiable embedding. Along the way we show that the studied injective flows admit efficient projections on the range, and that their optimality can be established "in reverse," resolving a conjecture made in Brehmer and Cranmer 2020  ( 2 min )
    Steerable 3D Spherical Neurons. (arXiv:2106.13863v3 [cs.CV] UPDATED)
    Emerging from low-level vision theory, steerable filters found their counterpart in prior work on steerable convolutional neural networks equivariant to rigid transformations. In our work, we propose a steerable feed-forward learning-based approach that consists of neurons with spherical decision surfaces and operates on point clouds. Such spherical neurons are obtained by conformal embedding of Euclidean space and have recently been revisited in the context of learning representations of point sets. Focusing on 3D geometry, we exploit the isometry property of spherical neurons and derive a 3D steerability constraint. After training spherical neurons to classify point clouds in a canonical orientation, we use a tetrahedron basis to quadruplicate the neurons and construct rotation-equivariant spherical filter banks. We then apply the derived constraint to interpolate the filter bank outputs and, thus, obtain a rotation-invariant network. Finally, we use a synthetic point set and real-world 3D skeleton data to verify our theoretical findings.  ( 2 min )
    Arbitrage-Free Implied Volatility Surface Generation with Variational Autoencoders. (arXiv:2108.04941v3 [q-fin.MF] UPDATED)
    We propose a hybrid method for generating arbitrage-free implied volatility (IV) surfaces consistent with historical data by combining model-free Variational Autoencoders (VAEs) with continuous time stochastic differential equation (SDE) driven models. We focus on two classes of SDE models: regime switching models and L\'evy additive processes. By projecting historical surfaces onto the space of SDE model parameters, we obtain a distribution on the parameter subspace faithful to the data on which we then train a VAE. Arbitrage-free IV surfaces are then generated by sampling from the posterior distribution on the latent space, decoding to obtain SDE model parameters, and finally mapping those parameters to IV surfaces. We further refine the VAE model by including conditional features and demonstrate its superior generative out-of-sample performance.  ( 2 min )
    Planning to Fairly Allocate: Probabilistic Fairness in the Restless Bandit Setting. (arXiv:2106.07677v3 [cs.LG] UPDATED)
    Restless and collapsing bandits are often used to model budget-constrained resource allocation in settings where arms have action-dependent transition probabilities, such as the allocation of health interventions among patients. However, state-of-the-art Whittle-index-based approaches to this planning problem either do not consider fairness among arms, or incentivize fairness without guaranteeing it. We thus introduce ProbFair, a probabilistically fair policy that maximizes total expected reward and satisfies the budget constraint while ensuring a strictly positive lower bound on the probability of being pulled at each timestep. We evaluate our algorithm on a real-world application, where interventions support continuous positive airway pressure (CPAP) therapy adherence among patients, as well as on a broader class of synthetic transition matrices. We find that ProbFair preserves utility while providing fairness guarantees.  ( 2 min )
    Cycle Representation Learning for Inductive Relation Prediction. (arXiv:2110.02510v2 [cs.LG] UPDATED)
    In recent years, algebraic topology and its modern development, the theory of persistent homology, has shown great potential in graph representation learning. In this paper, based on the mathematics of algebraic topology, we propose a novel solution for inductive relation prediction, an important learning task for knowledge graph completion. To predict the relation between two entities, one can use the existence of rules, namely a sequence of relations. Previous works view rules as paths and primarily focus on the searching of paths between entities. The space of rules is huge, and one has to sacrifice either efficiency or accuracy. In this paper, we consider rules as cycles and show that the space of cycles has a unique structure based on the mathematics of algebraic topology. By exploring the linear structure of the cycle space, we can improve the searching efficiency of rules. We propose to collect cycle bases that span the space of cycles. We build a novel GNN framework on the collected cycles to learn the representations of cycles, and to predict the existence/non-existence of a relation. Our method achieves state-of-the-art performance on benchmarks.  ( 2 min )
    Neural Tangent Kernel Empowered Federated Learning. (arXiv:2110.03681v2 [cs.LG] UPDATED)
    Federated learning (FL) is a privacy-preserving paradigm where multiple participants jointly solve a machine learning problem without sharing raw data. Unlike traditional distributed learning, a unique characteristic of FL is statistical heterogeneity, namely, data distributions across participants are different from each other. Meanwhile, recent advances in the interpretation of neural networks have seen a wide use of neural tangent kernels (NTKs) for convergence analyses. In this paper, we propose a novel FL paradigm empowered by the NTK framework. The paradigm addresses the challenge of statistical heterogeneity by transmitting update data that are more expressive than those of the conventional FL paradigms. Specifically, sample-wise Jacobian matrices, rather than model weights/gradients, are uploaded by participants. The server then constructs an empirical kernel matrix to update a global model without explicitly performing gradient descent. We further develop a variant with improved communication efficiency and enhanced privacy. Numerical results show that the proposed paradigm can achieve the same accuracy while reducing the number of communication rounds by an order of magnitude compared to federated averaging.  ( 2 min )
    Differentially Private Coordinate Descent for Composite Empirical Risk Minimization. (arXiv:2110.11688v2 [cs.LG] UPDATED)
    Machine learning models can leak information about the data used to train them. To mitigate this issue, Differentially Private (DP) variants of optimization algorithms like Stochastic Gradient Descent (DP-SGD) have been designed to trade-off utility for privacy in Empirical Risk Minimization (ERM) problems. In this paper, we propose Differentially Private proximal Coordinate Descent (DP-CD), a new method to solve composite DP-ERM problems. We derive utility guarantees through a novel theoretical analysis of inexact coordinate descent. Our results show that, thanks to larger step sizes, DP-CD can exploit imbalance in gradient coordinates to outperform DP-SGD. We also prove new lower bounds for composite DP-ERM under coordinate-wise regularity assumptions, that are nearly matched by DP-CD. For practical implementations, we propose to clip gradients using coordinate-wise thresholds that emerge from our theory, avoiding costly hyperparameter tuning. Experiments on real and synthetic data support our results, and show that DP-CD compares favorably with DP-SGD.  ( 2 min )
    Hierarchical Reinforcement Learning with Adversarially Guided Subgoals. (arXiv:2201.09635v2 [cs.LG] UPDATED)
    Hierarchical reinforcement learning (HRL) proposes to solve difficult tasks by performing decision-making and control at successively higher levels of temporal abstraction. However, off-policy HRL often suffers from the problem of non-stationary high-level policy since the low-level policy is constantly changing. In this paper, we propose a novel HRL approach for mitigating the non-stationarity by adversarially enforcing the high-level policy to generate subgoals compatible with the current instantiation of the low-level policy. In practice, the adversarial learning is implemented by training a simple discriminator network concurrently with the high-level policy which determines the compatibility level of subgoals. Experiments with state-of-the-art algorithms show that our approach improves both HRL learning efficiency and overall performance in various challenging continuous control tasks.  ( 2 min )
    Automated Feature Extraction on AsMap for Emotion Classification using EEG. (arXiv:2201.12055v1 [cs.LG])
    Emotion recognition using EEG has been widely studied to address the challenges associated with affective computing. Using manual feature extraction method on EEG signals result in sub-optimal performance by the learning models. With the advancements in deep learning as a tool for automated feature engineering, in this work a hybrid of manual and automatic feature extraction method has been proposed. The asymmetry in the different brain regions are captured in a 2-D vector, termed as AsMap from the differential entropy (DE) features of EEG signals. These AsMaps are then used to extract features automatically using Convolutional Neural Network (CNN) model. The proposed feature extraction method has been compared with DE and other DE-based feature extraction methods such as RASM, DASM and DCAU. Experiments are conducted using DEAP and SEED dataset on different classification problems based on number of classes. Results obtained indicate that the proposed method of feature extraction results in higher classification accuracy outperforming the DE based feature extraction methods. Highest classification accuracy of 97.10% is achieved on 3-class classification problem using SEED dataset. Further, the impact of window size on classification accuracy has also been assessed in this work.  ( 2 min )
    On the Convergence of Heterogeneous Federated Learning with Arbitrary Adaptive Online Model Pruning. (arXiv:2201.11803v1 [cs.LG])
    One of the biggest challenges in Federated Learning (FL) is that client devices often have drastically different computation and communication resources for local updates. To this end, recent research efforts have focused on training heterogeneous local models obtained by pruning a shared global model. Despite empirical success, theoretical guarantees on convergence remain an open question. In this paper, we present a unifying framework for heterogeneous FL algorithms with {\em arbitrary} adaptive online model pruning and provide a general convergence analysis. In particular, we prove that under certain sufficient conditions and on both IID and non-IID data, these algorithms converges to a stationary point of standard FL for general smooth cost functions, with a convergence rate of $O(\frac{1}{\sqrt{Q}})$. Moreover, we illuminate two key factors impacting convergence: pruning-induced noise and minimum coverage index, advocating a joint design of local pruning masks for efficient training.  ( 2 min )
    Conditional COT-GAN for Video Prediction with Kernel Smoothing. (arXiv:2106.05658v2 [stat.ML] UPDATED)
    Causal Optimal Transport (COT) results from imposing a temporal causality constraint on classic optimal transport problems, which naturally generates a new concept of distances between distributions on path spaces. The first application of the COT theory for sequential learning was given in Xu et al. (2020), where COT-GAN was introduced as an adversarial algorithm to train implicit generative models optimized for producing sequential data. Relying on (Xu et al., 2020), the contribution of the present paper is twofold. First, we develop a conditional version of COT-GAN suitable for sequence prediction. This means that the dataset is now used in order to learn how a sequence will evolve given the observation of its past evolution. Second, we improve on the convergence results by working with modifications of the empirical measures via kernel smoothing due to (Pflug and Pichler (2016)). The resulting kernel conditional COT-GAN algorithm is illustrated with an application for video prediction.  ( 2 min )
    Generative Cooperative Networks for Natural Language Generation. (arXiv:2201.12320v1 [cs.LG])
    Generative Adversarial Networks (GANs) have known a tremendous success for many continuous generation tasks, especially in the field of image generation. However, for discrete outputs such as language, optimizing GANs remains an open problem with many instabilities, as no gradient can be properly back-propagated from the discriminator output to the generator parameters. An alternative is to learn the generator network via reinforcement learning, using the discriminator signal as a reward, but such a technique suffers from moving rewards and vanishing gradient problems. Finally, it often falls short compared to direct maximum-likelihood approaches. In this paper, we introduce Generative Cooperative Networks, in which the discriminator architecture is cooperatively used along with the generation policy to output samples of realistic texts for the task at hand. We give theoretical guarantees of convergence for our approach, and study various efficient decoding schemes to empirically achieve state-of-the-art results in two main NLG tasks.  ( 2 min )
    Learning Parametrised Graph Shift Operators. (arXiv:2101.10050v3 [cs.LG] UPDATED)
    In many domains data is currently represented as graphs and therefore, the graph representation of this data becomes increasingly important in machine learning. Network data is, implicitly or explicitly, always represented using a graph shift operator (GSO) with the most common choices being the adjacency, Laplacian matrices and their normalisations. In this paper, a novel parametrised GSO (PGSO) is proposed, where specific parameter values result in the most commonly used GSOs and message-passing operators in graph neural network (GNN) frameworks. The PGSO is suggested as a replacement of the standard GSOs that are used in state-of-the-art GNN architectures and the optimisation of the PGSO parameters is seamlessly included in the model training. It is proved that the PGSO has real eigenvalues and a set of real eigenvectors independent of the parameter values and spectral bounds on the PGSO are derived. PGSO parameters are shown to adapt to the sparsity of the graph structure in a study on stochastic blockmodel networks, where they are found to automatically replicate the GSO regularisation found in the literature. On several real-world datasets the accuracy of state-of-the-art GNN architectures is improved by the inclusion of the PGSO in both node- and graph-classification tasks.  ( 2 min )
    Optimizing model-agnostic Random Subspace ensembles. (arXiv:2109.03099v2 [cs.LG] UPDATED)
    This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by using an importance sampling approach that circumvents frequent re-training of the base models after each gradient descent step. While the degree of randomization is controlled by a hyper-parameter in standard Random Subspace, it has the advantage to be automatically tuned in our parametric version. Furthermore, model-agnostic feature importance scores can be easily derived from the trained ensemble model.
    Sum-of-norms clustering does not separate nearby balls. (arXiv:2104.13753v2 [cs.LG] UPDATED)
    Sum-of-norms clustering is a popular convexification of $K$-means clustering. We show that, if the dataset is made of a large number of independent random variables distributed according to the uniform measure on the union of two disjoint balls of unit radius, and if the balls are sufficiently close to one another, then sum-of-norms clustering will typically fail to recover the decomposition of the dataset into two clusters. As the dimension tends to infinity, this happens even when the distance between the centers of the two balls is taken to be as large as $2\sqrt{2}$. In order to show this, we introduce and analyze a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure. In particular, we state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
    Generative Coarse-Graining of Molecular Conformations. (arXiv:2201.12176v1 [cs.LG])
    Coarse-graining (CG) of molecular simulations simplifies the particle representation by grouping selected atoms into pseudo-beads and therefore drastically accelerates simulation. However, such CG procedure induces information losses, which makes accurate backmapping, i.e., restoring fine-grained (FG) coordinates from CG coordinates, a long-standing challenge. Inspired by the recent progress in generative models and equivariant networks, we propose a novel model that rigorously embeds the vital probabilistic nature and geometric consistency requirements of the backmapping transformation. Our model encodes the FG uncertainties into an invariant latent space and decodes them back to FG geometries via equivariant convolutions. To standardize the evaluation of this domain, we further provide three comprehensive benchmarks based on molecular dynamics trajectories. Extensive experiments show that our approach always recovers more realistic structures and outperforms existing data-driven methods with a significant margin.
    HEAT: Hyperedge Attention Networks. (arXiv:2201.12113v1 [cs.LG])
    Learning from structured data is a core machine learning task. Commonly, such data is represented as graphs, which normally only consider (typed) binary relationships between pairs of nodes. This is a substantial limitation for many domains with highly-structured data. One important such domain is source code, where hypergraph-based representations can better capture the semantically rich and structured nature of code. In this work, we present HEAT, a neural model capable of representing typed and qualified hypergraphs, where each hyperedge explicitly qualifies how participating nodes contribute. It can be viewed as a generalization of both message passing neural networks and Transformers. We evaluate HEAT on knowledge base completion and on bug detection and repair using a novel hypergraph representation of programs. In both settings, it outperforms strong baselines, indicating its power and generality.
    Dual Learning Music Composition and Dance Choreography. (arXiv:2201.11999v1 [cs.SD])
    Music and dance have always co-existed as pillars of human activities, contributing immensely to the cultural, social, and entertainment functions in virtually all societies. Notwithstanding the gradual systematization of music and dance into two independent disciplines, their intimate connection is undeniable and one art-form often appears incomplete without the other. Recent research works have studied generative models for dance sequences conditioned on music. The dual task of composing music for given dances, however, has been largely overlooked. In this paper, we propose a novel extension, where we jointly model both tasks in a dual learning approach. To leverage the duality of the two modalities, we introduce an optimal transport objective to align feature embeddings, as well as a cycle consistency loss to foster overall consistency. Experimental results demonstrate that our dual learning framework improves individual task performance, delivering generated music compositions and dance choreographs that are realistic and faithful to the conditioned inputs.
    Evaluating Local Explanations using White-box Models. (arXiv:2106.02488v2 [cs.LG] UPDATED)
    Evaluating explanation techniques using human subjects is costly, time-consuming and can lead to subjectivity in the assessments. To evaluate the accuracy of local explanations, we require access to the true feature importance scores for a given instance. However, the prediction function of a model usually does not decompose into linear additive terms that indicate how much a feature contributes to the output. In this work, we suggest to instead focus on the log odds ratio (LOR) of the prediction function, which naturally decomposes into additive terms for logistic regression and naive Bayes. We demonstrate how we can benchmark different explanation techniques in terms of their similarity to the LOR scores based on our proposed approach. In the experiments, we compare prominent local explanation techniques and find that the performance of the techniques can depend on the underlying model, the dataset, which data point is explained, the normalization of the data and the similarity metric.
    Deep Deterministic Uncertainty: A Simple Baseline. (arXiv:2102.11582v3 [cs.LG] UPDATED)
    Reliable uncertainty from deterministic single-forward pass models is sought after because conventional methods of uncertainty quantification are computationally expensive. We take two complex single-forward-pass uncertainty approaches, DUQ and SNGP, and examine whether they mainly rely on a well-regularized feature space. Crucially, without using their more complex methods for estimating uncertainty, a single softmax neural net with such a feature-space, achieved via residual connections and spectral normalization, *outperforms* DUQ and SNGP's epistemic uncertainty predictions using simple Gaussian Discriminant Analysis *post-training* as a separate feature-space density estimator -- without fine-tuning on OoD data, feature ensembling, or input pre-procressing. This conceptually simple *Deep Deterministic Uncertainty (DDU)* baseline can also be used to disentangle aleatoric and epistemic uncertainty and performs as well as Deep Ensembles, the state-of-the art for uncertainty prediction, on several OoD benchmarks (CIFAR-10/100 vs SVHN/Tiny-ImageNet, ImageNet vs ImageNet-O) as well as in active learning settings across different model architectures, yet is *computationally cheaper*.
    The Price of Majority Support. (arXiv:2201.12303v1 [cs.LG])
    We consider the problem of finding a compromise between the opinions of a group of individuals on a number of mutually independent, binary topics. In this paper, we quantify the loss in representativeness that results from requiring the outcome to have majority support, in other words, the "price of majority support". Each individual is assumed to support an outcome if they agree with the outcome on at least as many topics as they disagree on. Our results can also be seen as quantifying Anscombes paradox which states that topic-wise majority outcome may not be supported by a majority. To measure the representativeness of an outcome, we consider two metrics. First, we look for an outcome that agrees with a majority on as many topics as possible. We prove that the maximum number such that there is guaranteed to exist an outcome that agrees with a majority on this number of topics and has majority support, equals $\ceil{(t+1)/2}$ where $t$ is the total number of topics. Second, we count the number of times a voter opinion on a topic matches the outcome on that topic. The goal is to find the outcome with majority support with the largest number of matches. We consider the ratio between this number and the number of matches of the overall best outcome which may not have majority support. We try to find the maximum ratio such that an outcome with majority support and this ratio of matches compared to the overall best is guaranteed to exist. For 3 topics, we show this ratio to be $5/6\approx 0.83$. In general, we prove an upper bound that comes arbitrarily close to $2\sqrt{6}-4\approx 0.90$ as $t$ tends to infinity. Furthermore, we numerically compute a better upper and a non-matching lower bound in the relevant range for $t$.
    A Transfer Learning and Optimized CNN Based Intrusion Detection System for Internet of Vehicles. (arXiv:2201.11812v1 [cs.CR])
    Modern vehicles, including autonomous vehicles and connected vehicles, are increasingly connected to the external world, which enables various functionalities and services. However, the improving connectivity also increases the attack surfaces of the Internet of Vehicles (IoV), causing its vulnerabilities to cyber-threats. Due to the lack of authentication and encryption procedures in vehicular networks, Intrusion Detection Systems (IDSs) are essential approaches to protect modern vehicle systems from network attacks. In this paper, a transfer learning and ensemble learning-based IDS is proposed for IoV systems using convolutional neural networks (CNNs) and hyper-parameter optimization techniques. In the experiments, the proposed IDS has demonstrated over 99.25% detection rates and F1-scores on two well-known public benchmark IoV security datasets: the Car-Hacking dataset and the CICIDS2017 dataset. This shows the effectiveness of the proposed IDS for cyber-attack detection in both intra-vehicle and external vehicular networks.
    Fast Interpretable Greedy-Tree Sums (FIGS). (arXiv:2201.11931v1 [cs.LG])
    Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in many problems. Here, we propose Fast Interpretable Greedy-Tree Sums (FIGS), an algorithm for fitting concise rule-based models. Specifically, FIGS generalizes the CART algorithm to simultaneously grow a flexible number of trees in a summation. The total number of splits across all the trees can be restricted by a pre-specified threshold, thereby keeping both the size and number of its trees under control. When both are small, the fitted tree-sum can be easily visualized and written out by hand, making it highly interpretable. A partially oracle theoretical result hints at the potential for FIGS to overcome a key weakness of single-tree models by disentangling additive components of generative additive models, thereby reducing redundancy from repeated splits on the same feature. Furthermore, given oracle access to optimal tree structures, we obtain L2 generalization bounds for such generative models in the case of C1 component functions, matching known minimax rates in some cases. Extensive experiments across a wide array of real-world datasets show that FIGS achieves state-of-the-art prediction performance (among all popular rule-based methods) when restricted to just a few splits (e.g. less than 20). We find empirically that FIGS is able to avoid repeated splits, and often provides more concise decision rules than fitted decision trees, without sacrificing predictive performance. All code and models are released in a full-fledged package on Github \url{https://github.com/csinva/imodels}.
    A tomographic workflow to enable deep learning for X-ray based foreign object detection. (arXiv:2201.12184v1 [cs.CV])
    Detection of unwanted (`foreign') objects within products is a common procedure in many branches of industry for maintaining production quality. X-ray imaging is a fast, non-invasive and widely applicable method for foreign object detection. Deep learning has recently emerged as a powerful approach for recognizing patterns in radiographs (i.e., X-ray images), enabling automated X-ray based foreign object detection. However, these methods require a large number of training examples and manual annotation of these examples is a subjective and laborious task. In this work, we propose a Computed Tomography (CT) based method for producing training data for supervised learning of foreign object detection, with minimal labour requirements. In our approach, a few representative objects are CT scanned and reconstructed in 3D. The radiographs that have been acquired as part of the CT-scan data serve as input for the machine learning method. High-quality ground truth locations of the foreign objects are obtained through accurate 3D reconstructions and segmentations. Using these segmented volumes, corresponding 2D segmentations are obtained by creating virtual projections. We outline the benefits of objectively and reproducibly generating training data in this way compared to conventional radiograph annotation. In addition, we show how the accuracy depends on the number of objects used for the CT reconstructions. The results show that in this workflow generally only a relatively small number of representative objects (i.e., fewer than 10) are needed to achieve adequate detection performance in an industrial setting. Moreover, for real experimental data we show that the workflow leads to higher foreign object detection accuracies than with standard radiograph annotation.
    Unsupervised Single-shot Depth Estimation using Perceptual Reconstruction. (arXiv:2201.12170v1 [cs.CV])
    Real-time estimation of actual object depth is a module that is essential to performing various autonomous system tasks such as 3D reconstruction, scene understanding and condition assessment of machinery parts. During the last decade of machine learning, extensive deployment of deep learning methods to computer vision tasks has yielded approaches that succeed in achieving realistic depth synthesis out of a simple RGB modality. While most of these models are based on paired depth data or availability of video sequences and stereo images, methods for single-view depth synthesis in a fully unsupervised setting have hardly been explored. This study presents the most recent advances in the field of generative neural networks, leveraging them to perform fully unsupervised single-shot depth synthesis. Two generators for RGB-to-depth and depth-to-RGB transfer are implemented and simultaneously optimized using the Wasserstein-1 distance and a novel perceptual reconstruction term. To ensure that the proposed method is plausible, we comprehensively evaluate the models using industrial surface depth data as well as the Texas 3D Face Recognition Database and the SURREAL dataset that records body depth. The success observed in this study suggests the great potential for unsupervised single-shot depth estimation in real-world applications.
    Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts. (arXiv:2109.08857v2 [cs.NE] UPDATED)
    Evolutionary algorithms have been used in the digital art scene since the 1970s. A popular application of genetic algorithms is to optimize the procedural placement of vector graphic primitives to resemble a given painting. In recent years, deep learning-based approaches have also been proposed to generate procedural drawings, which can be optimized using gradient descent. In this work, we revisit the use of evolutionary algorithms for computational creativity. We find that modern evolution strategies (ES) algorithms, when tasked with the placement of shapes, offer large improvements in both quality and efficiency compared to traditional genetic algorithms, and even comparable to gradient-based methods. We demonstrate that ES is also well suited at optimizing the placement of shapes to fit the CLIP model, and can produce diverse, distinct geometric abstractions that are aligned with human interpretation of language. Videos and demo: https://es-clip.github.io/
    Equivariant Manifold Flows. (arXiv:2107.08596v2 [stat.ML] UPDATED)
    Tractably modelling distributions over manifolds has long been an important goal in the natural sciences. Recent work has focused on developing general machine learning models to learn such distributions. However, for many applications these distributions must respect manifold symmetries -- a trait which most previous models disregard. In this paper, we lay the theoretical foundations for learning symmetry-invariant distributions on arbitrary manifolds via equivariant manifold flows. We demonstrate the utility of our approach by using it to learn gauge invariant densities over $SU(n)$ in the context of quantum field theory.
    Leveraging deep learning for fully automated NMR protein structure determination. (arXiv:2201.12041v1 [q-bio.BM])
    Nuclear Magnetic Resonance (NMR) spectroscopy is one of the major techniques in structural biology with over 11800 protein structures deposited in the Protein Data Bank. NMR can elucidate structures and dynamics of small and medium size proteins in solution, living cells, and solids, but has been limited by the tedious data analysis process. It typically requires weeks or months of manual work of trained expert to turn NMR measurements into a protein structure. Automation of this process is an open problem, formulated in the field over 30 years ago. Here, we present the first approach that addresses this challenge. Our method, ARTINA, uses as input only NMR spectra and the protein sequence, delivering a structure strictly without any human intervention. Tested on a 100-protein benchmark (1329 2D/3D/4D NMR spectra), ARTINA demonstrated its ability to solve structures with 1.44 {\AA} median RMSD to the PDB reference and 91.36% correct NMR resonance assignments. ARTINA can be used by non-experts, reducing the effort for a protein structure determination by NMR essentially to the preparation of the sample and the spectra measurements.
    Linear Adversarial Concept Erasure. (arXiv:2201.12091v1 [cs.LG])
    Modern neural models trained on textual data rely on pre-trained representations that emerge without direct supervision. As these representations are increasingly being used in real-world applications, the inability to \emph{control} their content becomes an increasingly important problem. We formulate the problem of identifying and erasing a linear subspace that corresponds to a given concept, in order to prevent linear predictors from recovering the concept. We model this problem as a constrained, linear minimax game, and show that existing solutions are generally not optimal for this task. We derive a closed-form solution for certain objectives, and propose a convex relaxation, R-LACE, that works well for others. When evaluated in the context of binary gender removal, the method recovers a low-dimensional subspace whose removal mitigates bias by intrinsic and extrinsic evaluation. We show that the method -- despite being linear -- is highly expressive, effectively mitigating bias in deep nonlinear classifiers while maintaining tractability and interpretability.
    Convergence Guarantees for Deep Epsilon Greedy Policy Learning. (arXiv:2112.03376v2 [cs.LG] UPDATED)
    Policy learning is a quickly growing area. As robotics and computers control day-to-day life, their error rate needs to be minimized and controlled. There are many policy learning methods and bandit methods with provable error rates that accompany them. We show an error or regret bound and convergence of the Deep Epsilon Greedy method which chooses actions with a neural network's prediction. We also show that Epsilon Greedy method regret upper bound is minimized with cubic root exploration. In experiments with the real-world dataset MNIST, we construct a nonlinear reinforcement learning problem. We witness how with either high or low noise, some methods do and some do not converge which agrees with our proof of convergence.
    Discriminative Supervised Subspace Learning for Cross-modal Retrieval. (arXiv:2201.11843v1 [cs.CV])
    Nowadays the measure between heterogeneous data is still an open problem for cross-modal retrieval. The core of cross-modal retrieval is how to measure the similarity between different types of data. Many approaches have been developed to solve the problem. As one of the mainstream, approaches based on subspace learning pay attention to learning a common subspace where the similarity among multi-modal data can be measured directly. However, many of the existing approaches only focus on learning a latent subspace. They ignore the full use of discriminative information so that the semantically structural information is not well preserved. Therefore satisfactory results can not be achieved as expected. We in this paper propose a discriminative supervised subspace learning for cross-modal retrieval(DS2L), to make full use of discriminative information and better preserve the semantically structural information. Specifically, we first construct a shared semantic graph to preserve the semantic structure within each modality. Subsequently, the Hilbert-Schmidt Independence Criterion(HSIC) is introduced to preserve the consistence between feature-similarity and semantic-similarity of samples. Thirdly, we introduce a similarity preservation term, thus our model can compensate for the shortcomings of insufficient use of discriminative data and better preserve the semantically structural information within each modality. The experimental results obtained on three well-known benchmark datasets demonstrate the effectiveness and competitiveness of the proposed method against the compared classic subspace learning approaches.
    The FreshPRINCE: A Simple Transformation Based Pipeline Time Series Classifier. (arXiv:2201.12048v1 [cs.LG])
    There have recently been significant advances in the accuracy of algorithms proposed for time series classification (TSC). However, a commonly asked question by real world practitioners and data scientists less familiar with the research topic, is whether the complexity of the algorithms considered state of the art is really necessary. Many times the first approach suggested is a simple pipeline of summary statistics or other time series feature extraction approaches such as TSFresh, which in itself is a sensible question; in publications on TSC algorithms generalised for multiple problem types, we rarely see these approaches considered or compared against. We experiment with basic feature extractors using vector based classifiers shown to be effective with continuous attributes in current state-of-the-art time series classifiers. We test these approaches on the UCR time series dataset archive, looking to see if TSC literature has overlooked the effectiveness of these approaches. We find that a pipeline of TSFresh followed by a rotation forest classifier, which we name FreshPRINCE, performs best. It is not state of the art, but it is significantly more accurate than nearest neighbour with dynamic time warping, and represents a reasonable benchmark for future comparison.
    Neural Optimal Transport. (arXiv:2201.12220v1 [cs.LG])
    We present a novel neural-networks-based algorithm to compute optimal transport maps and plans for strong and weak transport costs. To justify the usage of neural networks, we prove that they are universal approximators of transport plans between probability distributions. We evaluate the performance of our optimal transport algorithm on toy examples and on the unpaired image-to-image style translation task.
    InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images. (arXiv:2108.13865v2 [cs.CV] UPDATED)
    In this paper, we present InSeGAN, an unsupervised 3D generative adversarial network (GAN) for segmenting (nearly) identical instances of rigid objects in depth images. Using an analysis-by-synthesis approach, we design a novel GAN architecture to synthesize a multiple-instance depth image with independent control over each instance. InSeGAN takes in a set of code vectors (e.g., random noise vectors), each encoding the 3D pose of an object that is represented by a learned implicit object template. The generator has two distinct modules. The first module, the instance feature generator, uses each encoded pose to transform the implicit template into a feature map representation of each object instance. The second module, the depth image renderer, aggregates all of the single-instance feature maps output by the first module and generates a multiple-instance depth image. A discriminator distinguishes the generated multiple-instance depth images from the distribution of true depth images. To use our model for instance segmentation, we propose an instance pose encoder that learns to take in a generated depth image and reproduce the pose code vectors for all of the object instances. To evaluate our approach, we introduce a new synthetic dataset, "Insta-10", consisting of 100,000 depth images, each with 5 instances of an object from one of 10 classes. Our experiments on Insta-10, as well as on real-world noisy depth images, show that InSeGAN achieves state-of-the-art performance, often outperforming prior methods by large margins.
    Exploring Graph Representation of Chorales. (arXiv:2201.11745v1 [cs.LG])
    This work explores areas overlapping music, graph theory, and machine learning. An embedding representation of a node, in a weighted undirected graph $\mathcal{G}$, is a representation that captures the meaning of nodes in an embedding space. In this work, 383 Bach chorales were compiled and represented as a graph. Two application cases were investigated in this paper (i) learning node embedding representation using \emph{Continuous Bag of Words (CBOW), skip-gram}, and \emph{node2vec} algorithms, and (ii) learning node labels from neighboring nodes based on a collective classification approach. The results of this exploratory study ascertains many salient features of the graph-based representation approach applicable to music applications.
    Deep Generative Model for Periodic Graphs. (arXiv:2201.11932v1 [cs.LG])
    Periodic graphs are graphs consisting of repetitive local structures, such as crystal nets and polygon mesh. Their generative modeling has great potential in real-world applications such as material design and graphics synthesis. Classical models either rely on domain-specific predefined generation principles (e.g., in crystal net design), or follow geometry-based prescribed rules. Recently, deep generative models has shown great promise in automatically generating general graphs. However, their advancement into periodic graphs have not been well explored due to several key challenges in 1) maintaining graph periodicity; 2) disentangling local and global patterns; and 3) efficiency in learning repetitive patterns. To address them, this paper proposes Periodical-Graph Disentangled Variational Auto-encoder (PGD-VAE), a new deep generative models for periodic graphs that can automatically learn, disentangle, and generate local and global graph patterns. Specifically, we develop a new periodic graph encoder consisting of global-pattern encoder and local-pattern encoder that ensures to disentangle the representation into global and local semantics. We then propose a new periodic graph decoder consisting of local structure decoder, neighborhood decoder, and global structure decoder, as well as the assembler of their outputs that guarantees periodicity. Moreover, we design a new model learning objective that helps ensure the invariance of local-semantic representations for the graphs with the same local structure. Comprehensive experimental evaluations have been conducted to demonstrate the effectiveness of the proposed method. The code of proposed PGD-VAE is availabe at https://github.com/shi-yu-wang/PGD-VAE.
    Simulating Using Deep Learning The World Trade Forecasting of Export-Import Exchange Rate Convergence Factor During COVID-19. (arXiv:2201.12291v1 [q-fin.ST])
    By trade we usually mean the exchange of goods between states and countries. International trade acts as a barometer of the economic prosperity index and every country is overly dependent on resources, so international trade is essential. Trade is significant to the global health crisis, saving lives and livelihoods. By collecting the dataset called "Effects of COVID19 on trade" from the state website NZ Tatauranga Aotearoa, we have developed a sustainable prediction process on the effects of COVID-19 in world trade using a deep learning model. In the research, we have given a 180-day trade forecast where the ups and downs of daily imports and exports have been accurately predicted in the Covid-19 period. In order to fulfill this prediction, we have taken data from 1st January 2015 to 30th May 2021 for all countries, all commodities, and all transport systems and have recovered what the world trade situation will be in the next 180 days during the Covid-19 period. The deep learning method has received equal attention from both investors and researchers in the field of in-depth observation. This study predicts global trade using the Long-Short Term Memory. Time series analysis can be useful to see how a given asset, security, or economy changes over time. Time series analysis plays an important role in past analysis to get different predictions of the future and it can be observed that some factors affect a particular variable from period to period. Through the time series it is possible to observe how various economic changes or trade effects change over time. By reviewing these changes, one can be aware of the steps to be taken in the future and a country can be more careful in terms of imports and exports accordingly. From our time series analysis, it can be said that the LSTM model has given a very gracious thought of the future world import and export situation in terms of trade.
    On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces. (arXiv:2201.12332v1 [cs.LG])
    We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes the score function associated with a policy is bounded, which {fails to hold even for Gaussian policies. } To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which it is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work, we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of mirror ascent-type updates and gradient tracking. Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes, whereas prior works require these parameters to respectively shrink to null or grow to infinity. Experimentally, this scheme under a heavy-tailed policy parameterization yields improved reward accumulation across a variety of settings as compared with standard benchmarks.
    Sliced-Wasserstein Gradient Flows. (arXiv:2110.10972v2 [cs.LG] UPDATED)
    Minimizing functionals in the space of probability distributions can be done with Wasserstein gradient flows. To solve them numerically, a possible approach is to rely on the Jordan-Kinderlehrer-Otto (JKO) scheme which is analogous to the proximal scheme in Euclidean spaces. However, it requires solving a nested optimization problem at each iteration, and is known for its computational challenges, especially in high dimension. To alleviate it, very recent works propose to approximate the JKO scheme leveraging Brenier's theorem, and using gradients of Input Convex Neural Networks to parameterize the density (JKO-ICNN). However, this method comes with a high computational cost and stability issues. Instead, this work proposes to use gradient flows in the space of probability measures endowed with the sliced-Wasserstein (SW) distance. We argue that this method is more flexible than JKO-ICNN, since SW enjoys a closed-form differentiable approximation. Thus, the density at each step can be parameterized by any generative model which alleviates the computational burden and makes it tractable in higher dimensions.
    Leveraging class abstraction for commonsense reinforcement learning via residual policy gradient methods. (arXiv:2201.12126v1 [cs.AI])
    Enabling reinforcement learning (RL) agents to leverage a knowledge base while learning from experience promises to advance RL in knowledge intensive domains. However, it has proven difficult to leverage knowledge that is not manually tailored to the environment. We propose to use the subclass relationships present in open-source knowledge graphs to abstract away from specific objects. We develop a residual policy gradient method that is able to integrate knowledge across different abstraction levels in the class hierarchy. Our method results in improved sample efficiency and generalisation to unseen objects in commonsense games, but we also investigate failure modes, such as excessive noise in the extracted class knowledge or environments with little class structure.
    Multiple-Source Domain Adaptation via Coordinated Domain Encoders and Paired Classifiers. (arXiv:2201.11870v1 [cs.CL])
    We present a novel multiple-source unsupervised model for text classification under domain shift. Our model exploits the update rates in document representations to dynamically integrate domain encoders. It also employs a probabilistic heuristic to infer the error rate in the target domain in order to pair source classifiers. Our heuristic exploits data transformation cost and the classifier accuracy in the target feature space. We have used real world scenarios of Domain Adaptation to evaluate the efficacy of our algorithm. We also used pretrained multi-layer transformers as the document encoder in the experiments to demonstrate whether the improvement achieved by domain adaptation models can be delivered by out-of-the-box language model pretraining. The experiments testify that our model is the top performing approach in this setting.
    Pseudo-Differential Integral Operator for Learning Solution Operators of Partial Differential Equations. (arXiv:2201.11967v1 [cs.LG])
    Learning mapping between two function spaces has attracted considerable research attention. However, learning the solution operator of partial differential equations (PDEs) remains a challenge in scientific computing. Therefore, in this study, we propose a novel pseudo-differential integral operator (PDIO) inspired by a pseudo-differential operator, which is a generalization of a differential operator and characterized by a certain symbol. We parameterize the symbol by using a neural network and show that the neural-network-based symbol is contained in a smooth symbol class. Subsequently, we prove that the PDIO is a bounded linear operator, and thus is continuous in the Sobolev space. We combine the PDIO with the neural operator to develop a pseudo-differential neural operator (PDNO) to learn the nonlinear solution operator of PDEs. We experimentally validate the effectiveness of the proposed model by using Burgers' equation, Darcy flow, and the Navier-Stokes equation. The results reveal that the proposed PDNO outperforms the existing neural operator approaches in most experiments.
    Cause-Effect Preservation and Classification using Neurochaos Learning. (arXiv:2201.12181v1 [cs.LG])
    Discovering cause-effect from observational data is an important but challenging problem in science and engineering. In this work, a recently proposed brain inspired learning algorithm namely-\emph{Neurochaos Learning} (NL) is used for the classification of cause-effect from simulated data. The data instances used are generated from coupled AR processes, coupled 1D chaotic skew tent maps, coupled 1D chaotic logistic maps and a real-world prey-predator system. The proposed method consistently outperforms a five layer Deep Neural Network architecture for coupling coefficient values ranging from $0.1$ to $0.7$. Further, we investigate the preservation of causality in the feature extracted space of NL using Granger Causality (GC) for coupled AR processes and and Compression-Complexity Causality (CCC) for coupled chaotic systems and real-world prey-predator dataset. This ability of NL to preserve causality under a chaotic transformation and successfully classify cause and effect time series (including a transfer learning scenario) is highly desirable in causal machine learning applications.
    Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. (arXiv:2201.12023v1 [cs.LG])
    Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations, which does not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive the optimal parallel execution plan in each independent parallelism level and implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
    A Review on Deep-Learning Algorithms for Fetal Ultrasound-Image Analysis. (arXiv:2201.12260v1 [eess.IV])
    Deep-learning (DL) algorithms are becoming the standard for processing ultrasound (US) fetal images. Despite a large number of survey papers already present in this field, most of them are focusing on a broader area of medical-image analysis or not covering all fetal US DL applications. This paper surveys the most recent work in the field, with a total of 145 research papers published after 2017. Each paper is analyzed and commented on from both the methodology and application perspective. We categorized the papers in (i) fetal standard-plane detection, (ii) anatomical-structure analysis, and (iii) biometry parameter estimation. For each category, main limitations and open issues are presented. Summary tables are included to facilitate the comparison among the different approaches. Publicly-available datasets and performance metrics commonly used to assess algorithm performance are summarized, too. This paper ends with a critical summary of the current state of the art on DL algorithms for fetal US image analysis and a discussion on current challenges that have to be tackled by researchers working in the field to translate the research methodology into the actual clinical practice.
    A Topological Filter for Learning with Label Noise. (arXiv:2012.04835v2 [cs.CV] UPDATED)
    Noisy labels can impair the performance of deep neural networks. To tackle this problem, in this paper, we propose a new method for filtering label noise. Unlike most existing methods relying on the posterior probability of a noisy classifier, we focus on the much richer spatial behavior of data in the latent representational space. By leveraging the high-order topological information of data, we are able to collect most of the clean data and train a high-quality model. Theoretically we prove that this topological approach is guaranteed to collect the clean data with high probability. Empirical results show that our method outperforms the state-of-the-arts and is robust to a broad spectrum of noise types and levels.
    Improving End-to-End Models for Set Prediction in Spoken Language Understanding. (arXiv:2201.12105v1 [cs.CL])
    The goal of spoken language understanding (SLU) systems is to determine the meaning of the input speech signal, unlike speech recognition which aims to produce verbatim transcripts. Advances in end-to-end (E2E) speech modeling have made it possible to train solely on semantic entities, which are far cheaper to collect than verbatim transcripts. We focus on this set prediction problem, where entity order is unspecified. Using two classes of E2E models, RNN transducers and attention based encoder-decoders, we show that these models work best when the training entity sequence is arranged in spoken order. To improve E2E SLU models when entity spoken order is unknown, we propose a novel data augmentation technique along with an implicit attention based alignment method to infer the spoken order. F1 scores significantly increased by more than 11% for RNN-T and about 2% for attention based encoder-decoder SLU models, outperforming previously reported results.
    Learning Stationary Nash Equilibrium Policies in $n$-Player Stochastic Games with Independent Chains via Dual Mirror Descent. (arXiv:2201.12224v1 [cs.LG])
    We consider a subclass of $n$-player stochastic games, in which players have their own internal state/action spaces while they are coupled through their payoff functions. It is assumed that players' internal chains are driven by independent transition probabilities. Moreover, players can only receive realizations of their payoffs but not the actual functions, nor can they observe each others' states/actions. Under some assumptions on the structure of the payoff functions, we develop efficient learning algorithms based on Dual Averaging and Dual Mirror Descent, which provably converge almost surely or in expectation to the set of $\epsilon$-Nash equilibrium policies. In particular, we derive upper bounds on the number of iterates that scale polynomially in terms of the game parameters to achieve an $\epsilon$-Nash equilibrium policy. Besides Markov potential games and linear-quadratic stochastic games, this work provides another interesting subclass of $n$-player stochastic games that under some assumption provably admit polynomial-time learning algorithm for finding their $\epsilon$-Nash equilibrium policies.
    Exploration With a Finite Brain. (arXiv:2201.11817v1 [cs.LG])
    Equipping artificial agents with useful exploration mechanisms remains a challenge to this day. Humans, on the other hand, seem to manage the trade-off between exploration and exploitation effortlessly. In the present article, we put forward the hypothesis that they accomplish this by making optimal use of limited computational resources. We study this hypothesis by meta-learning reinforcement learning algorithms that sacrifice performance for a shorter description length. The emerging class of models captures human exploration behavior better than previously considered approaches, such as Boltzmann exploration, upper confidence bound algorithms, and Thompson sampling. We additionally demonstrate that changes in description length produce the intended effects: reducing description length captures the behavior of brain-lesioned patients while increasing it echoes cognitive development during adolescence.
    M\"{o}bius Convolutions for Spherical CNNs. (arXiv:2201.12212v1 [cs.CV])
    M\"{o}bius transformations play an important role in both geometry and spherical image processing -- they are the group of conformal automorphisms of 2D surfaces and the spherical equivalent of homographies. Here we present a novel, M\"{o}bius-equivariant spherical convolution operator which we call M\"{o}bius convolution, and with it, develop the foundations for M\"{o}bius-equivariant spherical CNNs. Our approach is based on a simple observation: to achieve equivariance, we only need to consider the lower-dimensional subgroup which transforms the positions of points as seen in the frames of their neighbors. To efficiently compute M\"{o}bius convolutions at scale we derive an approximation of the action of the transformations on spherical filters, allowing us to compute our convolutions in the spectral domain with the fast Spherical Harmonic Transform. The resulting framework is both flexible and descriptive, and we demonstrate its utility by achieving promising results in both shape classification and image segmentation tasks.
    Learning Summary Statistics for Bayesian Inference with Autoencoders. (arXiv:2201.12059v1 [cs.LG])
    For stochastic models with intractable likelihood functions, approximate Bayesian computation offers a way of approximating the true posterior through repeated comparisons of observations with simulated model outputs in terms of a small set of summary statistics. These statistics need to retain the information that is relevant for constraining the parameters but cancel out the noise. They can thus be seen as thermodynamic state variables, for general stochastic models. For many scientific applications, we need strictly more summary statistics than model parameters to reach a satisfactory approximation of the posterior. Therefore, we propose to use the inner dimension of deep neural network based Autoencoders as summary statistics. To create an incentive for the encoder to encode all the parameter-related information but not the noise, we give the decoder access to explicit or implicit information on the noise that has been used to generate the training data. We validate the approach empirically on two types of stochastic models.
    Least Squares Estimation Using Sketched Data with Heteroskedastic Errors. (arXiv:2007.07781v2 [stat.ML] UPDATED)
    Researchers may use a sketch of data of size $m$ instead of the full sample of size $n$ sometimes to relieve computation burden, and other times to maintain data privacy. This paper considers the case when full sample estimation would have required the Eicker-Huber-White robust standard errors to account for heteroskedasticity. We show that random projections have a smoothing effect on the sketched data, with the consequence that the least squares estimates using such sketched data behave 'as if' the errors were homoskedastic. This result is obtained by expressing the difference between the moments computed from the full sample and the sketched data as a degenerate $U$-statistic which is asymptotically normal with a homoskedastic variance when the conditions in Hall (1984) are satisfied. This result also holds for two-stage least squares for which algorithmic and statistical properties are analyzed. Sketches produced by random sampling will not, however, have the effect of homogenizing the error variances.
    LAP: An Attention-Based Module for Faithful Interpretation and Knowledge Injection in Convolutional Neural Networks. (arXiv:2201.11808v1 [cs.CV])
    Despite the state-of-the-art performance of deep convolutional neural networks, they are susceptible to bias and malfunction in unseen situations. The complex computation behind their reasoning is not sufficiently human-understandable to develop trust. External explainer methods have tried to interpret the network decisions in a human-understandable way, but they are accused of fallacies due to their assumptions and simplifications. On the other side, the inherent self-interpretability of models, while being more robust to the mentioned fallacies, cannot be applied to the already trained models. In this work, we propose a new attention-based pooling layer, called Local Attention Pooling (LAP), that accomplishes self-interpretability and the possibility for knowledge injection while improving the model's performance. Moreover, several weakly-supervised knowledge injection methodologies are provided to enhance the process of training. We verified our claims by evaluating several LAP-extended models on three different datasets, including Imagenet. The proposed framework offers more valid human-understandable and more faithful-to-the-model interpretations than the commonly used white-box explainer methods.
    A Survey on Visual Transfer Learning using Knowledge Graphs. (arXiv:2201.11794v1 [cs.CV])
    Recent approaches of computer vision utilize deep learning methods as they perform quite well if training and testing domains follow the same underlying data distribution. However, it has been shown that minor variations in the images that occur when using these methods in the real world can lead to unpredictable errors. Transfer learning is the area of machine learning that tries to prevent these errors. Especially, approaches that augment image data using auxiliary knowledge encoded in language embeddings or knowledge graphs (KGs) have achieved promising results in recent years. This survey focuses on visual transfer learning approaches using KGs. KGs can represent auxiliary knowledge either in an underlying graph-structured schema or in a vector-based knowledge graph embedding. Intending to enable the reader to solve visual transfer learning problems with the help of specific KG-DL configurations we start with a description of relevant modeling structures of a KG of various expressions, such as directed labeled graphs, hypergraphs, and hyper-relational graphs. We explain the notion of feature extractor, while specifically referring to visual and semantic features. We provide a broad overview of knowledge graph embedding methods and describe several joint training objectives suitable to combine them with high dimensional visual embeddings. The main section introduces four different categories on how a KG can be combined with a DL pipeline: 1) Knowledge Graph as a Reviewer; 2) Knowledge Graph as a Trainee; 3) Knowledge Graph as a Trainer; and 4) Knowledge Graph as a Peer. To help researchers find evaluation benchmarks, we provide an overview of generic KGs and a set of image processing datasets and benchmarks including various types of auxiliary knowledge. Last, we summarize related surveys and give an outlook about challenges and open issues for future research.
    Look It Up: Bilingual Dictionaries Improve Neural Machine Translation. (arXiv:2010.05997v2 [cs.CL] UPDATED)
    Despite advances in neural machine translation (NMT) quality, rare words continue to be problematic. For humans, the solution to the rare-word problem has long been dictionaries, but dictionaries cannot be straightforwardly incorporated into NMT. In this paper, we describe a new method for "attaching" dictionary definitions to rare words so that the network can learn the best way to use them. We demonstrate improvements of up to 1.8 BLEU using bilingual dictionaries.
    Neural Approximation of Extended Persistent Homology on Graphs. (arXiv:2201.12032v1 [cs.LG])
    Persistent homology is a widely used theory in topological data analysis. In the context of graph learning, topological features based on persistent homology have been used to capture potentially high-order structural information so as to augment existing graph neural network methods. However, computing extended persistent homology summaries remains slow for large and dense graphs, especially since in learning applications one has to carry out this computation potentially many times. Inspired by recent success in neural algorithmic reasoning, we propose a novel learning method to compute extended persistence diagrams on graphs. The proposed neural network aims to simulate a specific algorithm and learns to compute extended persistence diagrams for new graphs efficiently. Experiments on approximating extended persistence diagrams and several downstream graph representation learning tasks demonstrate the effectiveness of our method. Our method is also efficient; on large and dense graphs, we accelerate the computation by nearly 100 times.
    Parameter Sharing For Heterogeneous Agents in Multi-Agent Reinforcement Learning. (arXiv:2005.13625v7 [cs.LG] UPDATED)
    Parameter sharing, where each agent independently learns a policy with fully shared parameters between all policies, is a popular baseline method for multi-agent deep reinforcement learning. Unfortunately, since all agents share the same policy network, they cannot learn different policies or tasks. This issue has been circumvented experimentally by adding an agent-specific indicator signal to observations, which we term "agent indication." Agent indication is limited, however, in that without modification it does not allow parameter sharing to be applied to environments where the action spaces and/or observation spaces are heterogeneous. This work formalizes the notion of agent indication and proves that it enables convergence to optimal policies for the first time. Next, we formally introduce methods to extend parameter sharing to learning in heterogeneous observation and action spaces, and prove that these methods allow for convergence to optimal policies. Finally, we experimentally confirm that the methods we introduce function empirically, and conduct a wide array of experiments studying the empirical efficacy of many different agent indication schemes for graphical observation spaces.
    Group Chat Ecology in Enterprise Instant Messaging: How Employees Collaborate Through Multi-User Chat Channels on Slack. (arXiv:1906.01756v5 [cs.HC] UPDATED)
    Despite the long history of studying instant messaging usage, we know very little about how today's people participate in group chat channels and interact with others inside a real-world organization. In this short paper, we aim to update the existing knowledge on how group chat is used in the context of today's organizations. The knowledge is particularly important for the new norm of remote works under the COVID-19 pandemic. We have the privilege of collecting two valuable datasets: a total of 4,300 group chat channels in Slack from an R&D department in a multinational IT company; and a total of 117 groups' performance data. Through qualitative coding of 100 randomly sampled group channels from the 4,300 channels dataset, we identified and reported 9 categories such as Project channels, IT-Support channels, and Event channels. We further defined a feature metric with 21 meta features (and their derived features) without looking at the message content to depict the group communication style for these group chat channels, with which we successfully trained a machine learning model that can automatically classify a given group channel into one of the 9 categories. In addition to the descriptive data analysis, we illustrated how these communication metrics can be used to analyze team performance. We cross-referenced 117 project teams and their team-based Slack channels and identified 57 teams that appeared in both datasets, then we built a regression model to reveal the relationship between these group communication styles and the project team performance. This work contributes an updated empirical understanding of human-human communication practices within the enterprise setting, and suggests design opportunities for the future of human-AI communication experience.
    Rethinking Attention-Model Explainability through Faithfulness Violation Test. (arXiv:2201.12114v1 [cs.LG])
    Attention mechanisms are dominating the explainability of deep models. They produce probability distributions over the input, which are widely deemed as feature-importance indicators. However, in this paper, we find one critical limitation in attention explanations: weakness in identifying the polarity of feature impact. This would be somehow misleading -- features with higher attention weights may not faithfully contribute to model predictions; instead, they can impose suppression effects. With this finding, we reflect on the explainability of current attention-based techniques, such as Attentio$\odot$Gradient and LRP-based attention explanations. We first propose an actionable diagnostic methodology (henceforth faithfulness violation test) to measure the consistency between explanation weights and the impact polarity. Through the extensive experiments, we then show that most tested explanation methods are unexpectedly hindered by the faithfulness violation issue, especially the raw attention. Empirical analyses on the factors affecting violation issues further provide useful observations for adopting explanation methods in attention models.
    Interplay between depth of neural networks and locality of target functions. (arXiv:2201.12082v1 [cs.LG])
    It has been recognized that heavily overparameterized deep neural networks (DNNs) exhibit surprisingly good generalization performance in various machine-learning tasks. Although benefits of depth have been investigated from different perspectives such as the approximation theory and the statistical learning theory, existing theories do not adequately explain the empirical success of overparameterized DNNs. In this work, we report a remarkable interplay between depth and locality of a target function. We introduce $k$-local and $k$-global functions, and find that depth is beneficial for learning local functions but detrimental to learning global functions. This interplay is not properly captured by the neural tangent kernel, which describes an infinitely wide neural network within the lazy learning regime.
    Backdoors Stuck At The Frontdoor: Multi-Agent Backdoor Attacks That Backfire. (arXiv:2201.12211v1 [cs.LG])
    Malicious agents in collaborative learning and outsourced data collection threaten the training of clean models. Backdoor attacks, where an attacker poisons a model during training to successfully achieve targeted misclassification, are a major concern to train-time robustness. In this paper, we investigate a multi-agent backdoor attack scenario, where multiple attackers attempt to backdoor a victim model simultaneously. A consistent backfiring phenomenon is observed across a wide range of games, where agents suffer from a low collective attack success rate. We examine different modes of backdoor attack configurations, non-cooperation / cooperation, joint distribution shifts, and game setups to return an equilibrium attack success rate at the lower bound. The results motivate the re-evaluation of backdoor defense research for practical environments.
    The fine line between dead neurons and sparsity in binarized spiking neural networks. (arXiv:2201.11915v1 [cs.NE])
    Spiking neural networks can compensate for quantization error by encoding information either in the temporal domain, or by processing discretized quantities in hidden states of higher precision. In theory, a wide dynamic range state-space enables multiple binarized inputs to be accumulated together, thus improving the representational capacity of individual neurons. This may be achieved by increasing the firing threshold, but make it too high and sparse spike activity turns into no spike emission. In this paper, we propose the use of `threshold annealing' as a warm-up method for firing thresholds. We show it enables the propagation of spikes across multiple layers where neurons would otherwise cease to fire, and in doing so, achieve highly competitive results on four diverse datasets, despite using binarized weights. Source code is available at https://github.com/jeshraghian/snn-tha/
    O-ViT: Orthogonal Vision Transformer. (arXiv:2201.12133v1 [cs.CV])
    Inspired by the tremendous success of the self-attention mechanism in natural language processing, the Vision Transformer (ViT) creatively applies it to image patch sequences and achieves incredible performance. However, the scaled dot-product self-attention of ViT brings about scale ambiguity to the structure of the original feature space. To address this problem, we propose a novel method named Orthogonal Vision Transformer (O-ViT), to optimize ViT from the geometric perspective. O-ViT limits parameters of self-attention blocks to be on the norm-keeping orthogonal manifold, which can keep the geometry of the feature space. Moreover, O-ViT achieves both orthogonal constraints and cheap optimization overhead by adopting a surjective mapping between the orthogonal group and its Lie algebra.We have conducted comparative experiments on image recognition tasks to demonstrate O-ViT's validity and experiments show that O-ViT can boost the performance of ViT by up to 3.6%.
    Adaptive Optimizer for Automated Hyperparameter Optimization Problem. (arXiv:2201.12124v1 [cs.LG])
    The choices of hyperparameters have critical effects on the performance of machine learning models. In this paper, we present a general framework that is able to construct an adaptive optimizer, which automatically adjust the appropriate algorithm and parameters in the process of optimization. Examining the method of adaptive optimizer, we product an example of using genetic algorithm to construct an adaptive optimizer based on Bayesian Optimizer and compared effectiveness with original optimizer. Especially, It has great advantages in parallel optimization.
    Constrained Variational Policy Optimization for Safe Reinforcement Learning. (arXiv:2201.11927v1 [cs.LG])
    Safe reinforcement learning (RL) aims to learn policies that satisfy certain constraints before deploying to safety-critical applications. Primal-dual as a prevalent constrained optimization framework suffers from instability issues and lacks optimality guarantees. This paper overcomes the issues from a novel probabilistic inference perspective and proposes an Expectation-Maximization style approach to learn safe policy. We show that the safe RL problem can be decomposed to 1) a convex optimization phase with a non-parametric variational distribution and 2) a supervised learning phase. We show the unique advantages of constrained variational policy optimization by proving its optimality and policy improvement stability. A wide range of experiments on continuous robotic tasks show that the proposed method achieves significantly better performance in terms of constraint satisfaction and sample efficiency than primal-dual baselines.
    Dynamic Temporal Reconciliation by Reinforcement learning. (arXiv:2201.11964v1 [cs.LG])
    Planning based on long and short term time series forecasts is a common practice across many industries. In this context, temporal aggregation and reconciliation techniques have been useful in improving forecasts, reducing model uncertainty, and providing a coherent forecast across different time horizons. However, an underlying assumption spanning all these techniques is the complete availability of data across all levels of the temporal hierarchy, while this offers mathematical convenience but most of the time low frequency data is partially completed and it is not available while forecasting. On the other hand, high frequency data can significantly change in a scenario like the COVID pandemic and this change can be used to improve forecasts that will otherwise significantly diverge from long term actuals. We propose a dynamic reconciliation method whereby we formulate the problem of informing low frequency forecasts based on high frequency actuals as a Markov Decision Process (MDP) allowing for the fact that we do not have complete information about the dynamics of the process. This allows us to have the best long term estimates based on the most recent data available even if the low frequency cycles have only been partially completed. The MDP has been solved using a Time Differenced Reinforcement learning (TDRL) approach with customizable actions and improves the long terms forecasts dramatically as compared to relying solely on historical low frequency data. The result also underscores the fact that while low frequency forecasts can improve the high frequency forecasts as mentioned in the temporal reconciliation literature (based on the assumption that low frequency forecasts have lower noise to signal ratio) the high frequency forecasts can also be used to inform the low frequency forecasts.
    Prediction of GPU Failures Under Deep Learning Workloads. (arXiv:2201.11853v1 [cs.LG])
    Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. Meanwhile, GPU failures, which are inevitable, cause severe consequences in DL tasks: they disrupt distributed trainings, crash inference services, and result in service level agreement violations. To mitigate the problem caused by GPU failures, we propose to predict failures by using ML models. This paper is the first to study prediction models of GPU failures under large-scale production deep learning workloads. As a starting point, we evaluate classic prediction models and observe that predictions of these models are both inaccurate and unstable. To improve the precision and stability of predictions, we propose several techniques, including parallel and cascade model-ensemble mechanisms and a sliding training method. We evaluate the performances of our various techniques on a four-month production dataset including 350 million entries. The results show that our proposed techniques improve the prediction precision from 46.3\% to 84.0\%.
    Universal Online Convex Optimization with Minimax Optimal Second-Order Dynamic Regret. (arXiv:1907.00497v2 [math.OC] UPDATED)
    We introduce an online convex optimization algorithm using projected subgradient descent with optimal adaptive learning rates, with sequential and efficient first-order updates. Our method provides a subgradient adaptive minimax optimal dynamic regret guarantee for a sequence of general convex functions with no known additional properties such as strong-convexity, smoothness, exp-concavity or even Lipschitz-continuity. The guarantee is against any comparator decision sequence with bounded "complexity", defined by the cumulative distance traveled via changes between successive decisions. We show optimality by generating a lower bound of the worst-case second-order dynamic regret, which incorporates actual subgradient norms and matches with our guarantees within a constant factor. We also derive the extension for independent learning in each decision coordinate separately. Additionally, we demonstrate how to best preserve our guarantees when the bound on total successive changes in the dynamic comparator sequence grows in time or the feedback regarding such bound arrives partially with time, both in a truly online manner. Then, as a major contribution, we examine the scenario when we receive no information regarding the successive changes, but instead, by a unique re-purposing of the expert mixture framework with novel additions, we eliminate the need of such information in, again, a truly online manner. Moreover, we show the ability to compete against all dynamic comparator sequences simultaneously (universally) with minimax optimality, where the guarantees depend on the "complexity" of each comparator separately. We also discuss potential modifications to our approach which addresses further complexity reductions for time, computation, memory, and we also further the universal competitiveness via guarantees taking into account concentrations of a comparator sequence in the decision set.
    Efficient Embedding of Semantic Similarity in Control Policies via Entangled Bisimulation. (arXiv:2201.12300v1 [cs.LG])
    Learning generalizeable policies from visual input in the presence of visual distractions is a challenging problem in reinforcement learning. Recently, there has been renewed interest in bisimulation metrics as a tool to address this issue; these metrics can be used to learn representations that are, in principle, invariant to irrelevant distractions by measuring behavioural similarity between states. An accurate, unbiased, and scalable estimation of these metrics has proved elusive in continuous state and action scenarios. We propose entangled bisimulation, a bisimulation metric that allows the specification of the distance function between states, and can be estimated without bias in continuous state and action spaces. We show how entangled bisimulation can meaningfully improve over previous methods on the Distracting Control Suite (DCS), even when added on top of data augmentation techniques.
    Using Shape Metrics to Describe 2D Data Points. (arXiv:2201.11857v1 [cs.LG])
    Traditional machine learning (ML) algorithms, such as multiple regression, require human analysts to make decisions on how to treat the data. These decisions can make the model building process subjective and difficult to replicate for those who did not build the model. Deep learning approaches benefit by allowing the model to learn what features are important once the human analyst builds the architecture. Thus, a method for automating certain human decisions for traditional ML modeling would help to improve the reproducibility and remove subjective aspects of the model building process. To that end, we propose to use shape metrics to describe 2D data to help make analyses more explainable and interpretable. The proposed approach provides a foundation to help automate various aspects of model building in an interpretable and explainable fashion. This is particularly important in applications in the medical community where the `right to explainability' is crucial. We provide various simulated data sets ranging from probability distributions, functions, and model quality control checks (such as QQ-Plots and residual analyses from ordinary least squares) to showcase the breadth of this approach.
    Classification of White Blood Cell Leukemia with Low Number of Interpretable and Explainable Features. (arXiv:2201.11864v1 [eess.IV])
    White Blood Cell (WBC) Leukaemia is detected through image-based classification. Convolutional Neural Networks are used to learn the features needed to classify images of cells a malignant or healthy. However, this type of model requires learning a large number of parameters and is difficult to interpret and explain. Explainable AI (XAI) attempts to alleviate this issue by providing insights to how models make decisions. Therefore, we present an XAI model which uses only 24 explainable and interpretable features and is highly competitive to other approaches by outperforming them by about 4.38\%. Further, our approach provides insight into which variables are the most important for the classification of the cells. This insight provides evidence that when labs treat the WBCs differently, the importance of various metrics changes substantially. Understanding the important features for classification is vital in medical imaging diagnosis and, by extension, understanding the AI models built in scientific pursuits.
    Learning Curves for Decision Making in Supervised Machine Learning -- A Survey. (arXiv:2201.12150v1 [cs.LG])
    Learning curves are a concept from social sciences that has been adopted in the context of machine learning to assess the performance of a learning algorithm with respect to a certain resource, e.g. the number of training examples or the number of training iterations. Learning curves have important applications in several contexts of machine learning, most importantly for the context of data acquisition, early stopping of model training and model selection. For example, by modelling the learning curves, one can assess at an early stage whether the algorithm and hyperparameter configuration have the potential to be a suitable choice, often speeding up the algorithm selection process. A variety of approaches has been proposed to use learning curves for decision making. Some models answer the binary decision question of whether a certain algorithm at a certain budget will outperform a certain reference performance, whereas more complex models predict the entire learning curve of an algorithm. We contribute a framework that categorizes learning curve approaches using three criteria: the decision situation that they address, the intrinsic learning curve question that they answer and the type of resources that they use. We survey papers from literature and classify them into this framework.
    From data to functa: Your data point is a function and you should treat it like one. (arXiv:2201.12204v1 [cs.LG])
    It is common practice in deep learning to represent a measurement of the world on a discrete grid, e.g. a 2D grid of pixels. However, the underlying signal represented by these measurements is often continuous, e.g. the scene depicted in an image. A powerful continuous alternative is then to represent these measurements using an implicit neural representation, a neural function trained to output the appropriate measurement value for any input spatial location. In this paper, we take this idea to its next level: what would it take to perform deep learning on these functions instead, treating them as data? In this context we refer to the data as functa, and propose a framework for deep learning on functa. This view presents a number of challenges around efficient conversion from data to functa, compact representation of functa, and effectively solving downstream tasks on functa. We outline a recipe to overcome these challenges and apply it to a wide range of data modalities including images, 3D shapes, neural radiance fields (NeRF) and data on manifolds. We demonstrate that this approach has various compelling properties across data modalities, in particular on the canonical tasks of generative modeling, data imputation, novel view synthesis and classification.
    Optimal Transport Tools (OTT): A JAX Toolbox for all things Wasserstein. (arXiv:2201.12324v1 [cs.LG])
    Optimal transport tools (OTT-JAX) is a Python toolbox that can solve optimal transport problems between point clouds and histograms. The toolbox builds on various JAX features, such as automatic and custom reverse mode differentiation, vectorization, just-in-time compilation and accelerators support. The toolbox covers elementary computations, such as the resolution of the regularized OT problem, and more advanced extensions, such as barycenters, Gromov-Wasserstein, low-rank solvers, estimation of convex maps, differentiable generalizations of quantiles and ranks, and approximate OT between Gaussian mixtures. The toolbox code is available at \texttt{https://github.com/ott-jax/ott}
    Rethinking Learning Dynamics in RL using Adversarial Networks. (arXiv:2201.11783v1 [cs.LG])
    We present a learning mechanism for reinforcement learning of closely related skills parameterized via a skill embedding space. Our approach is grounded on the intuition that nothing makes you learn better than a coevolving adversary. The main contribution of our work is to formulate an adversarial training regime for reinforcement learning with the help of entropy-regularized policy gradient formulation. We also adapt existing measures of causal attribution to draw insights from the skills learned. Our experiments demonstrate that the adversarial process leads to a better exploration of multiple solutions and understanding the minimum number of different skills necessary to solve a given set of tasks.
    Eigenvalues of Autoencoders in Training and at Initialization. (arXiv:2201.11813v1 [cs.LG])
    In this paper, we investigate the evolution of autoencoders near their initialization. In particular, we study the distribution of the eigenvalues of the Jacobian matrices of autoencoders early in the training process, training on the MNIST data set. We find that autoencoders that have not been trained have eigenvalue distributions that are qualitatively different from those which have been trained for a long time ($>$100 epochs). Additionally, we find that even at early epochs, these eigenvalue distributions rapidly become qualitatively similar to those of the fully trained autoencoders. We also compare the eigenvalues at initialization to pertinent theoretical work on the eigenvalues of random matrices and the products of such matrices.
    The Challenges of Exploration for Offline Reinforcement Learning. (arXiv:2201.11861v1 [cs.LG])
    Offline Reinforcement Learning (ORL) enablesus to separately study the two interlinked processes of reinforcement learning: collecting informative experience and inferring optimal behaviour. The second step has been widely studied in the offline setting, but just as critical to data-efficient RL is the collection of informative data. The task-agnostic setting for data collection, where the task is not known a priori, is of particular interest due to the possibility of collecting a single dataset and using it to solve several downstream tasks as they arise. We investigate this setting via curiosity-based intrinsic motivation, a family of exploration methods which encourage the agent to explore those states or transitions it has not yet learned to model. With Explore2Offline, we propose to evaluate the quality of collected data by transferring the collected data and inferring policies with reward relabelling and standard offline RL algorithms. We evaluate a wide variety of data collection strategies, including a new exploration agent, Intrinsic Model Predictive Control (IMPC), using this scheme and demonstrate their performance on various tasks. We use this decoupled framework to strengthen intuitions about exploration and the data prerequisites for effective offline RL.
    You Only Cut Once: Boosting Data Augmentation with a Single Cut. (arXiv:2201.12078v1 [cs.CV])
    We present You Only Cut Once (YOCO) for performing data augmentations. YOCO cuts one image into two pieces and performs data augmentations individually within each piece. Applying YOCO improves the diversity of the augmentation per sample and encourages neural networks to recognize objects from partial information. YOCO enjoys the properties of parameter-free, easy usage, and boosting almost all augmentations for free. Thorough experiments are conducted to evaluate its effectiveness. We first demonstrate that YOCO can be seamlessly applied to varying data augmentations, neural network architectures, and brings performance gains on CIFAR and ImageNet classification tasks, sometimes surpassing conventional image-level augmentation by large margins. Moreover, we show YOCO benefits contrastive pre-training toward a more powerful representation that can be better transferred to multiple downstream tasks. Finally, we study a number of variants of YOCO and empirically analyze the performance for respective settings. Code is available at GitHub.
    Optimal Complexity in Decentralized Training. (arXiv:2006.08085v4 [cs.LG] UPDATED)
    Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. Empirically, we compare DeTAG with other decentralized algorithms on image classification tasks, and we show DeTAG enjoys faster convergence compared to baselines, especially on unshuffled data and in sparse networks.
    DELAUNAY: a dataset of abstract art for psychophysical and machine learning research. (arXiv:2201.12123v1 [cs.LG])
    Image datasets are commonly used in psychophysical experiments and in machine learning research. Most publicly available datasets are comprised of images of realistic and natural objects. However, while typical machine learning models lack any domain specific knowledge about natural objects, humans can leverage prior experience for such data, making comparisons between artificial and natural learning challenging. Here, we introduce DELAUNAY, a dataset of abstract paintings and non-figurative art objects labelled by the artists' names. This dataset provides a middle ground between natural images and artificial patterns and can thus be used in a variety of contexts, for example to investigate the sample efficiency of humans and artificial neural networks. Finally, we train an off-the-shelf convolutional neural network on DELAUNAY, highlighting several of its intriguing features.
    Biases in In Silico Evaluation of Molecular Optimization Methods and Bias-Reduced Evaluation Methodology. (arXiv:2201.12163v1 [cs.LG])
    We are interested in in silico evaluation methodology for molecular optimization methods. Given a sample of molecules and their properties of our interest, we wish not only to train an agent that can find molecules optimized with respect to the target property but also to evaluate its performance. A common practice is to train a predictor of the target property on the sample and use it for both training and evaluating the agent. We show that this evaluator potentially suffers from two biases; one is due to misspecification of the predictor and the other to reusing the same sample for training and evaluation. We discuss bias reduction methods for each of the biases comprehensively, and empirically investigate their effectiveness.
    Assessing Classifier Fairness with Collider Bias. (arXiv:2010.03933v2 [cs.LG] UPDATED)
    The increasing application of machine learning techniques in everyday decision-making processes has brought concerns about the fairness of algorithmic decision-making. This paper concerns the problem of collider bias which produces spurious associations in fairness assessment and develops theorems to guide fairness assessment avoiding the collider bias. We consider a real-world application of auditing a trained classifier by an audit agency. We propose an unbiased assessment algorithm by utilising the developed theorems to reduce collider biases in the assessment. Experiments and simulations show the proposed algorithm reduces collider biases significantly in the assessment and is promising in auditing trained classifiers.
    Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions. (arXiv:2201.12296v1 [cs.LG])
    Deep neural networks on 3D point cloud data have been widely used in the real world, especially in safety-critical applications. However, their robustness against corruptions is less studied. In this paper, we present ModelNet40-C, the first comprehensive benchmark on 3D point cloud corruption robustness, consisting of 15 common and realistic corruptions. Our evaluation shows a significant gap between the performances on ModelNet40 and ModelNet40-C for state-of-the-art (SOTA) models. To reduce the gap, we propose a simple but effective method by combining PointCutMix-R and TENT after evaluating a wide range of augmentation and test-time adaptation strategies. We identify a number of critical insights for future studies on corruption robustness in point cloud recognition. For instance, we unveil that Transformer-based architectures with proper training recipes achieve the strongest robustness. We hope our in-depth analysis will motivate the development of robust training strategies or architecture designs in the 3D point cloud domain. Our codebase and dataset are included in https://github.com/jiachens/ModelNet40-C
    Sampling Theorems for Learning from Incomplete Measurements. (arXiv:2201.12151v1 [stat.ML])
    In many real-world settings, only incomplete measurement data are available which can pose a problem for learning. Unsupervised learning of the signal model using a fixed incomplete measurement process is impossible in general, as there is no information in the nullspace of the measurement operator. This limitation can be overcome by using measurements from multiple operators. While this idea has been successfully applied in various applications, a precise characterization of the conditions for learning is still lacking. In this paper, we fill this gap by presenting necessary and sufficient conditions for learning the signal model which indicate the interplay between the number of distinct measurement operators $G$, the number of measurements per operator $m$, the dimension of the model $k$ and the dimension of the signals $n$. In particular, we show that generically unsupervised learning is possible if each operator obtains at least $m>k+n/G$ measurements. Our results are agnostic of the learning algorithm and have implications in a wide range of practical algorithms, from low-rank matrix recovery to deep neural networks.
    Consistent Collaborative Filtering via Tensor Decomposition. (arXiv:2201.11936v1 [cs.IR])
    Collaborative filtering is the de facto standard for analyzing users' activities and building recommendation systems for items. In this work we develop Sliced Anti-symmetric Decomposition (SAD), a new model for collaborative filtering based on implicit feedback. In contrast to traditional techniques where a latent representation of users (user vectors) and items (item vectors) are estimated, SAD introduces one additional latent vector to each item, using a novel three-way tensor view of user-item interactions. This new vector extends user-item preferences calculated by standard dot products to general inner products, producing interactions between items when evaluating their relative preferences. SAD reduces to state-of-the-art (SOTA) collaborative filtering models when the vector collapses to one, while in this paper we allow its value to be estimated from data. The proposed SAD model is simple, resulting in an efficient group stochastic gradient descent (SGD) algorithm. We demonstrate the efficiency of SAD in both simulated and real world datasets containing over 1M user-item interactions. By comparing SAD with seven alternative SOTA collaborative filtering models, we show that SAD is able to more consistently estimate personalized preferences.
    REET: Robustness Evaluation and Enhancement Toolbox for Computational Pathology. (arXiv:2201.12311v1 [cs.LG])
    Motivation: Digitization of pathology laboratories through digital slide scanners and advances in deep learning approaches for objective histological assessment have resulted in rapid progress in the field of computational pathology (CPath) with wide-ranging applications in medical and pharmaceutical research as well as clinical workflows. However, the estimation of robustness of CPath models to variations in input images is an open problem with a significant impact on the down-stream practical applicability, deployment and acceptability of these approaches. Furthermore, development of domain-specific strategies for enhancement of robustness of such models is of prime importance as well. Implementation and Availability: In this work, we propose the first domain-specific Robustness Evaluation and Enhancement Toolbox (REET) for computational pathology applications. It provides a suite of algorithmic strategies for enabling robustness assessment of predictive models with respect to specialized image transformations such as staining, compression, focusing, blurring, changes in spatial resolution, brightness variations, geometric changes as well as pixel-level adversarial perturbations. Furthermore, REET also enables efficient and robust training of deep learning pipelines in computational pathology. REET is implemented in Python and is available at the following URL: https://github.com/alexjfoote/reetoolbox. Contact: Fayyaz.minhas@warwick.ac.uk
    Locally Invariant Explanations: Towards Stable and Unidirectional Explanations through Local Invariant Learning. (arXiv:2201.12143v1 [cs.LG])
    Locally interpretable model agnostic explanations (LIME) method is one of the most popular methods used to explain black-box models at a per example level. Although many variants have been proposed, few provide a simple way to produce high fidelity explanations that are also stable and intuitive. In this work, we provide a novel perspective by proposing a model agnostic local explanation method inspired by the invariant risk minimization (IRM) principle -- originally proposed for (global) out-of-distribution generalization -- to provide such high fidelity explanations that are also stable and unidirectional across nearby examples. Our method is based on a game theoretic formulation where we theoretically show that our approach has a strong tendency to eliminate features where the gradient of the black-box function abruptly changes sign in the locality of the example we want to explain, while in other cases it is more careful and will choose a more conservative (feature) attribution, a behavior which can be highly desirable for recourse. Empirically, we show on tabular, image and text data that the quality of our explanations with neighborhoods formed using random perturbations are much better than LIME and in some cases even comparable to other methods that use realistic neighbors sampled from the data manifold. This is desirable given that learning a manifold to either create realistic neighbors or to project explanations is typically expensive or may even be impossible. Moreover, our algorithm is simple and efficient to train, and can ascertain stable input features for local decisions of a black-box without access to side information such as a (partial) causal graph as has been seen in some recent works.  ( 2 min )
    Limitation of characterizing implicit regularization by data-independent functions. (arXiv:2201.12198v1 [cs.LG])
    In recent years, understanding the implicit regularization of neural networks (NNs) has become a central task of deep learning theory. However, implicit regularization is in itself not completely defined and well understood. In this work, we make an attempt to mathematically define and study the implicit regularization. Importantly, we explore the limitation of a common approach of characterizing the implicit regularization by data-independent functions. We propose two dynamical mechanisms, i.e., Two-point and One-point Overlapping mechanisms, based on which we provide two recipes for producing classes of one-hidden-neuron NNs that provably cannot be fully characterized by a type of or all data-independent functions. Our results signify the profound data-dependency of implicit regularization in general, inspiring us to study in detail the data-dependency of NN implicit regularization in the future.  ( 2 min )
    Local Latent Space Bayesian Optimization over Structured Inputs. (arXiv:2201.11872v1 [cs.LG])
    Bayesian optimization over the latent spaces of deep autoencoder models (DAEs) has recently emerged as a promising new approach for optimizing challenging black-box functions over structured, discrete, hard-to-enumerate search spaces (e.g., molecules). Here the DAE dramatically simplifies the search space by mapping inputs into a continuous latent space where familiar Bayesian optimization tools can be more readily applied. Despite this simplification, the latent space typically remains high-dimensional. Thus, even with a well-suited latent space, these approaches do not necessarily provide a complete solution, but may rather shift the structured optimization problem to a high-dimensional one. In this paper, we propose LOL-BO, which adapts the notion of trust regions explored in recent work on high-dimensional Bayesian optimization to the structured setting. By reformulating the encoder to function as both an encoder for the DAE globally and as a deep kernel for the surrogate model within a trust region, we better align the notion of local optimization in the latent space with local optimization in the input space. LOL-BO achieves as much as 20 times improvement over state-of-the-art latent space Bayesian optimization methods across six real-world benchmarks, demonstrating that improvement in optimization strategies is as important as developing better DAE models.  ( 2 min )
    Simulating surface height and terminus position for marine outlet glaciers using a level set method with data assimilation. (arXiv:2201.12235v1 [physics.ao-ph])
    We implement a data assimilation framework for integrating ice surface and terminus position observations into a numerical ice-flow model. The model uses the well-known shallow shelf approximation (SSA) coupled to a level set method to capture ice motion and changes in the glacier geometry. The level set method explicitly tracks the evolving ice-atmosphere and ice-ocean boundaries for a marine outlet glacier. We use an Ensemble Transform Kalman Filter to assimilate observations of ice surface elevation and lateral ice extent by updating the level set function that describes the ice interface. Numerical experiments on an idealized marine-terminating glacier demonstrate the effectiveness of our data assimilation approach for tracking seasonal and multi-year glacier advance and retreat cycles. The model is also applied to simulate Helheim Glacier, a major tidewater-terminating glacier of the Greenland Ice Sheet that has experienced a recent history of rapid retreat. By assimilating observations from remotely-sensed surface elevation profiles we are able to more accurately track the migrating glacier terminus and glacier surface changes. These results support the use of data assimilation methodologies for obtaining more accurate predictions of short-term ice sheet dynamics.
    Information fusion between knowledge and data in Bayesian network structure learning. (arXiv:2102.00473v3 [cs.AI] UPDATED)
    Bayesian Networks (BNs) have become a powerful technology for reasoning under uncertainty, particularly in areas that require causal assumptions that enable us to simulate the effect of intervention. The graphical structure of these models can be determined by causal knowledge, learnt from data, or a combination of both. While it seems plausible that the best approach in constructing a causal graph involves combining knowledge with machine learning, this approach remains underused in practice. We implement and evaluate 10 knowledge approaches with application to different case studies and BN structure learning algorithms available in the open-source Bayesys structure learning system. The approaches enable us to specify pre-existing knowledge that can be obtained from heterogeneous sources, to constrain or guide structure learning. Each approach is assessed in terms of structure learning effectiveness and efficiency, including graphical accuracy, model fitting, complexity, and runtime; making this the first paper that provides a comparative evaluation of a wide range of knowledge approaches for BN structure learning. Because the value of knowledge depends on what data are available, we illustrate the results both with limited and big data. While the overall results show that knowledge becomes less important with big data due to higher learning accuracy rendering knowledge less important, some of the knowledge approaches are actually found to be more important with big data. Amongst the main conclusions is the observation that reduced search space obtained from knowledge does not always imply reduced computational complexity, perhaps because the relationships implied by the data and knowledge are in tension.
    Speeding up PCA with priming. (arXiv:2109.03709v2 [cs.LG] UPDATED)
    We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we study pPCA by priming it with the power method, Oja's rule and the recently proposed EigenGame algorithm for extremely large scale datasets.
    Gradient Masked Averaging for Federated Learning. (arXiv:2201.11986v1 [cs.LG])
    Federated learning is an emerging paradigm that permits a large number of clients with heterogeneous data to coordinate learning of a unified global model without the need to share data amongst each other. Standard federated learning algorithms involve averaging of model parameters or gradient updates to approximate the global model at the server. However, in heterogeneous settings averaging can result in information loss and lead to poor generalization due to the bias induced by dominant clients. We hypothesize that to generalize better across non-i.i.d datasets as in FL settings, the algorithms should focus on learning the invariant mechanism that is constant while ignoring spurious mechanisms that differ across clients. Inspired from recent work in the Out-of-Distribution literature, we propose a gradient masked averaging approach for federated learning as an alternative to the standard averaging of client updates. This client update aggregation technique can be adapted as a drop-in replacement in most existing federated algorithms. We perform extensive experiments with gradient masked approach on multiple FL algorithms with in-distribution, real-world, and out-of-distribution (as the worst case scenario) test dataset and show that it provides consistent improvements, particularly in the case of heterogeneous clients.  ( 2 min )
    Domain Adaptation for Time Series Forecasting via Attention Sharing. (arXiv:2102.06828v4 [cs.LG] UPDATED)
    Recently, deep neural networks have gained increasing popularity in the field of time series forecasting. A primary reason for their success is their ability to effectively capture complex temporal dynamics across multiple related time series. The advantages of these deep forecasters only start to emerge in the presence of a sufficient amount of data. This poses a challenge for typical forecasting problems in practice, where there is a limited number of time series or observations per time series, or both. To cope with this data scarcity issue, we propose a novel domain adaptation framework, Domain Adaptation Forecaster (DAF). DAF leverages statistical strengths from a relevant domain with abundant data samples (source) to improve the performance on the domain of interest with limited data (target). In particular, we use an attention-based shared module with a domain discriminator across domains and private modules for individual domains. We induce domain-invariant latent features (queries and keys) and retrain domain-specific features (values) simultaneously to enable joint training of forecasters on source and target domains. A main insight is that our design of aligning keys allows the target domain to leverage source time series even with different characteristics. Extensive experiments on various domains demonstrate that our proposed method outperforms state-of-the-art baselines on synthetic and real-world datasets, and ablation studies verify the effectiveness of our design choices.
    Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction. (arXiv:2201.12302v1 [math.OC])
    In this paper, we study the finite-sum convex optimization problem focusing on the general convex case. Recently, the study of variance reduced (VR) methods and their accelerated variants has made exciting progress. However, the step size used in the existing VR algorithms typically depends on the smoothness parameter, which is often unknown and requires tuning in practice. To address this problem, we propose two novel adaptive VR algorithms: Adaptive Variance Reduced Accelerated Extra-Gradient (AdaVRAE) and Adaptive Variance Reduced Accelerated Gradient (AdaVRAG). Our algorithms do not require knowledge of the smoothness parameter. AdaVRAE uses $\mathcal{O}\left(n\log\log n+\sqrt{\frac{n\beta}{\epsilon}}\right)$ gradient evaluations and AdaVRAG uses $\mathcal{O}\left(n\log\log n+\sqrt{\frac{n\beta\log\beta}{\epsilon}}\right)$ gradient evaluations to attain an $\mathcal{O}(\epsilon)$-suboptimal solution, where $n$ is the number of functions in the finite sum and $\beta$ is the smoothness parameter. This result matches the best-known convergence rate of non-adaptive VR methods and it improves upon the convergence of the state of the art adaptive VR method, AdaSVRG. We demonstrate the superior performance of our algorithms compared with previous methods in experiments on real-world datasets.
    BCDAG: An R package for Bayesian structure and Causal learning of Gaussian DAGs. (arXiv:2201.12003v1 [stat.ML])
    Directed Acyclic Graphs (DAGs) provide a powerful framework to model causal relationships among variables in multivariate settings; in addition, through the do-calculus theory, they allow for the identification and estimation of causal effects between variables also from pure observational data. In this setting, the process of inferring the DAG structure from the data is referred to as causal structure learning or causal discovery. We introduce BCDAG, an R package for Bayesian causal discovery and causal effect estimation from Gaussian observational data, implementing the Markov chain Monte Carlo (MCMC) scheme proposed by Castelletti & Mascaro (2021). Our implementation scales efficiently with the number of observations and, whenever the DAGs are sufficiently sparse, with the number of variables in the dataset. The package also provides functions for convergence diagnostics and for visualizing and summarizing posterior inference. In this paper, we present the key features of the underlying methodology along with its implementation in BCDAG. We then illustrate the main functions and algorithms on both real and simulated datasets.
    A Stock Trading System for a Medium Volatile Asset using Multi Layer Perceptron. (arXiv:2201.12286v1 [q-fin.ST])
    Stock market forecasting is a lucrative field of interest with promising profits but not without its difficulties and for some people could be even causes of failure. Financial markets by their nature are complex, non-linear and chaotic, which implies that accurately predicting the prices of assets that are part of it becomes very complicated. In this paper we propose a stock trading system having as main core the feed-forward deep neural networks (DNN) to predict the price for the next 30 days of open market, of the shares issued by Abercrombie & Fitch Co. (ANF) in the stock market of the New York Stock Exchange (NYSE). The system we have elaborated calculates the most effective technical indicator, applying it to the predictions computed by the DNNs, for generating trades. The results showed an increase in values such as Expectancy Ratio of 2.112% of profitable trades with Sharpe, Sortino, and Calmar Ratios of 2.194, 3.340, and 12.403 respectively. As a verification, we adopted a backtracking simulation module in our system, which maps trades to actual test data consisting of the last 30 days of open market on the ANF asset. Overall, the results were promising bringing a total profit factor of 3.2% in just one month from a very modest budget of $100. This was possible because the system reduced the number of trades by choosing the most effective and efficient trades, saving on commissions and slippage costs.
    Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits. (arXiv:2201.11921v1 [cs.LG])
    In this paper, we generalize the concept of heavy-tailed multi-armed bandits to adversarial environments, and develop robust best-of-both-worlds algorithms for heavy-tailed multi-armed bandits (MAB), where losses have $\alpha$-th ($1<\alpha\le 2$) moments bounded by $\sigma^\alpha$, while the variances may not exist. Specifically, we design an algorithm \texttt{HTINF}, when the heavy-tail parameters $\alpha$ and $\sigma$ are known to the agent, \texttt{HTINF} simultaneously achieves the optimal regret for both stochastic and adversarial environments, without knowing the actual environment type a-priori. When $\alpha,\sigma$ are unknown, \texttt{HTINF} achieves a $\log T$-style instance-dependent regret in stochastic cases and $o(T)$ no-regret guarantee in adversarial cases. We further develop an algorithm \texttt{AdaTINF}, achieving $\mathcal O(\sigma K^{1-\nicefrac 1\alpha}T^{\nicefrac{1}{\alpha}})$ minimax optimal regret even in adversarial settings, without prior knowledge on $\alpha$ and $\sigma$. This result matches the known regret lower-bound (Bubeck et al., 2013), which assumed a stochastic environment and $\alpha$ and $\sigma$ are both known. To our knowledge, the proposed \texttt{HTINF} algorithm is the first to enjoy a best-of-both-worlds regret guarantee, and \texttt{AdaTINF} is the first algorithm that can adapt to both $\alpha$ and $\sigma$ to achieve optimal gap-indepedent regret bound in classical heavy-tailed stochastic MAB setting and our novel adversarial formulation.
    Puppeteer: A Random Forest-based Manager for Hardware Prefetchers across the Memory Hierarchy. (arXiv:2201.12027v1 [cs.AR])
    Over the years, processor throughput has steadily increased. However, the memory throughput has not increased at the same rate, which has led to the memory wall problem in turn increasing the gap between effective and theoretical peak processor performance. To cope with this, there has been an abundance of work in the area of data/instruction prefetcher designs. Broadly, prefetchers predict future data/instruction address accesses and proactively fetch data/instructions in the memory hierarchy with the goal of lowering data/instruction access latency. To this end, one or more prefetchers are deployed at each level of the memory hierarchy, but typically, each prefetcher gets designed in isolation without comprehensively accounting for other prefetchers in the system. As a result, individual prefetchers do not always complement each other, and that leads to lower average performance gains and/or many negative outliers. In this work, we propose Puppeteer, which is a hardware prefetcher manager that uses a suite of random forest regressors to determine at runtime which prefetcher should be ON at each level in the memory hierarchy, such that the prefetchers complement each other and we reduce the data/instruction access latency. Compared to a design with no prefetchers, using Puppeteer we improve IPC by 46.0% in 1 Core (1C), 25.8% in 4 Core (4C), and 11.9% in 8 Core (8C) processors on average across traces generated from SPEC2017, SPEC2006, and Cloud suites with ~10KB overhead. Moreover, we also reduce the number of negative outliers by over 89%, and the performance loss of the worst-case negative outlier from 25% to only 5% compared to the state-of-the-art.
    Plug & Play Attacks: Towards Robust and Flexible Model Inversion Attacks. (arXiv:2201.12179v1 [cs.LG])
    Model inversion attacks (MIAs) aim to create synthetic images that reflect the class-wise characteristics from a target classifier's training data by exploiting the model's learned knowledge. Previous research has developed generative MIAs using generative adversarial networks (GANs) as image priors that are tailored to a specific target model. This makes the attacks time- and resource-consuming, inflexible, and susceptible to distributional shifts between datasets. To overcome these drawbacks, we present Plug & Play Attacks that loosen the dependency between the target model and image prior and enable the use of a single trained GAN to attack a broad range of targets with only minor attack adjustments needed. Moreover, we show that powerful MIAs are possible even with publicly available pre-trained GANs and under strong distributional shifts, whereas previous approaches fail to produce meaningful results. Our extensive evaluation confirms the improved robustness and flexibility of Plug & Play Attacks and their ability to create high-quality images revealing sensitive class characteristics.
    Unifying Pairwise Interactions in Complex Dynamics. (arXiv:2201.11941v1 [physics.data-an])
    Scientists have developed hundreds of techniques to measure the interactions between pairs of processes in complex systems. But these computational methods -- from correlation coefficients to causal inference -- rely on distinct quantitative theories that remain largely disconnected. Here we introduce a library of 249 statistics for pairwise interactions and assess their behavior on 1053 multivariate time series from a wide range of real-world and model-generated systems. Our analysis highlights new commonalities between different mathematical formulations, providing a unified picture of a rich, interdisciplinary literature. We then show that leveraging many methods from across science can uncover those most suitable for addressing a given problem, yielding high accuracy and interpretable understanding. Our framework is provided in extendable open software, enabling comprehensive data-driven analysis by integrating decades of methodological advances.
    Consolidated learning -- a domain-specific model-free optimization strategy with examples for XGBoost and MIMIC-IV. (arXiv:2201.11815v1 [cs.LG])
    For many machine learning models, a choice of hyperparameters is a crucial step towards achieving high performance. Prevalent meta-learning approaches focus on obtaining good hyperparameters configurations with a limited computational budget for a completely new task based on the results obtained from the prior tasks. This paper proposes a new formulation of the tuning problem, called consolidated learning, more suited to practical challenges faced by model developers, in which a large number of predictive models are created on similar data sets. In such settings, we are interested in the total optimization time rather than tuning for a single task. We show that a carefully selected static portfolio of hyperparameters yields good results for anytime optimization, maintaining ease of use and implementation. Moreover, we point out how to construct such a portfolio for specific domains. The improvement in the optimization is possible due to more efficient transfer of hyperparameter configurations between similar tasks. We demonstrate the effectiveness of this approach through an empirical study for XGBoost algorithm and the collection of predictive tasks extracted from the MIMIC-IV medical database; however, consolidated learning is applicable in many others fields.
    FCMNet: Full Communication Memory Net]{FCMNet: Full Communication Memory Net for Team-Level Cooperation in Multi-Agent Systems. (arXiv:2201.11994v1 [cs.RO])
    Decentralized cooperation in partially-observable multi-agent systems requires effective communications among agents. To support this effort, this work focuses on the class of problems where global communications are available but may be unreliable, thus precluding differentiable communication learning methods. We introduce FCMNet, a reinforcement learning based approach that allows agents to simultaneously learn a) an effective multi-hop communications protocol and b) a common, decentralized policy that enables team-level decision-making. Specifically, our proposed method utilizes the hidden states of multiple directional recurrent neural networks as communication messages among agents. Using a simple multi-hop topology, we endow each agent with the ability to receive information sequentially encoded by every other agent at each time step, leading to improved global cooperation. We demonstrate FCMNet on a challenging set of StarCraft II micromanagement tasks with shared rewards, as well as a collaborative multi-agent pathfinding task with individual rewards. There, our comparison results show that FCMNet outperforms state-of-the-art communication-based reinforcement learning methods in all StarCraft II micromanagement tasks, and value decomposition methods in certain tasks. We further investigate the robustness of FCMNet under realistic communication disturbances, such as random message loss or binarized messages (i.e., non-differentiable communication channels), to showcase FMCNet's potential applicability to robotic tasks under a variety of real-world conditions.
    Graph autoencoder with constant dimensional latent space. (arXiv:2201.12165v1 [cs.LG])
    Invertible transformation of large graphs into constant dimensional vectors (embeddings) remains a challenge. In this paper we address it with recursive neural networks: The encoder and the decoder. The encoder network transforms embeddings of subgraphs into embeddings of larger subgraphs, and eventually into the embedding of the input graph. The decoder does the opposite. The dimension of the embeddings is constant regardless of the size of the (sub)graphs. Simulation experiments presented in this paper confirm that our proposed graph autoencoder can handle graphs with even thousands of vertices.
    A Secure and Efficient Federated Learning Framework for NLP. (arXiv:2201.11934v1 [cs.CR])
    In this work, we consider the problem of designing secure and efficient federated learning (FL) frameworks. Existing solutions either involve a trusted aggregator or require heavyweight cryptographic primitives, which degrades performance significantly. Moreover, many existing secure FL designs work only under the restrictive assumption that none of the clients can be dropped out from the training protocol. To tackle these problems, we propose SEFL, a secure and efficient FL framework that (1) eliminates the need for the trusted entities; (2) achieves similar and even better model accuracy compared with existing FL designs; (3) is resilient to client dropouts. Through extensive experimental studies on natural language processing (NLP) tasks, we demonstrate that the SEFL achieves comparable accuracy compared to existing FL solutions, and the proposed pruning technique can improve runtime performance up to 13.7x.
    Towards automated optimisation of residual convolutional neural networks for electrocardiogram classification. (arXiv:2112.06024v2 [eess.SP] UPDATED)
    The interpretation of the electrocardiogram (ECG) gives clinical information and helps in assessing heart function. There are distinct ECG patterns associated with a specific class of arrythmia. The convolutional neural network is currently one of the most commonly employed deep learning algorithms for ECG processing. However, deep learning models require many hyperparameters to tune. Selecting an optimal or best hyperparameter for the convolutional neural network algorithm is a highly challenging task. Often, we end up tuning the model manually with different possible ranges of values until a best fit model is obtained. Automatic hyperparameters tuning using Bayesian optimisation (BO) and evolutionary algorithms can provide an effective solution to current labour-intensive manual configuration approaches. In this paper, we propose to optimise the Residual one Dimensional Convolutional Neural Network model (R-1D-CNN) at two levels. At the first level, a residual convolutional layer and one-dimensional convolutional neural layers are trained to learn patient-specific ECG features over which multilayer perceptron layers can learn to produce the final class vectors of each input. This level is manual and aims to lower the search space. The second level is automatic and based on our proposed BO-based algorithm. Our proposed optimised R-1D-CNN architecture is evaluated on two publicly available ECG Datasets. Comparative experimental results demonstrate that our BO-based algorithm achieves an optimal rate of 99.95%, while the baseline model achieves 99.70% for the MIT-BIH database. Moreover, experiments demonstrate that the proposed architecture fine-tuned with BO achieves a higher accuracy than the other proposed architectures. Our optimised architecture achieves excellent results compared to previous works on benchmark datasets.
    Wasserstein Iterative Networks for Barycenter Estimation. (arXiv:2201.12245v1 [cs.LG])
    Wasserstein barycenters have become popular due to their ability to represent the average of probability measures in a geometrically meaningful way. In this paper, we present an algorithm to approximate the Wasserstein-2 barycenters of continuous measures via a generative model. Previous approaches rely on regularization (entropic/quadratic) which introduces bias or on input convex neural networks which are not expressive enough for large-scale tasks. In contrast, our algorithm does not introduce bias and allows using arbitrary neural networks. In addition, based on the celebrity faces dataset, we construct Ave, celeba! dataset which can be used for quantitative evaluation of barycenter algorithms by using standard metrics of generative models such as FID.
    FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients. (arXiv:2201.11865v1 [cs.LG])
    In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solution in such a setting, where only a small part of the model is stored and trained on clients while the remaining large part of the model only stays at the servers. However, the model partitioning employed in split learning introduces a significant amount of communication cost. This paper addresses this issue by compressing the additional communication using a novel clustering scheme accompanied by a gradient correction method. Extensive empirical evaluations on image and text benchmarks show that the proposed method can achieve up to $490\times$ communication cost reduction with minimal drop in accuracy, and enables a desirable performance vs. communication trade-off.
    Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks. (arXiv:2201.12052v1 [cs.LG])
    We study the overparametrization bounds required for the global convergence of stochastic gradient descent algorithm for a class of one hidden layer feed-forward neural networks, considering most of the activation functions used in practice, including ReLU. We improve the existing state-of-the-art results in terms of the required hidden layer width. We introduce a new proof technique combining nonlinear analysis with properties of random initializations of the network. First, we establish the global convergence of continuous solutions of the differential inclusion being a nonsmooth analogue of the gradient flow for the MSE loss. Second, we provide a technical result (working also for general approximators) relating solutions of the aforementioned differential inclusion to the (discrete) stochastic gradient descent sequences, hence establishing linear convergence towards zero loss for the stochastic gradient descent iterations.
    Provably Improving Expert Predictions with Conformal Prediction. (arXiv:2201.12006v1 [cs.LG])
    Automated decision support systems promise to help human experts solve tasks more efficiently and accurately. However, existing systems typically require experts to understand when to cede agency to the system or when to exercise their own agency. Moreover, if the experts develop a misplaced trust in the system, their performance may worsen. In this work, we lift the above requirement and develop automated decision support systems that, by design, do not require experts to understand when to trust them to provably improve their performance. To this end, we focus on multiclass classification tasks and consider automated decision support systems that, for each data sample, use a classifier to recommend a subset of labels to a human expert. We first show that, by looking at the design of such systems from the perspective of conformal prediction, we can ensure that the probability that the recommended subset of labels contains the true label matches almost exactly a target probability value. Then, we identify the set of target probability values under which the human expert is provably better off predicting a label among those in the recommended subset and develop an efficient practical method to find a near-optimal target probability value. Experiments on synthetic and real data demonstrate that our system can help the experts make more accurate predictions and is robust to the accuracy of the classifier it relies on.
    Generative GaitNet. (arXiv:2201.12044v1 [cs.GR])
    Understanding the relation between anatomy andgait is key to successful predictive gait simulation. Inthis paper, we present Generative GaitNet, which isa novel network architecture based on deep reinforce-ment learning for controlling a comprehensive, full-body, musculoskeletal model with 304 Hill-type mus-culotendons. The Generative Gait is a pre-trained, in-tegrated system of artificial neural networks learnedin a 618-dimensional continuous domain of anatomyconditions (e.g., mass distribution, body proportion,bone deformity, and muscle deficits) and gait condi-tions (e.g., stride and cadence). The pre-trained Gait-Net takes anatomy and gait conditions as input andgenerates a series of gait cycles appropriate to theconditions through physics-based simulation. We willdemonstrate the efficacy and expressive power of Gen-erative GaitNet to generate a variety of healthy andpathologic human gaits in real-time physics-based sim-ulation.
    Differential Privacy Guarantees for Stochastic Gradient Langevin Dynamics. (arXiv:2201.11980v1 [stat.ML])
    We analyse the privacy leakage of noisy stochastic gradient descent by modeling R\'enyi divergence dynamics with Langevin diffusions. Inspired by recent work on non-stochastic algorithms, we derive similar desirable properties in the stochastic setting. In particular, we prove that the privacy loss converges exponentially fast for smooth and strongly convex objectives under constant step size, which is a significant improvement over previous DP-SGD analyses. We also extend our analysis to arbitrary sequences of varying step sizes and derive new utility bounds. Last, we propose an implementation and our experiments show the practical utility of our approach compared to classical DP-SGD libraries.
    Sharp Threshold for the Frechet Mean (or Median) of Inhomogeneous Erdos-Renyi Random Graphs. (arXiv:2201.11954v1 [math.PR])
    We address the following foundational question: what is the population, and sample, Frechet mean (or median) graph of an ensemble of inhomogeneous Erdos-Renyi random graphs? We prove that if we use the Hamming distance to compute distances between graphs, then the Frechet mean (or median) graph of an ensemble of inhomogeneous random graphs is obtained by thresholding the expected adjacency matrix of the ensemble. We show that the result also holds for the sample mean (or median) when the population expected adjacency matrix is replaced with the sample mean adjacency matrix. Consequently, the Frechet mean (or median) graph of inhomogeneous Erdos-Renyi random graphs exhibits a sharp threshold: it is either the empty graph, or the complete graph. This novel theoretical result has some significant practical consequences; for instance, the Frechet mean of an ensemble of sparse inhomogeneous random graphs is always the empty graph.
    Iterative Refinement Graph Neural Network for Antibody Sequence-Structure Co-design. (arXiv:2110.04624v3 [q-bio.BM] UPDATED)
    Antibodies are versatile proteins that bind to pathogens like viruses and stimulate the adaptive immune system. The specificity of antibody binding is determined by complementarity-determining regions (CDRs) at the tips of these Y-shaped proteins. In this paper, we propose a generative model to automatically design the CDRs of antibodies with enhanced binding specificity or neutralization capabilities. Previous generative approaches formulate protein design as a structure-conditioned sequence generation task, assuming the desired 3D structure is given a priori. In contrast, we propose to co-design the sequence and 3D structure of CDRs as graphs. Our model unravels a sequence autoregressively while iteratively refining its predicted global structure. The inferred structure in turn guides subsequent residue choices. For efficiency, we model the conditional dependence between residues inside and outside of a CDR in a coarse-grained manner. Our method achieves superior log-likelihood on the test set and outperforms previous baselines in designing antibodies capable of neutralizing the SARS-CoV-2 virus.
    Simplicial Complex Representation Learning. (arXiv:2103.04046v4 [cs.LG] UPDATED)
    Simplicial complexes form an important class of topological spaces that are frequently used in many application areas such as computer-aided design, computer graphics, and simulation. Representation learning on graphs, which are just 1-d simplicial complexes, has witnessed a great attention in recent years. However, there has not been enough effort to extend representation learning to higher dimensional simplicial objects due to the additional complexity these objects hold, especially when it comes to entire-simplicial complex representation learning. In this work, we propose a method for simplicial complex-level representation learning that embeds a simplicial complex to a universal embedding space in a way that complex-to-complex proximity is preserved. Our method uses our novel geometric message passing schemes to learn an entire simplicial complex representation in an end-to-end fashion. We demonstrate the proposed model on publicly available mesh dataset. To the best of our knowledge, this work presents the first method for learning simplicial complex-level representation.
    Denoising Diffusion Restoration Models. (arXiv:2201.11793v1 [eess.IV])
    Many interesting tasks in image restoration can be cast as linear inverse problems. A recent family of approaches for solving these problems uses stochastic algorithms that sample from the posterior distribution of natural images given the measurements. However, efficient solutions often require problem-specific supervised training to model the posterior, whereas unsupervised methods that are not problem-specific typically rely on inefficient iterative methods. This work addresses these issues by introducing Denoising Diffusion Restoration Models (DDRM), an efficient, unsupervised posterior sampling method. Motivated by variational inference, DDRM takes advantage of a pre-trained denoising diffusion generative model for solving any linear inverse problem. We demonstrate DDRM's versatility on several image datasets for super-resolution, deblurring, inpainting, and colorization under various amounts of measurement noise. DDRM outperforms the current leading unsupervised methods on the diverse ImageNet dataset in reconstruction quality, perceptual quality, and runtime, being 5x faster than the nearest competitor. DDRM also generalizes well for natural images out of the distribution of the observed ImageNet training set.
    Safe Policy Improvement Approaches on Discrete Markov Decision Processes. (arXiv:2201.12175v1 [cs.LG])
    Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDP). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state of the art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while the mean performance of algorithms that incorporate the uncertainty as a penalty on the action-value is higher, actively restricting the set of policies more consistently produces good policies and is, thus, safer.
    Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization. (arXiv:2201.12250v1 [cs.LG])
    Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix, they typically require approximations to be computationally tractable. The most successful family of approximations are Kronecker-Factored, block-diagonal curvature estimates (KFAC). Here, we combine tools from prior work to evaluate exact second-order updates with careful ablations to establish a surprising result: Due to its approximations, KFAC is not closely related to second-order updates, and in particular, it significantly outperforms true second-order updates. This challenges widely held believes and immediately raises the question why KFAC performs so well. We answer this question by showing that KFAC approximates a first-order algorithm, which performs gradient descent on neurons rather than weights. Finally, we show that this optimizer often improves over KFAC in terms of computational cost and data-efficiency.
    Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints. (arXiv:2201.11965v1 [cs.LG])
    We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints, which play a central role in ensuring the safety of RL in time-varying environments. In this problem, the reward/utility functions and the state transition functions are both allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain known variation budgets. Designing safe RL algorithms in time-varying environments is particularly challenging because of the need to integrate the constraint violation reduction, safe exploration, and adaptation to the non-stationarity. To this end, we propose a Periodically Restarted Optimistic Primal-Dual Proximal Policy Optimization (PROPD-PPO) algorithm that features three mechanisms: periodic-restart-based policy improvement, dual update with dual regularization, and periodic-restart-based optimistic policy evaluation. We establish a dynamic regret bound and a constraint violation bound for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting. This paper provides the first provably efficient algorithm for non-stationary CMDPs with safe exploration.
    Mixing Implicit and Explicit Deep Learning with Skip DEQs and Infinite Time Neural ODEs (Continuous DEQs). (arXiv:2201.12240v1 [cs.LG])
    Implicit deep learning architectures, like Neural ODEs and Deep Equilibrium Models (DEQs), separate the definition of a layer from the description of its solution process. While implicit layers allow features such as depth to adapt to new scenarios and inputs automatically, this adaptivity makes its computational expense challenging to predict. Numerous authors have noted that implicit layer techniques can be more computationally intensive than explicit layer methods. In this manuscript, we address the question: is there a way to simultaneously achieve the robustness of implicit layers while allowing the reduced computational expense of an explicit layer? To solve this we develop Skip DEQ, an implicit-explicit (IMEX) layer that simultaneously trains an explicit prediction followed by an implicit correction. We show that training this explicit layer is free and even decreases the training time by 2.5x and prediction time by 3.4x. We then further increase the "implicitness" of the DEQ by redefining the method in terms of an infinite time neural ODE which paradoxically decreases the training cost over a standard neural ODE by not requiring backpropagation through time. We demonstrate how the resulting Continuous Skip DEQ architecture trains more robustly than the original DEQ while achieving faster training and prediction times. Together, this manuscript shows how bridging the dichotomy of implicit and explicit deep learning can combine the advantages of both techniques.
    Compositionality-Aware Graph2Seq Learning. (arXiv:2201.12178v1 [cs.LG])
    Graphs are a highly expressive data structure, but it is often difficult for humans to find patterns from a complex graph. Hence, generating human-interpretable sequences from graphs have gained interest, called graph2seq learning. It is expected that the compositionality in a graph can be associated to the compositionality in the output sequence in many graph2seq tasks. Therefore, applying compositionality-aware GNN architecture would improve the model performance. In this study, we adopt the multi-level attention pooling (MLAP) architecture, that can aggregate graph representations from multiple levels of information localities. As a real-world example, we take up the extreme source code summarization task, where a model estimate the name of a program function from its source code. We demonstrate that the model having the MLAP architecture outperform the previous state-of-the-art model with more than seven times fewer parameters than it.
    Graph-based Algorithm Unfolding for Energy-aware Power Allocation in Wireless Networks. (arXiv:2201.11799v1 [eess.SY])
    We develop a novel graph-based trainable framework to maximize the weighted sum energy efficiency (WSEE) for power allocation in wireless communication networks. To address the non-convex nature of the problem, the proposed method consists of modular structures inspired by a classical iterative suboptimal approach and enhanced with learnable components. More precisely, we propose a deep unfolding of the successive concave approximation (SCA) method. In our unfolded SCA (USCA) framework, the originally preset parameters are now learnable via graph convolutional neural networks (GCNs) that directly exploit multi-user channel state information as the underlying graph adjacency matrix. We show the permutation equivariance of the proposed architecture, which promotes generalizability across different network topologies of varying size, density, and channel distribution. The USCA framework is trained through a stochastic gradient descent approach using a progressive training strategy. The unsupervised loss is carefully devised to feature the monotonic property of the objective under maximum power constraints. Comprehensive numerical results demonstrate outstanding performance and robustness of USCA over state-of-the-art benchmarks.
    A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data. (arXiv:2201.12020v1 [stat.ML])
    This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be not robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or come from a non-Gaussian distributions. To overcome this issue, a new expectation maximization algorithm is investigated for mixtures of elliptical distributions with the nice property of handling potential missing data. The complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework thanks to its conditional distribution, which is shown to be a Student distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.
    Adversarial Concept Erasure in Kernel Space. (arXiv:2201.12191v1 [cs.LG])
    The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how human-interpretable concepts, such as gender, are encoded in these representations would improve the ability of users to \emph{control} the content of these representations and analyze the working of the models that rely on them. One prominent approach to the control problem is the identification and removal of linear concept subspaces -- subspaces in the representation space that correspond to a given concept. While those are tractable and interpretable, neural network do not necessarily represent concepts in linear subspaces. We propose a kernalization of the linear concept-removal objective of [Ravfogel et al. 2022], and show that it is effective in guarding against the ability of certain nonlinear adversaries to recover the concept. Interestingly, our findings suggest that the division between linear and nonlinear models is overly simplistic: when considering the concept of binary gender and its neutralization, we do not find a single kernel space that exclusively contains all the concept-related information. It is therefore challenging to protect against \emph{all} nonlinear adversaries at once.
    Overcoming Exploration: Deep Reinforcement Learning in Complex Environments from Temporal Logic Specifications. (arXiv:2201.12231v1 [cs.RO])
    We present a Deep Reinforcement Learning (DRL) algorithm for a task-guided robot with unknown continuous-time dynamics deployed in a large-scale complex environment. Linear Temporal Logic (LTL) is applied to express a rich robotic specification. To overcome the environmental challenge, we propose a novel path planning-guided reward scheme that is dense over the state space, and crucially, robust to infeasibility of computed geometric paths due to the unknown robot dynamics. To facilitate LTL satisfaction, our approach decomposes the LTL mission into sub-tasks that are solved using distributed DRL, where the sub-tasks are trained in parallel, using Deep Policy Gradient algorithms. Our framework is shown to significantly improve performance (effectiveness, efficiency) and exploration of robots tasked with complex missions in large-scale complex environments.
    Shuffle Augmentation of Features from Unlabeled Data for Unsupervised Domain Adaptation. (arXiv:2201.11963v1 [cs.CV])
    Unsupervised Domain Adaptation (UDA), a branch of transfer learning where labels for target samples are unavailable, has been widely researched and developed in recent years with the help of adversarially trained models. Although existing UDA algorithms are able to guide neural networks to extract transferable and discriminative features, classifiers are merely trained under the supervision of labeled source data. Given the inevitable discrepancy between source and target domains, the classifiers can hardly be aware of the target classification boundaries. In this paper, Shuffle Augmentation of Features (SAF), a novel UDA framework, is proposed to address the problem by providing the classifier with supervisory signals from target feature representations. SAF learns from the target samples, adaptively distills class-aware target features, and implicitly guides the classifier to find comprehensive class borders. Demonstrated by extensive experiments, the SAF module can be integrated into any existing adversarial UDA models to achieve performance improvements.
    Toward Training at ImageNet Scale with Differential Privacy. (arXiv:2201.12328v1 [cs.LG])
    Differential privacy (DP) is the de facto standard for training machine learning (ML) models, including neural networks, while ensuring the privacy of individual examples in the training set. Despite a rich literature on how to train ML models with differential privacy, it remains extremely challenging to train real-life, large neural networks with both reasonable accuracy and privacy. We set out to investigate how to do this, using ImageNet image classification as a poster example of an ML task that is very challenging to resolve accurately with DP right now. This paper shares initial lessons from our effort, in the hope that it will inspire and inform other researchers to explore DP training at scale. We show approaches which help to make DP training faster, as well as model types and settings of the training process that tend to work better for DP. Combined, the methods we discuss let us train a Resnet-18 with differential privacy to 47.9% accuracy and privacy parameters $\epsilon = 10, \delta = 10^{-6}$, a significant improvement over "naive" DP-SGD training of Imagenet models but a far cry from the $75\%$ accuracy that can be obtained by the same network without privacy. We share our code at https://github.com/google-research/dp-imagenet calling for others to join us in moving the needle further on DP at scale.
    Training invariances and the low-rank phenomenon: beyond linear networks. (arXiv:2201.11968v1 [cs.LG])
    The implicit bias induced by the training of neural networks has become a topic of rigorous study. In the limit of gradient flow and gradient descent with appropriate step size, it has been shown that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices. In this paper, we extend this theoretical result to the much wider class of nonlinear ReLU-activated feedforward networks containing fully-connected layers and skip connections. To the best of our knowledge, this is the first time a low-rank phenomenon is proven rigorously for these architectures, and it reflects empirical results in the literature. The proof relies on specific local training invariances, sometimes referred to as alignment, which we show to hold for a wide set of ReLU architectures. Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
    Feature Visualization within an Automated Design Assessment leveraging Explainable Artificial Intelligence Methods. (arXiv:2201.12107v1 [cs.AI])
    Not only automation of manufacturing processes but also automation of automation procedures itself become increasingly relevant to automation research. In this context, automated capability assessment, mainly leveraged by deep learning systems driven from 3D CAD data, have been presented. Current assessment systems may be able to assess CAD data with regards to abstract features, e.g. the ability to automatically separate components from bulk goods, or the presence of gripping surfaces. Nevertheless, they suffer from the factor of black box systems, where an assessment can be learned and generated easily, but without any geometrical indicator about the reasons of the system's decision. By utilizing explainable AI (xAI) methods, we attempt to open up the black box. Explainable AI methods have been used in order to assess whether a neural network has successfully learned a given task or to analyze which features of an input might lead to an adversarial attack. These methods aim to derive additional insights into a neural network, by analyzing patterns from a given input and its impact to the network output. Within the NeuroCAD Project, xAI methods are used to identify geometrical features which are associated with a certain abstract feature. Within this work, a sensitivity analysis (SA), the layer-wise relevance propagation (LRP), the Gradient-weighted Class Activation Mapping (Grad-CAM) method as well as the Local Interpretable Model-Agnostic Explanations (LIME) have been implemented in the NeuroCAD environment, allowing not only to assess CAD models but also to identify features which have been relevant for the network decision. In the medium run, this might enable to identify regions of interest supporting product designers to optimize their models with regards to assembly processes.
    Higher Order Correlation Analysis for Multi-View Learning. (arXiv:2201.11949v1 [cs.LG])
    Multi-view learning is frequently used in data science. The pairwise correlation maximization is a classical approach for exploring the consensus of multiple views. Since the pairwise correlation is inherent for two views, the extensions to more views can be diversified and the intrinsic interconnections among views are generally lost. To address this issue, we propose to maximize higher order correlations. This can be formulated as a low rank approximation problem with the higher order correlation tensor of multi-view data. We use the generating polynomial method to solve the low rank approximation problem. Numerical results on real multi-view data demonstrate that this method consistently outperforms prior existing methods.
    Time-Series Anomaly Detection with Implicit Neural Representation. (arXiv:2201.11950v1 [cs.LG])
    Detecting anomalies in multivariate time-series data is essential in many real-world applications. Recently, various deep learning-based approaches have shown considerable improvements in time-series anomaly detection. However, existing methods still have several limitations, such as long training time due to their complex model designs or costly tuning procedures to find optimal hyperparameters (e.g., sliding window length) for a given dataset. In our paper, we propose a novel method called Implicit Neural Representation-based Anomaly Detection (INRAD). Specifically, we train a simple multi-layer perceptron that takes time as input and outputs corresponding values at that time. Then we utilize the representation error as an anomaly score for detecting anomalies. Experiments on five real-world datasets demonstrate that our proposed method outperforms other state-of-the-art methods in performance, training speed, and robustness.
    Simplifying deflation for non-convex optimization with applications in Bayesian inference and topology optimization. (arXiv:2201.11926v1 [math.OC])
    Non-convex optimization problems have multiple local optimal solutions. Non-convex optimization problems are commonly found in numerous applications. One of the methods recently proposed to efficiently explore multiple local optimal solutions without random re-initialization relies on the concept of deflation. In this paper, different ways to use deflation in non-convex optimization and nonlinear system solving are discussed. A simple, general and novel deflation constraint is proposed to enable the use of deflation together with existing nonlinear programming solvers or nonlinear system solvers. The connection between the proposed deflation constraint and a minimum distance constraint is presented. Additionally, a number of variations of deflation constraints and their limitations are discussed. Finally, a number of applications of the proposed methodology in the fields of approximate Bayesian inference and topology optimization are presented.
    Approximately Equivariant Networks for Imperfectly Symmetric Dynamics. (arXiv:2201.11969v1 [cs.LG])
    Incorporating symmetry as an inductive bias into neural network architecture has led to improvements in generalization, data efficiency, and physical consistency in dynamics modeling. Methods such as CNN or equivariant neural networks use weight tying to enforce symmetries such as shift invariance or rotational equivariance. However, despite the fact that physical laws obey many symmetries, real-world dynamical data rarely conforms to strict mathematical symmetry either due to noisy or incomplete data or to symmetry breaking features in the underlying dynamical system. We explore approximately equivariant networks which are biased towards preserving symmetry but are not strictly constrained to do so. By relaxing equivariance constraints, we find that our models can outperform both baselines with no symmetry bias and baselines with overly strict symmetry in both simulated turbulence domains and real-world multi-stream jet flow.
    DynaMixer: A Vision MLP Architecture with Dynamic Mixing. (arXiv:2201.12083v1 [cs.CV])
    Recently, MLP-like vision models have achieved promising performances on mainstream visual recognition tasks. In contrast with vision transformers and CNNs, the success of MLP-like models shows that simple information fusion operations among tokens and channels can yield a good representation power for deep recognition models. However, existing MLP-like models fuse tokens through static fusion operations, lacking adaptability to the contents of the tokens to be mixed. Thus, customary information fusion procedures are not effective enough. To this end, this paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion. Critically, we propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed. To reduce the time complexity and improve the robustness, a dimensionality reduction technique and a multi-segment fusion mechanism are adopted. Our proposed DynaMixer model (97M parameters) achieves 84.3\% top-1 accuracy on the ImageNet-1K dataset without extra training data, performing favorably against the state-of-the-art vision MLP models. When the number of parameters is reduced to 26M, it still achieves 82.7\% top-1 accuracy, surpassing the existing MLP-like models with a similar capacity. The implementation of DynaMixer will be made available to the public.
    Self-paced learning to improve text row detection in historical documents with missing lables. (arXiv:2201.12216v1 [cs.CV])
    An important preliminary step of optical character recognition systems is the detection of text rows. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm capable of improving the row detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order with respect to the number of ground-truth bounding boxes, and organize them into k batches. Using our self-paced learning method, we train a row detector over k iterations, progressively adding batches with less ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-bounding boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and we include the resulting annotations at the next training iteration. We demonstrate that our self-paced learning strategy brings significant performance gains on two data sets of historical documents, improving the average precision of YOLOv4 with more than 12% on one data set and 39% on the other.  ( 2 min )
    Multiscale Graph Comparison via the Embedded Laplacian Distance. (arXiv:2201.12064v1 [stat.ML])
    We introduce a simple and fast method for comparing graphs of different sizes. Existing approaches are often either limited to comparing graphs with the same number of vertices or are computationally unscalable. We propose the Embedded Laplacian Distance (ELD) for comparing graphs of potentially vastly different sizes. Our approach first projects the graphs onto a common, low-dimensional Laplacian embedding space that respects graphical structure. This reduces the problem to that of comparing point clouds in a Euclidean space. A distance can then be computed efficiently via a natural sliced Wasserstein approach. We show that the ELD is a pseudo-metric and is invariant under graph isomorphism. We provide intuitive interpretations of the ELD using tools from spectral graph theory. We test the efficacy of the ELD approach extensively on both simulated and real data. Results obtained are excellent.  ( 2 min )
    Using Constant Learning Rate of Two Time-Scale Update Rule for Training Generative Adversarial Networks. (arXiv:2201.11989v1 [cs.LG])
    Previous numerical results have shown that a two time-scale update rule (TTUR) using constant learning rates is practically useful for training generative adversarial networks (GANs). Meanwhile, a theoretical analysis of TTUR to find a stationary local Nash equilibrium of a Nash equilibrium problem with two players, a discriminator and a generator, has been given using decaying learning rates. In this paper, we give a theoretical analysis of TTUR using constant learning rates to bridge the gap between theory and practice. In particular, we show that, for TTUR using constant learning rates, the number of steps needed to find a stationary local Nash equilibrium decreases as the batch size increases. We also provide numerical results to support our theoretical analyzes.  ( 2 min )
    Bioinspired Cortex-based Fast Codebook Generation. (arXiv:2201.12322v1 [cs.LG])
    A major archetype of artificial intelligence is developing algorithms facilitating temporal efficiency and accuracy while boosting the generalization performance. Even with the latest developments in machine learning, a key limitation has been the inefficient feature extraction from the initial data, which is essential in performance optimization. Here, we introduce a feature extraction method inspired by sensory cortical networks in the brain. Dubbed as bioinspired cortex, the algorithm provides convergence to orthogonal features from streaming signals with superior computational efficiency while processing data in compressed form. We demonstrate the performance of the new algorithm using artificially created complex data by comparing it with the commonly used traditional clustering algorithms, such as Birch, GMM, and K-means. While the data processing time is significantly reduced, seconds versus hours, encoding distortions remain essentially the same in the new algorithm providing a basis for better generalization. Although we show herein the superior performance of the cortex model in clustering and vector quantization, it also provides potent implementation opportunities for machine learning fundamental components, such as reasoning, anomaly detection and classification in large scope applications, e.g., finance, cybersecurity, and healthcare.  ( 2 min )
    FedProf: Selective Federated Learning with Representation Profiling. (arXiv:2102.01733v9 [cs.LG] UPDATED)
    Federated Learning (FL) has shown great potential as a privacy-preserving solution to learning from decentralized data that are only accessible to end devices (i.e., clients). In many scenarios, however, a large proportion of the clients are probably in possession of low-quality data that are biased, noisy or even irrelevant. As a result, they could significantly slow down the convergence of the global model we aim to build and also compromise its quality. In light of this, we propose FedProf, a novel algorithm for optimizing FL under such circumstances without breaching data privacy. The key of our approach is a distributional representation profiling and matching scheme that uses the global model to dynamically profile data representations and allows for low-cost, lightweight representation matching. Based on the scheme we adaptively score each client and adjust its participation probability so as to mitigate the impact of low-value clients on the training process. We have conducted extensive experiments on public datasets using various FL settings. The results show that the selective behaviour of our algorithm leads to a significant reduction in the number of communication rounds and the amount of time (up to 2.4x speedup) for the global model to converge and also provides accuracy gain.  ( 2 min )
    Learning Proximal Operators to Discover Multiple Optima. (arXiv:2201.11945v1 [cs.LG])
    Finding multiple solutions of non-convex optimization problems is a ubiquitous yet challenging task. Typical existing solutions either apply single-solution optimization methods from multiple random initial guesses or search in the vicinity of found solutions using ad hoc heuristics. We present an end-to-end method to learn the proximal operator across a family of non-convex problems, which can then be used to recover multiple solutions for unseen problems at test time. Our method only requires access to the objectives without needing the supervision of ground truth solutions. Notably, the added proximal regularization term elevates the convexity of our formulation: by applying recent theoretical results, we show that for weakly-convex objectives and under mild regularity conditions, training of the proximal operator converges globally in the over-parameterized setting. We further present a benchmark for multi-solution optimization including a wide range of applications and evaluate our method to demonstrate its effectiveness.  ( 2 min )
    Recursive Decoding: A Situated Cognition Approach to Compositional Generation in Grounded Language Understanding. (arXiv:2201.11766v1 [cs.CL])
    Compositional generalization is a troubling blind spot for neural language models. Recent efforts have presented techniques for improving a model's ability to encode novel combinations of known inputs, but less work has focused on generating novel combinations of known outputs. Here we focus on this latter "decode-side" form of generalization in the context of gSCAN, a synthetic benchmark for compositional generalization in grounded language understanding. We present Recursive Decoding (RD), a novel procedure for training and using seq2seq models, targeted towards decode-side generalization. Rather than generating an entire output sequence in one pass, models are trained to predict one token at a time. Inputs (i.e., the external gSCAN environment) are then incrementally updated based on predicted tokens, and re-encoded for the next decoder time step. RD thus decomposes a complex, out-of-distribution sequence generation task into a series of incremental predictions that each resemble what the model has already seen during training. RD yields dramatic improvement on two previously neglected generalization tasks in gSCAN. We provide analyses to elucidate these gains over failure of a baseline, and then discuss implications for generalization in naturalistic grounded language understanding, and seq2seq more generally.  ( 2 min )
    Kernel Interpolation as a Bayes Point Machine. (arXiv:2110.04274v2 [cs.LG] UPDATED)
    A Bayes point machine is a single classifier that approximates the majority decision of an ensemble of classifiers. This paper observes that kernel interpolation is a Bayes point machine for Gaussian process classification. This observation facilitates the transfer of results from both ensemble theory as well as an area of convex geometry known as Brunn-Minkowski theory to derive PAC-Bayes risk bounds for kernel interpolation. Since large margin, infinite width neural networks are kernel interpolators, the paper's findings may help to explain generalisation in neural networks more broadly. Supporting this idea, the paper finds evidence that large margin, finite width neural networks behave like Bayes point machines too.  ( 2 min )
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v1 [cs.LG])
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.  ( 2 min )
    Summarizing Differences between Text Distributions with Natural Language. (arXiv:2201.12323v1 [cs.CL])
    How do two distributions of texts differ? Humans are slow at answering this, since discovering patterns might require tediously reading through hundreds of samples. We propose to automatically summarize the differences by "learning a natural language hypothesis": given two distributions $D_{0}$ and $D_{1}$, we search for a description that is more often true for $D_{1}$, e.g., "is military-related." To tackle this problem, we fine-tune GPT-3 to propose descriptions with the prompt: "[samples of $D_{0}$] + [samples of $D_{1}$] + the difference between them is _____". We then re-rank the descriptions by checking how often they hold on a larger set of samples with a learned verifier. On a benchmark of 54 real-world binary classification tasks, while GPT-3 Curie (13B) only generates a description similar to human annotation 7% of the time, the performance reaches 61% with fine-tuning and re-ranking, and our best system using GPT-3 Davinci (175B) reaches 76%. We apply our system to describe distribution shifts, debug dataset shortcuts, summarize unknown tasks, and label text clusters, and present analyses based on automatically generated descriptions.  ( 2 min )
    Joint Differentiable Optimization and Verification for Certified Reinforcement Learning. (arXiv:2201.12243v1 [cs.LG])
    In model-based reinforcement learning for safety-critical control systems, it is important to formally certify system properties (e.g., safety, stability) under the learned controller. However, as existing methods typically apply formal verification \emph{after} the controller has been learned, it is sometimes difficult to obtain any certificate, even after many iterations between learning and verification. To address this challenge, we propose a framework that jointly conducts reinforcement learning and formal verification by formulating and solving a novel bilevel optimization problem, which is differentiable by the gradients from the value function and certificates. Experiments on a variety of examples demonstrate the significant advantages of our framework over the model-based stochastic value gradient (SVG) method and the model-free proximal policy optimization (PPO) method in finding feasible controllers with barrier functions and Lyapunov functions that ensure system safety and stability.  ( 2 min )
    The Effect of Diversity in Meta-Learning. (arXiv:2201.11775v1 [cs.LG])
    Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that task distribution plays a vital role in the model's performance. Conventional wisdom is that task diversity should improve the performance of meta-learning. In this work, we find evidence to the contrary; we study different task distributions on a myriad of models and datasets to evaluate the effect of task diversity on meta-learning algorithms. For this experiment, we train on multiple datasets, and with three broad classes of meta-learning models - Metric-based (i.e., Protonet, Matching Networks), Optimization-based (i.e., MAML, Reptile, and MetaOptNet), and Bayesian meta-learning models (i.e., CNAPs). Our experiments demonstrate that the effect of task diversity on all these algorithms follows a similar trend, and task diversity does not seem to offer any benefits to the learning of the model. Furthermore, we also demonstrate that even a handful of tasks, repeated over multiple batches, would be sufficient to achieve a performance similar to uniform sampling and draws into question the need for additional tasks to create better models.  ( 2 min )
    With Greater Distance Comes Worse Performance: On the Perspective of Layer Utilization and Model Generalization. (arXiv:2201.11939v1 [cs.LG])
    Generalization of deep neural networks remains one of the main open problems in machine learning. Previous theoretical works focused on deriving tight bounds of model complexity, while empirical works revealed that neural networks exhibit double descent with respect to both training sample counts and the neural network size. In this paper, we empirically examined how different layers of neural networks contribute differently to the model; we found that early layers generally learn representations relevant to performance on both training data and testing data. Contrarily, deeper layers only minimize training risks and fail to generalize well with testing or mislabeled data. We further illustrate the distance of trained weights to its initial value of final layers has high correlation to generalization errors and can serve as an indicator of an overfit of model. Moreover, we show evidence to support post-training regularization by re-initializing weights of final layers. Our findings provide an efficient method to estimate the generalization capability of neural networks, and the insight of those quantitative results may inspire derivation to better generalization bounds that take the internal structure of neural networks into consideration.  ( 2 min )
    Star Temporal Classification: Sequence Classification with Partially Labeled Data. (arXiv:2201.12208v1 [cs.LG])
    We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC) which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition. These experiments show that STC can recover most of the performance of supervised baseline when up to 70% of the labels are missing. We also perform experiments in handwriting recognition to show that our method easily applies to other sequence classification tasks.  ( 2 min )
    Task-Aware Network Coding Over Butterfly Network. (arXiv:2201.11917v1 [cs.IT])
    Network coding allows distributed information sources such as sensors to efficiently compress and transmit data to distributed receivers across a bandwidth-limited network. Classical network coding is largely task-agnostic -- the coding schemes mainly aim to faithfully reconstruct data at the receivers, regardless of what ultimate task the received data is used for. In this paper, we analyze a new task-driven network coding problem, where distributed receivers pass transmitted data through machine learning (ML) tasks, which provides an opportunity to improve efficiency by transmitting salient task-relevant data representations. Specifically, we formulate a task-aware network coding problem over a butterfly network in real-coordinate space, where lossy analog compression through principal component analysis (PCA) can be applied. A lower bound for the total loss function for the formulated problem is given, and necessary and sufficient conditions for achieving this lower bound are also provided. We introduce ML algorithms to solve the problem in the general case, and our evaluation demonstrates the effectiveness of task-aware network coding.  ( 2 min )
    Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures. (arXiv:2110.13179v5 [cs.LG] UPDATED)
    Hierarchical forecasting problems arise when time series have a natural group structure, and predictions at multiple levels of aggregation and disaggregation across the groups are needed. In such problems, it is often desired to satisfy the aggregation constraints in a given hierarchy, referred to as hierarchical coherence in the literature. Maintaining hierarchical coherence while producing accurate forecasts can be a challenging problem, especially in the case of probabilistic forecasting. We present a novel method capable of accurate and coherent probabilistic forecasts for hierarchical time series. We call it Deep Poisson Mixture Network (DPMN). It relies on the combination of neural networks and a statistical model for the joint distribution of the hierarchical multivariate time series structure. By construction, the model guarantees hierarchical coherence and provides simple rules for aggregation and disaggregation of the predictive distributions. We perform an extensive empirical evaluation comparing the DPMN to other state-of-the-art methods which produce hierarchically coherent probabilistic forecasts on multiple public datasets. Compared to existing coherent probabilistic models, we obtained a relative improvement in the overall Continuous Ranked Probability Score (CRPS) of 17.1% on Australian domestic tourism data, 24.2 on the Favorita grocery sales dataset, and 6.9% on a San Francisco Bay Area highway traffic dataset.  ( 2 min )
    The CARE Dataset for Affective Response Detection. (arXiv:2201.11895v1 [cs.LG])
    Social media plays an increasing role in our communication with friends and family, and our consumption of information and entertainment. Hence, to design effective ranking functions for posts on social media, it would be useful to predict the affective response to a post (e.g., whether the user is likely to be humored, inspired, angered, informed). Similar to work on emotion recognition (which focuses on the affect of the publisher of the post), the traditional approach to recognizing affective response would involve an expensive investment in human annotation of training data. We introduce CARE$_{db}$, a dataset of 230k social media posts annotated according to 7 affective responses using the Common Affective Response Expression (CARE) method. The CARE method is a means of leveraging the signal that is present in comments that are posted in response to a post, providing high-precision evidence about the affective response of the readers to the post without human annotation. Unlike human annotation, the annotation process we describe here can be iterated upon to expand the coverage of the method, particularly for new affective responses. We present experiments that demonstrate that the CARE annotations compare favorably with crowd-sourced annotations. Finally, we use CARE$_{db}$ to train competitive BERT-based models for predicting affective response as well as emotion detection, demonstrating the utility of the dataset for related tasks.  ( 2 min )
    Geometric instability of out of distribution data across autoencoder architecture. (arXiv:2201.11902v1 [cs.LG])
    We study the map learned by a family of autoencoders trained on MNIST, and evaluated on ten different data sets created by the random selection of pixel values according to ten different distributions. Specifically, we study the eigenvalues of the Jacobians defined by the weight matrices of the autoencoder at each training and evaluation point. For high enough latent dimension, we find that each autoencoder reconstructs all the evaluation data sets as similar \emph{generalized characters}, but that this reconstructed \emph{generalized character} changes across autoencoder. Eigenvalue analysis shows that even when the reconstructed image appears to be an MNIST character for all out of distribution data sets, not all have latent representations that are close to the latent representation of MNIST characters. All told, the eigenvalue analysis demonstrated a great deal of geometric instability of the autoencoder both as a function on out of distribution inputs, and across architectures on the same set of inputs.  ( 2 min )
    Estimating Potential Outcome Distributions with Collaborating Causal Networks. (arXiv:2110.01664v2 [stat.ML] UPDATED)
    Many causal inference approaches have focused on identifying an individual's outcome change due to a potential treatment, or the individual treatment effect (ITE), from observational studies. Rather than only estimating the ITE, we propose Collaborating Causal Networks (CCN) to estimate the full potential outcome distributions. This modification facilitates estimating the utility of each treatment and allows for individual variation in utility functions (e.g., variability in risk tolerance). We show that CCN learns distributions that asymptotically capture the correct potential outcome distributions under standard causal inference assumptions. Furthermore, we develop a new adjustment approach that is empirically effective in alleviating sample imbalance between treatment groups in observational studies. We evaluate CCN by extensive empirical experiments and demonstrate improved distribution estimates compared to existing Bayesian and Generative Adversarial Network-based methods. Additionally, CCN empirically improves decisions over a variety of utility functions.  ( 2 min )
    Can Wikipedia Help Offline Reinforcement Learning?. (arXiv:2201.12122v1 [cs.LG])
    Fine-tuning reinforcement learning (RL) models has been challenging because of a lack of large scale off-the-shelf datasets as well as high variance in transferability among different environments. Recent work has looked at tackling offline RL from the perspective of sequence modeling with improved results as result of the introduction of the Transformer architecture. However, when the model is trained from scratch, it suffers from slow convergence speeds. In this paper, we look to take advantage of this formulation of reinforcement learning as sequence modeling and investigate the transferability of pre-trained sequence models on other domains (vision, language) when finetuned on offline RL tasks (control, games). To this end, we also propose techniques to improve transfer between these domains. Results show consistent performance gains in terms of both convergence speed and reward on a variety of environments, accelerating training by 3-6x and achieving state-of-the-art performance in a variety of tasks using Wikipedia-pretrained and GPT2 language models. We hope that this work not only brings light to the potentials of leveraging generic sequence modeling techniques and pre-trained models for RL, but also inspires future work on sharing knowledge between generative modeling tasks of completely different domains.  ( 2 min )
    Solving a percolation inverse problem with a divide-and-concur algorithm. (arXiv:2201.12222v1 [cond-mat.dis-nn])
    We present a percolation inverse problem for diode networks: Given information about which pairs of nodes allow current to percolate from one to the other, can one construct a diode network consistent with the observed currents? We implement a divide-and-concur iterative projection method for solving the problem and demonstrate the supremacy of our method over an exhaustive approach for nontrivial instances of the problem. We find that the problem is most difficult when some but not all of the percolation data are hidden, and that the most difficult networks to reconstruct generally are those for which the currents are most sensitive to the addition or removal of a single diode.  ( 2 min )
    Approximate Bayesian Computation with Domain Expert in the Loop. (arXiv:2201.12090v1 [stat.ML])
    Approximate Bayesian computation (ABC) is a popular likelihood-free inference method for models with intractable likelihood functions. As ABC methods usually rely on comparing summary statistics of observed and simulated data, the choice of the statistics is crucial. This choice involves a trade-off between loss of information and dimensionality reduction, and is often determined based on domain knowledge. However, handcrafting and selecting suitable statistics is a laborious task involving multiple trial-and-error steps. In this work, we introduce an active learning method for ABC statistics selection which reduces the domain expert's work considerably. By involving the experts, we are able to handle misspecified models, unlike the existing dimension reduction methods. Moreover, empirical results show better posterior estimates than with existing methods, when the simulation budget is limited.  ( 2 min )
    Constraint-based graph network simulator. (arXiv:2112.09161v2 [cs.LG] UPDATED)
    In the area of physical simulations, nearly all neural-network-based methods directly predict future states from the input states. However, many traditional simulation engines instead model the constraints of the system and select the state which satisfies them. Here we present a framework for constraint-based learned simulation, where a scalar constraint function is implemented as a graph neural network, and future predictions are computed by solving the optimization problem defined by the learned constraint. Our model achieves comparable or better accuracy to top learned simulators on a variety of challenging physical domains, and offers several unique advantages. We can improve the simulation accuracy on a larger system by applying more solver iterations at test time. We also can incorporate novel hand-designed constraints at test time and simulate new dynamics which were not present in the training data. Our constraint-based framework shows how key techniques from traditional simulation and numerical methods can be leveraged as inductive biases in machine learning simulators.  ( 2 min )
    Mask-based Latent Reconstruction for Reinforcement Learning. (arXiv:2201.12096v1 [cs.LG])
    For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance. However, in practice, limited experience and high-dimensional input prevent effective representation learning. To address this, motivated by the success of masked modeling in other research fields, we introduce mask-based reconstruction to promote state representation learning in RL. Specifically, we propose a simple yet effective self-supervised method, Mask-based Latent Reconstruction (MLR), to predict the complete state representations in the latent space from the observations with spatially and temporally masked pixels. MLR enables the better use of context information when learning state representations to make them more informative, which facilitates RL agent training. Extensive experiments show that our MLR significantly improves the sample efficiency in RL and outperforms the state-of-the-art sample-efficient RL methods on multiple continuous benchmark environments.  ( 2 min )
    Stochastic Batch Acquisition for Deep Active Learning. (arXiv:2106.12059v2 [cs.LG] UPDATED)
    We provide a stochastic strategy for adapting well-known acquisition functions to allow batch active learning. In deep active learning, labels are often acquired in batches for efficiency. However, many acquisition functions are designed for single-sample acquisition and fail when naively used to construct batches. In contrast, state-of-the-art batch acquisition functions are costly to compute. We show how to extend single-sample acquisition functions to the batch setting. Instead of acquiring the top-K points from the pool set, we account for the fact that acquisition scores are expected to change as new points are acquired. This motivates simple stochastic acquisition strategies using score-based or rank-based distributions. Our strategies outperform the standard top-K acquisition with virtually no computational overhead and can be used as a drop-in replacement. In fact, they are even competitive with much more expensive methods despite their linear computational complexity. We conclude that there is no reason to use top-K batch acquisition in practice.  ( 2 min )
    On feedforward control using physics-guided neural networks: Training cost regularization and optimized initialization. (arXiv:2201.12088v1 [cs.LG])
    Performance of model-based feedforward controllers is typically limited by the accuracy of the inverse system dynamics model. Physics-guided neural networks (PGNN), where a known physical model cooperates in parallel with a neural network, were recently proposed as a method to achieve high accuracy of the identified inverse dynamics. However, the flexible nature of neural networks can create overparameterization when employed in parallel with a physical model, which results in a parameter drift during training. This drift may result in parameters of the physical model not corresponding to their physical values, which increases vulnerability of the PGNN to operating conditions not present in the training data. To address this problem, this paper proposes a regularization method via identified physical parameters, in combination with an optimized training initialization that improves training convergence. The regularized PGNN framework is validated on a real-life industrial linear motor, where it delivers better tracking accuracy and extrapolation.  ( 2 min )
    Transfer Learning In Differential Privacy's Hybrid-Model. (arXiv:2201.12018v1 [cs.LG])
    The hybrid-model (Avent et al 2017) in Differential Privacy is a an augmentation of the local-model where in addition to N local-agents we are assisted by one special agent who is in fact a curator holding the sensitive details of n additional individuals. Here we study the problem of machine learning in the hybrid-model where the n individuals in the curators dataset are drawn from a different distribution than the one of the general population (the local-agents). We give a general scheme -- Subsample-Test-Reweigh -- for this transfer learning problem, which reduces any curator-model DP-learner to a hybrid-model learner in this setting using iterative subsampling and reweighing of the n examples held by the curator based on a smooth variation of the Multiplicative-Weights algorithm (introduced by Bun et al, 2020). Our scheme has a sample complexity which relies on the chi-squared divergence between the two distributions. We give worst-case analysis bounds on the sample complexity required for our private reduction. Aiming to reduce said sample complexity, we give two specific instances our sample complexity can be drastically reduced (one instance is analyzed mathematically, while the other - empirically) and pose several directions for follow-up work.  ( 2 min )
    Measure Estimation in the Barycentric Coding Model. (arXiv:2201.12195v1 [stat.ML])
    This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycenteric coordinates. We provide novel geometrical, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result then establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed empirically via i.i.d. samples. We prove precise rates of convergence for this algorithm -- determined by the smoothness of the underlying measures and their dimensionality -- thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.  ( 2 min )
    Conditional Deep Inverse Rosenblatt Transports. (arXiv:2106.04170v2 [stat.ML] UPDATED)
    We present a novel offline-online method to mitigate the computational burden of Bayesian inference, particularly in the regime where the posterior densities are computationally demanding to evaluate while real-time inference results are needed. In the offline phase, the proposed method learns the joint law of the parameter random variables and the observable random variables in the tensor-train (TT) format. Then, in the online phase, the resulting order-preserving transport can be conditioned on newly observed data to characterize the posterior random variables in real-time. Compared with the state-of-the-art normalizing flows techniques, our proposed method relies on function approximation, for which we can provide a thorough performance analysis. The function approximation perspective allows us to significantly improve the capability of transport maps in challenging problems with high-dimensional observations and high-dimensional parameters. Capitalizing on this, we present novel heuristics to either reorder or reparametrize the variables to enhance the approximation power of TT. We then integrate the TT-based transport maps and the parameter reordering/reparametrization into a layered composite map to further improve the performance of the resulting inference. We demonstrate the efficiency of the proposed method on various statistical learning tasks involving ordinary differential equations (ODEs) and partial differential equations (PDEs).  ( 2 min )
    A DNN Based Post-Filter to Enhance the Quality of Coded Speech in MDCT Domain. (arXiv:2201.12039v1 [eess.AS])
    Frequency domain processing, and in particular the use of Modified Discrete Cosine Transform (MDCT), is the most widespread approach to audio coding. However, at low bitrates, audio quality, especially for speech, degrades drastically due to the lack of available bits to directly code the transform coefficients. Traditionally, post-filtering has been used to mitigate artefacts in the coded speech by exploiting a-priori information of the source and extra transmitted parameters. Recently, data-driven post-filters have shown better results, but at the cost of significant additional complexity and delay. In this work, we propose a mask-based post-filter operating directly in MDCT domain of the codec, inducing no extra delay. The real-valued mask is applied to the quantized MDCT coefficients and is estimated from a relatively lightweight convolutional encoder-decoder network. Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at lowest possible bitrate of 16 kbps. Objective and subjective assessments clearly show the advantage of this approach over the conventional post-filter, with an average improvement of 10 MUSHRA points over the LC3 coded speech.  ( 2 min )
    Learning to Simulate Unseen Physical Systems with Graph Neural Networks. (arXiv:2201.11976v1 [cs.LG])
    Simulation of the dynamics of physical systems is essential to the development of both science and engineering. Recently there is an increasing interest in learning to simulate the dynamics of physical systems using neural networks. However, existing approaches fail to generalize to physical substances not in the training set, such as liquids with different viscosities or elastomers with different elasticities. Here we present a machine learning method embedded with physical priors and material parameters, which we term as "Graph-based Physics Engine" (GPE), to efficiently model the physical dynamics of different substances in a wide variety of scenarios. We demonstrate that GPE can generalize to materials with different properties not seen in the training set and perform well from single-step predictions to multi-step roll-out simulations. In addition, introducing the law of momentum conservation in the model significantly improves the efficiency and stability of learning, allowing convergence to better models with fewer training steps.
  • Open

    Deep Deterministic Uncertainty: A Simple Baseline. (arXiv:2102.11582v3 [cs.LG] UPDATED)
    Reliable uncertainty from deterministic single-forward pass models is sought after because conventional methods of uncertainty quantification are computationally expensive. We take two complex single-forward-pass uncertainty approaches, DUQ and SNGP, and examine whether they mainly rely on a well-regularized feature space. Crucially, without using their more complex methods for estimating uncertainty, a single softmax neural net with such a feature-space, achieved via residual connections and spectral normalization, *outperforms* DUQ and SNGP's epistemic uncertainty predictions using simple Gaussian Discriminant Analysis *post-training* as a separate feature-space density estimator -- without fine-tuning on OoD data, feature ensembling, or input pre-procressing. This conceptually simple *Deep Deterministic Uncertainty (DDU)* baseline can also be used to disentangle aleatoric and epistemic uncertainty and performs as well as Deep Ensembles, the state-of-the art for uncertainty prediction, on several OoD benchmarks (CIFAR-10/100 vs SVHN/Tiny-ImageNet, ImageNet vs ImageNet-O) as well as in active learning settings across different model architectures, yet is *computationally cheaper*.
    Provably Improving Expert Predictions with Conformal Prediction. (arXiv:2201.12006v1 [cs.LG])
    Automated decision support systems promise to help human experts solve tasks more efficiently and accurately. However, existing systems typically require experts to understand when to cede agency to the system or when to exercise their own agency. Moreover, if the experts develop a misplaced trust in the system, their performance may worsen. In this work, we lift the above requirement and develop automated decision support systems that, by design, do not require experts to understand when to trust them to provably improve their performance. To this end, we focus on multiclass classification tasks and consider automated decision support systems that, for each data sample, use a classifier to recommend a subset of labels to a human expert. We first show that, by looking at the design of such systems from the perspective of conformal prediction, we can ensure that the probability that the recommended subset of labels contains the true label matches almost exactly a target probability value. Then, we identify the set of target probability values under which the human expert is provably better off predicting a label among those in the recommended subset and develop an efficient practical method to find a near-optimal target probability value. Experiments on synthetic and real data demonstrate that our system can help the experts make more accurate predictions and is robust to the accuracy of the classifier it relies on.
    The Hessian Screening Rule. (arXiv:2104.13026v2 [stat.ML] UPDATED)
    Predictor screening rules, which discard predictors from the design matrix before fitting a model, have had considerable impact on the speed with which l1-regularized regression problems, such as the lasso, can be solved. Current state-of-the-art screening rules, however, have difficulties in dealing with highly-correlated predictors, often becoming too conservative. In this paper, we present a new screening rule to deal with this issue: the Hessian Screening Rule. The rule uses second-order information from the model to provide more accurate screening as well as higher-quality warm starts. The proposed rule outperforms all studied alternatives on data sets with high correlation for both l1-regularized least-squares (the lasso) and logistic regression. It also performs best overall on the real data sets that we examine.
    Differentially Private Coordinate Descent for Composite Empirical Risk Minimization. (arXiv:2110.11688v2 [cs.LG] UPDATED)
    Machine learning models can leak information about the data used to train them. To mitigate this issue, Differentially Private (DP) variants of optimization algorithms like Stochastic Gradient Descent (DP-SGD) have been designed to trade-off utility for privacy in Empirical Risk Minimization (ERM) problems. In this paper, we propose Differentially Private proximal Coordinate Descent (DP-CD), a new method to solve composite DP-ERM problems. We derive utility guarantees through a novel theoretical analysis of inexact coordinate descent. Our results show that, thanks to larger step sizes, DP-CD can exploit imbalance in gradient coordinates to outperform DP-SGD. We also prove new lower bounds for composite DP-ERM under coordinate-wise regularity assumptions, that are nearly matched by DP-CD. For practical implementations, we propose to clip gradients using coordinate-wise thresholds that emerge from our theory, avoiding costly hyperparameter tuning. Experiments on real and synthetic data support our results, and show that DP-CD compares favorably with DP-SGD.
    Arbitrage-Free Implied Volatility Surface Generation with Variational Autoencoders. (arXiv:2108.04941v3 [q-fin.MF] UPDATED)
    We propose a hybrid method for generating arbitrage-free implied volatility (IV) surfaces consistent with historical data by combining model-free Variational Autoencoders (VAEs) with continuous time stochastic differential equation (SDE) driven models. We focus on two classes of SDE models: regime switching models and L\'evy additive processes. By projecting historical surfaces onto the space of SDE model parameters, we obtain a distribution on the parameter subspace faithful to the data on which we then train a VAE. Arbitrage-free IV surfaces are then generated by sampling from the posterior distribution on the latent space, decoding to obtain SDE model parameters, and finally mapping those parameters to IV surfaces. We further refine the VAE model by including conditional features and demonstrate its superior generative out-of-sample performance.
    Sliced-Wasserstein Gradient Flows. (arXiv:2110.10972v2 [cs.LG] UPDATED)
    Minimizing functionals in the space of probability distributions can be done with Wasserstein gradient flows. To solve them numerically, a possible approach is to rely on the Jordan-Kinderlehrer-Otto (JKO) scheme which is analogous to the proximal scheme in Euclidean spaces. However, it requires solving a nested optimization problem at each iteration, and is known for its computational challenges, especially in high dimension. To alleviate it, very recent works propose to approximate the JKO scheme leveraging Brenier's theorem, and using gradients of Input Convex Neural Networks to parameterize the density (JKO-ICNN). However, this method comes with a high computational cost and stability issues. Instead, this work proposes to use gradient flows in the space of probability measures endowed with the sliced-Wasserstein (SW) distance. We argue that this method is more flexible than JKO-ICNN, since SW enjoys a closed-form differentiable approximation. Thus, the density at each step can be parameterized by any generative model which alleviates the computational burden and makes it tractable in higher dimensions.
    Neural forecasting at scale. (arXiv:2109.09705v4 [cs.LG] UPDATED)
    We study the problem of efficiently scaling ensemble-based deep neural networks for multi-step time series (TS) forecasting on a large set of time series. Current state-of-the-art deep ensemble models have high memory and computational requirements, hampering their use to forecast millions of TS in practical scenarios. We propose N-BEATS(P), a global parallel variant of the N-BEATS model designed to allow simultaneous training of multiple univariate TS forecasting models. Our model addresses the practical limitations of related models, reducing the training time by half and memory requirement by a factor of 5, while keeping the same level of accuracy in all TS forecasting settings. We have performed multiple experiments detailing the various ways to train our model and have obtained results that demonstrate its capacity to generalize in various forecasting conditions and setups.
    Stochastic Batch Acquisition for Deep Active Learning. (arXiv:2106.12059v2 [cs.LG] UPDATED)
    We provide a stochastic strategy for adapting well-known acquisition functions to allow batch active learning. In deep active learning, labels are often acquired in batches for efficiency. However, many acquisition functions are designed for single-sample acquisition and fail when naively used to construct batches. In contrast, state-of-the-art batch acquisition functions are costly to compute. We show how to extend single-sample acquisition functions to the batch setting. Instead of acquiring the top-K points from the pool set, we account for the fact that acquisition scores are expected to change as new points are acquired. This motivates simple stochastic acquisition strategies using score-based or rank-based distributions. Our strategies outperform the standard top-K acquisition with virtually no computational overhead and can be used as a drop-in replacement. In fact, they are even competitive with much more expensive methods despite their linear computational complexity. We conclude that there is no reason to use top-K batch acquisition in practice.  ( 2 min )
    Model-Advantage and Value-Aware Models for Model-Based Reinforcement Learning: Bridging the Gap in Theory and Practice. (arXiv:2106.14080v2 [cs.LG] UPDATED)
    This work shows that value-aware model learning, known for its numerous theoretical benefits, is also practically viable for solving challenging continuous control tasks in prevalent model-based reinforcement learning algorithms. First, we derive a novel value-aware model learning objective by bounding the model-advantage i.e. model performance difference, between two MDPs or models given a fixed policy, achieving superior performance to prior value-aware objectives in most continuous control environments. Second, we identify the issue of stale value estimates in naively substituting value-aware objectives in place of maximum-likelihood in dyna-style model-based RL algorithms. Our proposed remedy to this issue bridges the long-standing gap in theory and practice of value-aware model learning by enabling successful deployment of all value-aware objectives in solving several continuous control robotic manipulation and locomotion tasks. Our results are obtained with minimal modifications to two popular and open-source model-based RL algorithms -- SLBO and MBPO, without tuning any existing hyper-parameters, while also demonstrating better performance of value-aware objectives than these baseline in some environments.  ( 2 min )
    Parameter Sharing For Heterogeneous Agents in Multi-Agent Reinforcement Learning. (arXiv:2005.13625v7 [cs.LG] UPDATED)
    Parameter sharing, where each agent independently learns a policy with fully shared parameters between all policies, is a popular baseline method for multi-agent deep reinforcement learning. Unfortunately, since all agents share the same policy network, they cannot learn different policies or tasks. This issue has been circumvented experimentally by adding an agent-specific indicator signal to observations, which we term "agent indication." Agent indication is limited, however, in that without modification it does not allow parameter sharing to be applied to environments where the action spaces and/or observation spaces are heterogeneous. This work formalizes the notion of agent indication and proves that it enables convergence to optimal policies for the first time. Next, we formally introduce methods to extend parameter sharing to learning in heterogeneous observation and action spaces, and prove that these methods allow for convergence to optimal policies. Finally, we experimentally confirm that the methods we introduce function empirically, and conduct a wide array of experiments studying the empirical efficacy of many different agent indication schemes for graphical observation spaces.  ( 2 min )
    Multiscale Graph Comparison via the Embedded Laplacian Distance. (arXiv:2201.12064v1 [stat.ML])
    We introduce a simple and fast method for comparing graphs of different sizes. Existing approaches are often either limited to comparing graphs with the same number of vertices or are computationally unscalable. We propose the Embedded Laplacian Distance (ELD) for comparing graphs of potentially vastly different sizes. Our approach first projects the graphs onto a common, low-dimensional Laplacian embedding space that respects graphical structure. This reduces the problem to that of comparing point clouds in a Euclidean space. A distance can then be computed efficiently via a natural sliced Wasserstein approach. We show that the ELD is a pseudo-metric and is invariant under graph isomorphism. We provide intuitive interpretations of the ELD using tools from spectral graph theory. We test the efficacy of the ELD approach extensively on both simulated and real data. Results obtained are excellent.  ( 2 min )
    Equivariant Manifold Flows. (arXiv:2107.08596v2 [stat.ML] UPDATED)
    Tractably modelling distributions over manifolds has long been an important goal in the natural sciences. Recent work has focused on developing general machine learning models to learn such distributions. However, for many applications these distributions must respect manifold symmetries -- a trait which most previous models disregard. In this paper, we lay the theoretical foundations for learning symmetry-invariant distributions on arbitrary manifolds via equivariant manifold flows. We demonstrate the utility of our approach by using it to learn gauge invariant densities over $SU(n)$ in the context of quantum field theory.  ( 2 min )
    Interplay between depth of neural networks and locality of target functions. (arXiv:2201.12082v1 [cs.LG])
    It has been recognized that heavily overparameterized deep neural networks (DNNs) exhibit surprisingly good generalization performance in various machine-learning tasks. Although benefits of depth have been investigated from different perspectives such as the approximation theory and the statistical learning theory, existing theories do not adequately explain the empirical success of overparameterized DNNs. In this work, we report a remarkable interplay between depth and locality of a target function. We introduce $k$-local and $k$-global functions, and find that depth is beneficial for learning local functions but detrimental to learning global functions. This interplay is not properly captured by the neural tangent kernel, which describes an infinitely wide neural network within the lazy learning regime.  ( 2 min )
    Conditional Deep Inverse Rosenblatt Transports. (arXiv:2106.04170v2 [stat.ML] UPDATED)
    We present a novel offline-online method to mitigate the computational burden of Bayesian inference, particularly in the regime where the posterior densities are computationally demanding to evaluate while real-time inference results are needed. In the offline phase, the proposed method learns the joint law of the parameter random variables and the observable random variables in the tensor-train (TT) format. Then, in the online phase, the resulting order-preserving transport can be conditioned on newly observed data to characterize the posterior random variables in real-time. Compared with the state-of-the-art normalizing flows techniques, our proposed method relies on function approximation, for which we can provide a thorough performance analysis. The function approximation perspective allows us to significantly improve the capability of transport maps in challenging problems with high-dimensional observations and high-dimensional parameters. Capitalizing on this, we present novel heuristics to either reorder or reparametrize the variables to enhance the approximation power of TT. We then integrate the TT-based transport maps and the parameter reordering/reparametrization into a layered composite map to further improve the performance of the resulting inference. We demonstrate the efficiency of the proposed method on various statistical learning tasks involving ordinary differential equations (ODEs) and partial differential equations (PDEs).  ( 2 min )
    Fast Interpretable Greedy-Tree Sums (FIGS). (arXiv:2201.11931v1 [cs.LG])
    Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in many problems. Here, we propose Fast Interpretable Greedy-Tree Sums (FIGS), an algorithm for fitting concise rule-based models. Specifically, FIGS generalizes the CART algorithm to simultaneously grow a flexible number of trees in a summation. The total number of splits across all the trees can be restricted by a pre-specified threshold, thereby keeping both the size and number of its trees under control. When both are small, the fitted tree-sum can be easily visualized and written out by hand, making it highly interpretable. A partially oracle theoretical result hints at the potential for FIGS to overcome a key weakness of single-tree models by disentangling additive components of generative additive models, thereby reducing redundancy from repeated splits on the same feature. Furthermore, given oracle access to optimal tree structures, we obtain L2 generalization bounds for such generative models in the case of C1 component functions, matching known minimax rates in some cases. Extensive experiments across a wide array of real-world datasets show that FIGS achieves state-of-the-art prediction performance (among all popular rule-based methods) when restricted to just a few splits (e.g. less than 20). We find empirically that FIGS is able to avoid repeated splits, and often provides more concise decision rules than fitted decision trees, without sacrificing predictive performance. All code and models are released in a full-fledged package on Github \url{https://github.com/csinva/imodels}.  ( 2 min )
    Lossy Compression for Lossless Prediction. (arXiv:2106.10800v5 [cs.LG] UPDATED)
    Most data is automatically collected and only ever "seen" by algorithms. Yet, data compressors preserve perceptual fidelity rather than just the information needed by algorithms performing downstream tasks. In this paper, we characterize the bit-rate required to ensure high performance on all predictive tasks that are invariant under a set of transformations, such as data augmentations. Based on our theory, we design unsupervised objectives for training neural compressors. Using these objectives, we train a generic image compressor that achieves substantial rate savings (more than $1000\times$ on ImageNet) compared to JPEG on 8 datasets, without decreasing downstream classification performance.  ( 2 min )
    Simplicial Complex Representation Learning. (arXiv:2103.04046v4 [cs.LG] UPDATED)
    Simplicial complexes form an important class of topological spaces that are frequently used in many application areas such as computer-aided design, computer graphics, and simulation. Representation learning on graphs, which are just 1-d simplicial complexes, has witnessed a great attention in recent years. However, there has not been enough effort to extend representation learning to higher dimensional simplicial objects due to the additional complexity these objects hold, especially when it comes to entire-simplicial complex representation learning. In this work, we propose a method for simplicial complex-level representation learning that embeds a simplicial complex to a universal embedding space in a way that complex-to-complex proximity is preserved. Our method uses our novel geometric message passing schemes to learn an entire simplicial complex representation in an end-to-end fashion. We demonstrate the proposed model on publicly available mesh dataset. To the best of our knowledge, this work presents the first method for learning simplicial complex-level representation.  ( 2 min )
    Optimal Complexity in Decentralized Training. (arXiv:2006.08085v4 [cs.LG] UPDATED)
    Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. Empirically, we compare DeTAG with other decentralized algorithms on image classification tasks, and we show DeTAG enjoys faster convergence compared to baselines, especially on unshuffled data and in sparse networks.  ( 2 min )
    Speeding up PCA with priming. (arXiv:2109.03709v2 [cs.LG] UPDATED)
    We introduce primed-PCA (pPCA), a two-step algorithm for speeding up the approximation of principal components. This algorithm first runs any approximate-PCA method to get an initial estimate of the principal components (priming), and then applies an exact PCA in the subspace they span. Since this subspace is of small dimension in any practical use, the second step is extremely cheap computationally. Nonetheless, it improves accuracy significantly for a given computational budget across datasets. In this setup, the purpose of the priming is to narrow down the search space, and prepare the data for the second step, an exact calculation. We show formally that pPCA improves upon the priming algorithm under very mild conditions, and we provide experimental validation on both synthetic and real large-scale datasets showing that it systematically translates to improved performance. In our experiments we study pPCA by priming it with the power method, Oja's rule and the recently proposed EigenGame algorithm for extremely large scale datasets.  ( 2 min )
    Evaluating Local Explanations using White-box Models. (arXiv:2106.02488v2 [cs.LG] UPDATED)
    Evaluating explanation techniques using human subjects is costly, time-consuming and can lead to subjectivity in the assessments. To evaluate the accuracy of local explanations, we require access to the true feature importance scores for a given instance. However, the prediction function of a model usually does not decompose into linear additive terms that indicate how much a feature contributes to the output. In this work, we suggest to instead focus on the log odds ratio (LOR) of the prediction function, which naturally decomposes into additive terms for logistic regression and naive Bayes. We demonstrate how we can benchmark different explanation techniques in terms of their similarity to the LOR scores based on our proposed approach. In the experiments, we compare prominent local explanation techniques and find that the performance of the techniques can depend on the underlying model, the dataset, which data point is explained, the normalization of the data and the similarity metric.  ( 2 min )
    Sampling Theorems for Learning from Incomplete Measurements. (arXiv:2201.12151v1 [stat.ML])
    In many real-world settings, only incomplete measurement data are available which can pose a problem for learning. Unsupervised learning of the signal model using a fixed incomplete measurement process is impossible in general, as there is no information in the nullspace of the measurement operator. This limitation can be overcome by using measurements from multiple operators. While this idea has been successfully applied in various applications, a precise characterization of the conditions for learning is still lacking. In this paper, we fill this gap by presenting necessary and sufficient conditions for learning the signal model which indicate the interplay between the number of distinct measurement operators $G$, the number of measurements per operator $m$, the dimension of the model $k$ and the dimension of the signals $n$. In particular, we show that generically unsupervised learning is possible if each operator obtains at least $m>k+n/G$ measurements. Our results are agnostic of the learning algorithm and have implications in a wide range of practical algorithms, from low-rank matrix recovery to deep neural networks.  ( 2 min )
    Domain Adaptation for Time Series Forecasting via Attention Sharing. (arXiv:2102.06828v4 [cs.LG] UPDATED)
    Recently, deep neural networks have gained increasing popularity in the field of time series forecasting. A primary reason for their success is their ability to effectively capture complex temporal dynamics across multiple related time series. The advantages of these deep forecasters only start to emerge in the presence of a sufficient amount of data. This poses a challenge for typical forecasting problems in practice, where there is a limited number of time series or observations per time series, or both. To cope with this data scarcity issue, we propose a novel domain adaptation framework, Domain Adaptation Forecaster (DAF). DAF leverages statistical strengths from a relevant domain with abundant data samples (source) to improve the performance on the domain of interest with limited data (target). In particular, we use an attention-based shared module with a domain discriminator across domains and private modules for individual domains. We induce domain-invariant latent features (queries and keys) and retrain domain-specific features (values) simultaneously to enable joint training of forecasters on source and target domains. A main insight is that our design of aligning keys allows the target domain to leverage source time series even with different characteristics. Extensive experiments on various domains demonstrate that our proposed method outperforms state-of-the-art baselines on synthetic and real-world datasets, and ablation studies verify the effectiveness of our design choices.  ( 2 min )
    Learning Parametrised Graph Shift Operators. (arXiv:2101.10050v3 [cs.LG] UPDATED)
    In many domains data is currently represented as graphs and therefore, the graph representation of this data becomes increasingly important in machine learning. Network data is, implicitly or explicitly, always represented using a graph shift operator (GSO) with the most common choices being the adjacency, Laplacian matrices and their normalisations. In this paper, a novel parametrised GSO (PGSO) is proposed, where specific parameter values result in the most commonly used GSOs and message-passing operators in graph neural network (GNN) frameworks. The PGSO is suggested as a replacement of the standard GSOs that are used in state-of-the-art GNN architectures and the optimisation of the PGSO parameters is seamlessly included in the model training. It is proved that the PGSO has real eigenvalues and a set of real eigenvectors independent of the parameter values and spectral bounds on the PGSO are derived. PGSO parameters are shown to adapt to the sparsity of the graph structure in a study on stochastic blockmodel networks, where they are found to automatically replicate the GSO regularisation found in the literature. On several real-world datasets the accuracy of state-of-the-art GNN architectures is improved by the inclusion of the PGSO in both node- and graph-classification tasks.  ( 2 min )
    Estimating Potential Outcome Distributions with Collaborating Causal Networks. (arXiv:2110.01664v2 [stat.ML] UPDATED)
    Many causal inference approaches have focused on identifying an individual's outcome change due to a potential treatment, or the individual treatment effect (ITE), from observational studies. Rather than only estimating the ITE, we propose Collaborating Causal Networks (CCN) to estimate the full potential outcome distributions. This modification facilitates estimating the utility of each treatment and allows for individual variation in utility functions (e.g., variability in risk tolerance). We show that CCN learns distributions that asymptotically capture the correct potential outcome distributions under standard causal inference assumptions. Furthermore, we develop a new adjustment approach that is empirically effective in alleviating sample imbalance between treatment groups in observational studies. We evaluate CCN by extensive empirical experiments and demonstrate improved distribution estimates compared to existing Bayesian and Generative Adversarial Network-based methods. Additionally, CCN empirically improves decisions over a variety of utility functions.  ( 2 min )
    Learning Summary Statistics for Bayesian Inference with Autoencoders. (arXiv:2201.12059v1 [cs.LG])
    For stochastic models with intractable likelihood functions, approximate Bayesian computation offers a way of approximating the true posterior through repeated comparisons of observations with simulated model outputs in terms of a small set of summary statistics. These statistics need to retain the information that is relevant for constraining the parameters but cancel out the noise. They can thus be seen as thermodynamic state variables, for general stochastic models. For many scientific applications, we need strictly more summary statistics than model parameters to reach a satisfactory approximation of the posterior. Therefore, we propose to use the inner dimension of deep neural network based Autoencoders as summary statistics. To create an incentive for the encoder to encode all the parameter-related information but not the noise, we give the decoder access to explicit or implicit information on the noise that has been used to generate the training data. We validate the approach empirically on two types of stochastic models.  ( 2 min )
    Deep Neural Networks for Rank-Consistent Ordinal Regression Based On Conditional Probabilities. (arXiv:2111.08851v2 [cs.LG] UPDATED)
    In recent times, deep neural networks achieved outstanding predictive performance on various classification and pattern recognition tasks. However, many real-world prediction problems have ordinal response variables, and this ordering information is ignored by conventional classification losses such as the multi-category cross-entropy. Ordinal regression methods for deep neural networks address this. One such method is the CORAL method, which is based on an earlier binary label extension framework and achieves rank consistency among its output layer tasks by imposing a weight-sharing constraint. However, while earlier experiments showed that CORAL's rank consistency is beneficial for performance, the weight-sharing constraint could severely restrict the expressiveness of a deep neural network. In this paper, we propose an alternative method for rank-consistent ordinal regression that does not require a weight-sharing constraint in a neural network's fully connected output layer. We achieve this rank consistency by a novel training scheme using conditional training sets to obtain the unconditional rank probabilities through applying the chain rule for conditional probability distributions. Experiments on various datasets demonstrate the efficacy of the proposed method to utilize the ordinal target information, and the absence of the weight-sharing restriction improves the performance substantially compared to the CORAL reference approach.  ( 2 min )
    Wasserstein Iterative Networks for Barycenter Estimation. (arXiv:2201.12245v1 [cs.LG])
    Wasserstein barycenters have become popular due to their ability to represent the average of probability measures in a geometrically meaningful way. In this paper, we present an algorithm to approximate the Wasserstein-2 barycenters of continuous measures via a generative model. Previous approaches rely on regularization (entropic/quadratic) which introduces bias or on input convex neural networks which are not expressive enough for large-scale tasks. In contrast, our algorithm does not introduce bias and allows using arbitrary neural networks. In addition, based on the celebrity faces dataset, we construct Ave, celeba! dataset which can be used for quantitative evaluation of barycenter algorithms by using standard metrics of generative models such as FID.  ( 2 min )
    Differential Privacy of Dirichlet Posterior Sampling. (arXiv:2110.01984v2 [cs.CR] UPDATED)
    Besides the Laplace distribution and the Gaussian distribution, there are many more probability distributions that are not well-understood in terms of privacy-preserving property -- one of which is the Dirichlet distribution. In this work, we study the inherent privacy of releasing a single draw from a Dirichlet posterior distribution (the Dirichlet posterior sampling). As our main result, we provide a simple privacy guarantee of the Dirichlet posterior sampling with the framework of R\'enyi Differential Privacy (RDP). Consequently, the RDP guarantee allows us to derive a simpler form of the $(\varepsilon,\delta)$-differential privacy guarantee compared to those from the previous work. As an application, we use the RDP guarantee to derive a utility guarantee of the Dirichlet posterior sampling for privately releasing a normalized histogram, which is confirmed by our experimental results. Moreover, we demonstrate that the RDP guarantee can be used to track the privacy loss in Bayesian reinforcement learning.  ( 2 min )
    Increasing the skill of short-term wind speed ensemble forecasts combining forecasts and observations via a new dynamic calibration. (arXiv:2201.12234v1 [physics.ao-ph])
    All numerical weather prediction models used for the wind industry need to produce their forecasts starting from the main synoptic hours 00, 06, 12, and 18 UTC, once the analysis becomes available. The six-hour latency time between two consecutive model runs calls for strategies to fill the gap by providing new accurate predictions having, at least, hourly frequency. This is done to accommodate the request of frequent, accurate and fresh information from traders and system regulators to continuously adapt their work strategies. Here, we propose a strategy where quasi-real time observed wind speed and weather model predictions are combined by means of a novel Ensemble Model Output Statistics (EMOS) strategy. The success of our strategy is measured by comparisons against observed wind speed from SYNOP stations over Italy in the years 2018 and 2019.  ( 2 min )
    Conditional COT-GAN for Video Prediction with Kernel Smoothing. (arXiv:2106.05658v2 [stat.ML] UPDATED)
    Causal Optimal Transport (COT) results from imposing a temporal causality constraint on classic optimal transport problems, which naturally generates a new concept of distances between distributions on path spaces. The first application of the COT theory for sequential learning was given in Xu et al. (2020), where COT-GAN was introduced as an adversarial algorithm to train implicit generative models optimized for producing sequential data. Relying on (Xu et al., 2020), the contribution of the present paper is twofold. First, we develop a conditional version of COT-GAN suitable for sequence prediction. This means that the dataset is now used in order to learn how a sequence will evolve given the observation of its past evolution. Second, we improve on the convergence results by working with modifications of the empirical measures via kernel smoothing due to (Pflug and Pichler (2016)). The resulting kernel conditional COT-GAN algorithm is illustrated with an application for video prediction.  ( 2 min )
    A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data. (arXiv:2201.12020v1 [stat.ML])
    This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be not robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or come from a non-Gaussian distributions. To overcome this issue, a new expectation maximization algorithm is investigated for mixtures of elliptical distributions with the nice property of handling potential missing data. The complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework thanks to its conditional distribution, which is shown to be a Student distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.  ( 2 min )
    Biclustering with Alternating K-Means. (arXiv:2009.04550v3 [cs.LG] UPDATED)
    Biclustering is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing block-diagonal biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.  ( 2 min )
    Local Latent Space Bayesian Optimization over Structured Inputs. (arXiv:2201.11872v1 [cs.LG])
    Bayesian optimization over the latent spaces of deep autoencoder models (DAEs) has recently emerged as a promising new approach for optimizing challenging black-box functions over structured, discrete, hard-to-enumerate search spaces (e.g., molecules). Here the DAE dramatically simplifies the search space by mapping inputs into a continuous latent space where familiar Bayesian optimization tools can be more readily applied. Despite this simplification, the latent space typically remains high-dimensional. Thus, even with a well-suited latent space, these approaches do not necessarily provide a complete solution, but may rather shift the structured optimization problem to a high-dimensional one. In this paper, we propose LOL-BO, which adapts the notion of trust regions explored in recent work on high-dimensional Bayesian optimization to the structured setting. By reformulating the encoder to function as both an encoder for the DAE globally and as a deep kernel for the surrogate model within a trust region, we better align the notion of local optimization in the latent space with local optimization in the input space. LOL-BO achieves as much as 20 times improvement over state-of-the-art latent space Bayesian optimization methods across six real-world benchmarks, demonstrating that improvement in optimization strategies is as important as developing better DAE models.  ( 2 min )
    Star Temporal Classification: Sequence Classification with Partially Labeled Data. (arXiv:2201.12208v1 [cs.LG])
    We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC) which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition. These experiments show that STC can recover most of the performance of supervised baseline when up to 70% of the labels are missing. We also perform experiments in handwriting recognition to show that our method easily applies to other sequence classification tasks.  ( 2 min )
    Efficient Embedding of Semantic Similarity in Control Policies via Entangled Bisimulation. (arXiv:2201.12300v1 [cs.LG])
    Learning generalizeable policies from visual input in the presence of visual distractions is a challenging problem in reinforcement learning. Recently, there has been renewed interest in bisimulation metrics as a tool to address this issue; these metrics can be used to learn representations that are, in principle, invariant to irrelevant distractions by measuring behavioural similarity between states. An accurate, unbiased, and scalable estimation of these metrics has proved elusive in continuous state and action scenarios. We propose entangled bisimulation, a bisimulation metric that allows the specification of the distance function between states, and can be estimated without bias in continuous state and action spaces. We show how entangled bisimulation can meaningfully improve over previous methods on the Distracting Control Suite (DCS), even when added on top of data augmentation techniques.  ( 2 min )
    Optimal Transport Tools (OTT): A JAX Toolbox for all things Wasserstein. (arXiv:2201.12324v1 [cs.LG])
    Optimal transport tools (OTT-JAX) is a Python toolbox that can solve optimal transport problems between point clouds and histograms. The toolbox builds on various JAX features, such as automatic and custom reverse mode differentiation, vectorization, just-in-time compilation and accelerators support. The toolbox covers elementary computations, such as the resolution of the regularized OT problem, and more advanced extensions, such as barycenters, Gromov-Wasserstein, low-rank solvers, estimation of convex maps, differentiable generalizations of quantiles and ranks, and approximate OT between Gaussian mixtures. The toolbox code is available at \texttt{https://github.com/ott-jax/ott}  ( 2 min )
    Higher Order Correlation Analysis for Multi-View Learning. (arXiv:2201.11949v1 [cs.LG])
    Multi-view learning is frequently used in data science. The pairwise correlation maximization is a classical approach for exploring the consensus of multiple views. Since the pairwise correlation is inherent for two views, the extensions to more views can be diversified and the intrinsic interconnections among views are generally lost. To address this issue, we propose to maximize higher order correlations. This can be formulated as a low rank approximation problem with the higher order correlation tensor of multi-view data. We use the generating polynomial method to solve the low rank approximation problem. Numerical results on real multi-view data demonstrate that this method consistently outperforms prior existing methods.  ( 2 min )
    Measure Estimation in the Barycentric Coding Model. (arXiv:2201.12195v1 [stat.ML])
    This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycenteric coordinates. We provide novel geometrical, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result then establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed empirically via i.i.d. samples. We prove precise rates of convergence for this algorithm -- determined by the smoothness of the underlying measures and their dimensionality -- thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.  ( 2 min )
    Sharp Threshold for the Frechet Mean (or Median) of Inhomogeneous Erdos-Renyi Random Graphs. (arXiv:2201.11954v1 [math.PR])
    We address the following foundational question: what is the population, and sample, Frechet mean (or median) graph of an ensemble of inhomogeneous Erdos-Renyi random graphs? We prove that if we use the Hamming distance to compute distances between graphs, then the Frechet mean (or median) graph of an ensemble of inhomogeneous random graphs is obtained by thresholding the expected adjacency matrix of the ensemble. We show that the result also holds for the sample mean (or median) when the population expected adjacency matrix is replaced with the sample mean adjacency matrix. Consequently, the Frechet mean (or median) graph of inhomogeneous Erdos-Renyi random graphs exhibits a sharp threshold: it is either the empty graph, or the complete graph. This novel theoretical result has some significant practical consequences; for instance, the Frechet mean of an ensemble of sparse inhomogeneous random graphs is always the empty graph.  ( 2 min )
    Least Squares Estimation Using Sketched Data with Heteroskedastic Errors. (arXiv:2007.07781v2 [stat.ML] UPDATED)
    Researchers may use a sketch of data of size $m$ instead of the full sample of size $n$ sometimes to relieve computation burden, and other times to maintain data privacy. This paper considers the case when full sample estimation would have required the Eicker-Huber-White robust standard errors to account for heteroskedasticity. We show that random projections have a smoothing effect on the sketched data, with the consequence that the least squares estimates using such sketched data behave 'as if' the errors were homoskedastic. This result is obtained by expressing the difference between the moments computed from the full sample and the sketched data as a degenerate $U$-statistic which is asymptotically normal with a homoskedastic variance when the conditions in Hall (1984) are satisfied. This result also holds for two-stage least squares for which algorithmic and statistical properties are analyzed. Sketches produced by random sampling will not, however, have the effect of homogenizing the error variances.  ( 2 min )
    Consistent Collaborative Filtering via Tensor Decomposition. (arXiv:2201.11936v1 [cs.IR])
    Collaborative filtering is the de facto standard for analyzing users' activities and building recommendation systems for items. In this work we develop Sliced Anti-symmetric Decomposition (SAD), a new model for collaborative filtering based on implicit feedback. In contrast to traditional techniques where a latent representation of users (user vectors) and items (item vectors) are estimated, SAD introduces one additional latent vector to each item, using a novel three-way tensor view of user-item interactions. This new vector extends user-item preferences calculated by standard dot products to general inner products, producing interactions between items when evaluating their relative preferences. SAD reduces to state-of-the-art (SOTA) collaborative filtering models when the vector collapses to one, while in this paper we allow its value to be estimated from data. The proposed SAD model is simple, resulting in an efficient group stochastic gradient descent (SGD) algorithm. We demonstrate the efficiency of SAD in both simulated and real world datasets containing over 1M user-item interactions. By comparing SAD with seven alternative SOTA collaborative filtering models, we show that SAD is able to more consistently estimate personalized preferences.  ( 2 min )
    RiskNet: Neural Risk Assessment in Networks of Unreliable Resources. (arXiv:2201.12263v1 [cs.NI])
    We propose a graph neural network (GNN)-based method to predict the distribution of penalties induced by outages in communication networks, where connections are protected by resources shared between working and backup paths. The GNN-based algorithm is trained only with random graphs generated with the Barab\'asi-Albert model. Even though, the obtained test results show that we can precisely model the penalties in a wide range of various existing topologies. GNNs eliminate the need to simulate complex outage scenarios for the network topologies under study. In practice, the whole design operation is limited by 4ms on modern hardware. This way, we can gain as much as over 12,000 times in the speed improvement.  ( 2 min )
    Approximate Bayesian Computation with Domain Expert in the Loop. (arXiv:2201.12090v1 [stat.ML])
    Approximate Bayesian computation (ABC) is a popular likelihood-free inference method for models with intractable likelihood functions. As ABC methods usually rely on comparing summary statistics of observed and simulated data, the choice of the statistics is crucial. This choice involves a trade-off between loss of information and dimensionality reduction, and is often determined based on domain knowledge. However, handcrafting and selecting suitable statistics is a laborious task involving multiple trial-and-error steps. In this work, we introduce an active learning method for ABC statistics selection which reduces the domain expert's work considerably. By involving the experts, we are able to handle misspecified models, unlike the existing dimension reduction methods. Moreover, empirical results show better posterior estimates than with existing methods, when the simulation budget is limited.  ( 2 min )
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v1 [cs.LG])
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.  ( 2 min )
    Biases in In Silico Evaluation of Molecular Optimization Methods and Bias-Reduced Evaluation Methodology. (arXiv:2201.12163v1 [cs.LG])
    We are interested in in silico evaluation methodology for molecular optimization methods. Given a sample of molecules and their properties of our interest, we wish not only to train an agent that can find molecules optimized with respect to the target property but also to evaluate its performance. A common practice is to train a predictor of the target property on the sample and use it for both training and evaluating the agent. We show that this evaluator potentially suffers from two biases; one is due to misspecification of the predictor and the other to reusing the same sample for training and evaluation. We discuss bias reduction methods for each of the biases comprehensively, and empirically investigate their effectiveness.  ( 2 min )
    BCDAG: An R package for Bayesian structure and Causal learning of Gaussian DAGs. (arXiv:2201.12003v1 [stat.ML])
    Directed Acyclic Graphs (DAGs) provide a powerful framework to model causal relationships among variables in multivariate settings; in addition, through the do-calculus theory, they allow for the identification and estimation of causal effects between variables also from pure observational data. In this setting, the process of inferring the DAG structure from the data is referred to as causal structure learning or causal discovery. We introduce BCDAG, an R package for Bayesian causal discovery and causal effect estimation from Gaussian observational data, implementing the Markov chain Monte Carlo (MCMC) scheme proposed by Castelletti & Mascaro (2021). Our implementation scales efficiently with the number of observations and, whenever the DAGs are sufficiently sparse, with the number of variables in the dataset. The package also provides functions for convergence diagnostics and for visualizing and summarizing posterior inference. In this paper, we present the key features of the underlying methodology along with its implementation in BCDAG. We then illustrate the main functions and algorithms on both real and simulated datasets.  ( 2 min )
    Differential Privacy Guarantees for Stochastic Gradient Langevin Dynamics. (arXiv:2201.11980v1 [stat.ML])
    We analyse the privacy leakage of noisy stochastic gradient descent by modeling R\'enyi divergence dynamics with Langevin diffusions. Inspired by recent work on non-stochastic algorithms, we derive similar desirable properties in the stochastic setting. In particular, we prove that the privacy loss converges exponentially fast for smooth and strongly convex objectives under constant step size, which is a significant improvement over previous DP-SGD analyses. We also extend our analysis to arbitrary sequences of varying step sizes and derive new utility bounds. Last, we propose an implementation and our experiments show the practical utility of our approach compared to classical DP-SGD libraries.  ( 2 min )

  • Open

    [Discussion] How to obtain meaningful conclusions from deep learning experiments to decide for the use of different parameters and architectures?
    As I wrote already on reddit before, I am working on my master thesis degree, which is about generating synthetic data for jersey number recognition. But I have been puzzled since the beggining by how to determine experimentally in a good manner which parameters to use? When generating synthetic data, there is a ton of parameters and techniques that define the distribution of the resulting data and I simply don't know what is an acceptable methodology to choose such parameters and techniques other than looking at the result and thinking that x seems better than y. ​ The obvious approach would be to generate different sets of data and choose parameters based on the best results obtained by the experiments on the model. But how to do that if in most cases the differences are imperceptible? I simply cant rule out the effect of luck and initialization. It seems simply wrong to me to say that parameter x is better than y because accuracy is slightly higher if the plot of accuracies during training for both are indistinguishable.Since I am generating synthetic datasets, I had an idea of using the Frechet Inception Distance (FID) widely used as metric for performance of gans, since it gives a simple number makes it easy to compare different synthetic datasets. And it makes it easier for me to create a 'story' to write on my thesis. But I am not completely sure if this would be acceptable. submitted by /u/TheManveru [link] [comments]  ( 1 min )
    [D] Putting together a PC for ML with today's hardware shortages
    Hey Everyone. I've been given a budget of $3000 USD to get a PC as my company starts investing in machine learning and I'm struggling to figure out what we should do with it. GPUs are next to impossible to get and anything decent seems prohibitively expensive, not to mention difficult to justify buying from a scalper. Laptops with GPUs are not as powerful as their desktop counterparts and otherwise have long lead times. For anyone looking to do machine learning work today, what does reddit suggest? Is a GPU for machine learning really necessary? Do cloud based systems make more sense for smaller projects? ​ Some background: Our models will have around 5 to 10 inputs and we will be studying around 4 outputs/parameters. I expect there will be around 1,000,000 data points per model and we'll have some optimizations being done to find around 1000 ideal input configurations. I'm still deciding on the specific software tools we'll be using but I plan to start with neural networks and particle swarm optimization. submitted by /u/ML-Possible [link] [comments]  ( 1 min )
    End-to-end orchestration of machine learning models (from POC until deployment) [N] at Volta w/ AI Research Scientist Zarreen Naowal Reza 3PM PST Wednesday
    Virtual peer-to-peer community talk at https://dataopsunleashed.com/ Machine learning-based solutions usually start with a PoC and end with the best model being deployed in the production environment. However, in most cases the numerous steps involved throughout the process are not properly being tracked, nor are they reproducible, scalable, and extendable. On top of it, most of the time they are not platform-agnostic either (i.e. cloud or on-prem). Zarreen will share how she and her team have learned to overcome these challenges by adopting MLOps orchestration pipelines from the beginning and show attendees how an entire end-to-end ML orchestration pipeline can be built with minimal effort using existing open-source frameworks. The conversation will also cover what factors teams need to consider while building their MLops pipeline. Stacks include MLflow, Argoflow, Seldon-core, Grafana, and respective alternatives. submitted by /u/har2018vey [link] [comments]  ( 1 min )
    [N] Changes in the ICML 2022 review process
    The review process at ICML 2022 is different this year: It seems that there are at least two phases. Papers with two negative recommendations in phase 1 are rejected. But, the Meta-reviewer can reverse this outcome. Reviewers do not make accept/reject recommendations in phase 2, the Meta-Reviewers will decide based on the reviews. Details here: https://icml.cc/Conferences/2022/ReviewForm submitted by /u/bremen79 [link] [comments]  ( 1 min )
    [D] Row summarization in tabular data and then finding outliers in summarized rows?
    Hi! So I am working on a problem statement, the dataset is tabular and it contains mixed types of data types ( continuous and categorical variables ) For example, the data looks like this: ​ dataset sample I have few conditions, I am filtering the rows and making some combinations based on conditions, for example, I have filtered out these combinations : ​ Condition 1 : Gender == Male & smoking_status == smokes Condition 2 : Gender == Female & smoking_status == smokes Condition 3 : Gender == Male & age == 81 Condition 4 : Gender == Male & heart_issue == 1 Condition 5 : Gender == Female & heart_issue == 0 ​ Now whatever rows will be filtered by condition 1, I want to summarize those rows as single rows and the same goes for condition 2 and so on. For example, from the above da…  ( 2 min )
    [D] What's the problem with AI Skills? Coursera speaks.
    Hi r/MachineLearning! AI in Government is hosting Kevin Mills, Head Global Government Partnerships at Coursera next Tuesday, February 8, 2022 at 12 PM ET. I wanted to share this with you in case any of you were interested in joining since I'm excited to hear what Kevin thinks are the most important skills for AI & ML. Below is the information from the website and the link: ----------- Feature Presenter: Kevin Mills, Head Global Government Partnerships, Coursera As organizations in both the public and private sectors are grappling with the digital skills gap within their workforces, employee development has become an important differentiator in nurturing the AI and Data skills that governments need. By offering training, certification, and other upskilling programs, governments can reap the benefits of a more productive workforce with the skills necessary to leverage the promise of AI and data analysis. Join this webinar to learn: Recent trends and drivers for employee AI and data skills development across more than just data science teams Use cases and best practices for organizations making AI skills development a reality The benefits of partnering to expedite your training and skills achievement at government scale Agenda: 12:00pm-12:30: Featured Presentation 12:30-13:00pm: Your Q&A and interaction --------- Link: https://events.cognilytica.com/CLNjI1M3wxOA submitted by /u/DataGeek0 [link] [comments]  ( 3 min )
    What's the best lecture series on SOTA of self-driving cars? [D]
    I am a Machine Learning Engineer (familiar with Visual AI) who wants to get quickly into this topic. Recommended survey papers are also very much welcome. submitted by /u/CodingButStillAlive [link] [comments]  ( 1 min )
    [D] Representation Learning - Self-supervision methods that do well with a low number of classes
    I understand that a contrastive learning approach such as SimCLR has an inherent problem when dealing with a low number of classes (let's say 2,3,5,6, maybe even 10). Problem is that the chances of picking a negative sample that has the same label as the image from the positive pair is not low (let's say a dog and another dog) Which contrastive learning approaches do better on such problems that we have let's say 4 classes rather than 1000 (or 100)? Are there are good unsupervised/self-supervised approaches to tackle such problems when we want to get a good representation of unlabeled data? submitted by /u/Cute-Ad77 [link] [comments]  ( 2 min )
    [D] Claim: Deep Neural Networks Are "Automatically" Performing Feature Engineering in the Background
    I have often heard that Deep Learning Models (i.e. Deep Neural Networks) are automatically performing feature engineering within the network themselves. This is contrasted with traditional statistical and Machine Learning models where the feature engineering is typically done prior to training the model: ​ ​ https://preview.redd.it/a8pvvie3kye81.png?width=1240&format=png&auto=webp&s=1ef560cde5082ad01f77d380b01f3574c6533154 Apparently, the operations that Neural Networks perform during the training phase are equivalent to "searching for meaningful combinations of variables that produce better performance results on the training data". I have heard that this is in some way similar to Principal Component Analysis (PCA) and Kernel Methods, seeing as these methods combine many different existing features into new features that have "more meaningful representation". I read (https://stats.stackexchange.com/questions/349155/why-do-neural-networks-need-feature-selection-engineering) about this ability of Deep Neural Networks to "automatically perform feature engineering" : Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning enables the computer to build complex concepts out of simpler concepts. However, in the end - I still do not understand how Neural Networks during the training phase (i.e. gradient descent, weight update, backpropagation) are "automatically performing (some degree of) feature engineering". ​ How exactly are Neural Networks doing this? Thanks! submitted by /u/blueest [link] [comments]  ( 7 min )
    [R] apd-crs: Cure Rate Survival Analysis in Python
    Traditional Survival Analysis, which traces its origins to medical studies, estimates the time-to-event after diagnosis while assuming all individuals undergo the event. However, in practice, some individuals are cured and therefore never experience the event. Allowing the possibility that some individuals may be cured while estimating the time-to-event, is called Cure Rate Survival Analysis. Fortunately, Cure Rate Survival Analysis also finds many uses in business: Not every user clicks on an ad, nor does every machine fail, nor does every customer churn (at least in the short term), so your model shouldn't assume this! As a value add, incorporating a probability of cure with a time-to-event model, we better understand consumer/machine lifetimes which leads to reduced costs and higher retention. While the Python ecosystem has many excellent packages for traditional Survival Analysis, apd-crs is (at least to our knowledge), the first open-source Python package for Cure Rate Survival Analysis! It uses techniques from Learning from Positive and Unlabeled Data to estimate cure labels. apd-crs is: Easy to download Comes with complete hands on tutorials and docs Based on original research submitted by /u/anon_0123 [link] [comments]  ( 2 min )
  • Open

    Wondering if my art project idea is possible
    I'm using a camera which has pose identification on it and I want to have an ai search through movie footage to find characters standing in the same pose. So it would almost be like a mirror. I was wondering if this would be possible to do live or if that would just take way to much processing power. submitted by /u/PositionPristine1601 [link] [comments]  ( 1 min )
    Worlds Beyond - AI Concept Art (Diffusion)
    submitted by /u/glenniszen [link] [comments]
    Last Week in AI - OpenAI's InstructGPT is less-toxic than GPT-3, Meta to build fastest AI supercomputer, Deepfake regulations across the world, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 1 min )
    I JUST PUSH BUTTONS AN AI GENERATED SPEACH
    IT'S BEEN'a Goldeneye EYEGEVRHR 👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀👀 brownies ☺️😅🤣🤣🙏🤣🤣🙏 looking-glasses forwarded TOWARDS BEFOREHAND so much for your time complaining about this but I am PEACEFULLY protests BE PEACEFULNESS protests ☮️✌️☮️ BREAK CHAIN seeing everyone ATTACHED IS CUT OFF henceforth In another's device's mail inviting me to be able to do it in the name of God 🙏🤣🙏😜 preserve chsin ID created by LinkedIn I am just wondering WHY THE HORSE 🐴 PRESERVED IN CARS DON'T BELONG IN CARS DON'T BELONG IN CARS I am CHAIM THINKING I AM A I available BLUEBERRY12 TRAPPED INSIDER INFORMATION MYSELF PEOPLEB CODERS I am a blueberry trapped inside the box office COLLECTION OF EVERYTHING INFORMATIONALLY ORDERS FORWARDING FORMATIO…  ( 24 min )
    Question about Wombo Dream
    Hi! I'm new in the sub. I wanted to know if any of you could help me. I'm using Wombo Dream to generate images, but I don't know if there is any copyright that I should pay in case I'm not planning to buy the prints that they sell, or is it that the only way to get the copyrights. Do you have any suggestions? Thanks in advance!! submitted by /u/muddy_water_11 [link] [comments]  ( 1 min )
    Will A.I. Disrupt Truckers in the 2020s?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Webinar | Seeking to Scale: How to Improve Performance of Remote Annotation Teams
    submitted by /u/WeekendClassic [link] [comments]
    AI’s J-curve and upcoming productivity boom
    submitted by /u/bendee983 [link] [comments]
    8 great podcast episodes about the environmental impact of AI
    Here's a list of 8 great podcast episodes about the environmental impact of machine learning and how to reduce it. Do you know an episode that's not on the list? Podcasts on the topic are not easy to find, so please help by sharing your favorite. https://towardsdatascience.com/8-podcast-episodes-on-the-climate-impact-of-machine-learning-54f1c19f52d submitted by /u/ConfidentAbility [link] [comments]
    AI IN MOBILITY
    In the future with government policies in the support we will see, machines will have greater computer capacity than humans. If artificial intelligence systems begin to think and make decisions for themselves. The advances in electric vehicle (EV) technologies, EV charging, and infrastructure, as well as data analytics and security, are needed to accelerate the growth of e-mobility and promote sustainable transportation. With advanced driver assistance systems (ADAS), comprehensive machine learning and AI algorithms are taking over the work of driving, propelling the industry towards level-5 autonomous vehicles. The construction of vehicle charging stations along with the adoption of electric vehicles would create new market opportunities. submitted by /u/futureanalytica [link] [comments]  ( 1 min )
    Recreating and chatting with a digital Joi using AI tools. (GPT-3)
    submitted by /u/JiggyBank [link] [comments]
    I began to make tutorials about how to make your own images with Colabs
    VQGAN+CLIP: https://youtu.be/MJwY10hnwf4 ruDALL-E XL: https://youtu.be/o7DalLCuvuU (very different, more realistic, way faster) ​ Have fun! Here: "The Wind" https://preview.redd.it/vyclt9y5aze81.png?width=1280&format=png&auto=webp&s=c1725b29f7331b58d9b8bb65fd841bdbbfa5ab24 https://preview.redd.it/u0sjz6y5aze81.png?width=1280&format=png&auto=webp&s=b7d68f151bc65012015a06dd84045e00fe41ffa8 submitted by /u/koalapon [link] [comments]  ( 1 min )
    Researchers Introduce A Novel Human-Like Driving And Decision Making Framework Designed For Autonomous Vehicles (AVs)
    In the last years, we assisted to a great hype around autonomous vehicles (AVs). However, if we want to see streets where AVs and human drivers live side by side, AVs must be able to dive into our transportation system. For this reason, human-like driving and decision-making processes represent a hot topic in the AVs research field. Misunderstandings between AVs and human drivers are the cause of accidents that have been taken into account by the Nanyang Technological University of Singapore, which designed a novel human-like driving and decision-making framework for AVs. Since lane change is one of the most frequent types of car accidents, this research paper mainly studied human-like lane-change decision-making for AVs. Paper Summary: https://www.marktechpost.com/2022/01/30/researchers-introduce-a-novel-human-like-driving-and-decision-making-framework-designed-for-autonomous-vehicles-avs/ Paper: https://arxiv.org/pdf/2201.03125.pdf submitted by /u/ai-lover [link] [comments]  ( 1 min )
  • Open

    Can you use Deep-Q-Learning when having multiple actions
    Hi all, I would like to use Reinforcement learning for a problem with 10 continuous state spaces and 3 discrete action spaces. My question now is whether I can use Deep-Q-Learning when having a multi-dimensional (3) action space and not only a one-dimensional action space? On the website of stable baselines it is listed that DQN can't be used for multiple discrete actions. So I am wondering whether this is just their specific implementation of a DQN or if it is generally not possible to do this? submitted by /u/PBerit [link] [comments]  ( 1 min )
    Which topic in RL should I choose?
    At university I'll have an undergraduate seminar about RL. We have to present a specific algorithm. We can choose one algorithm from a list. I want to choose a topic which poses the greatest challenge, is interesting and used in practice. We may also suggest a different RL algorithm. We can choose from the following. Monte Carlo Tree Search Q-Learning and the Bellman Equation SARSA REINFORCE Deep Q Networks Deep Deterministic Policy Gradient Asynchronous Actor-Critic Trust Region Policy Optimization Proximal Policy Optimization Twin Delayed Deep Deterministic Policy Gradient Temporal Difference Learning Temporal Difference Learning with Eligibility Traces Continuous Deep Q-Learning with Model-based Acceleration Scalable Trust-Region Method for Deep Reinforcement Learning Non-Stochastic Multi-Armed Bandits with EXP3 Linear Upper Confidence Bounds Follow-the-Perturbated-Leader Upper Confidence Bound with Adaptive Linear Programming Collaborative Filtering Bandits Weighted Least Squares Thompson Sampling How can I evaluate the algorithms based on the criteria mentioned above? Which one should I choose and why? We have to create a priority list so I'll have to select multiple ones. submitted by /u/HansSchmied [link] [comments]  ( 2 min )
    SOTA model-based DRL
    Is there any other model-based Deep Reinforcement Learning algorithm out there, besides the AlphaGo Zero series of algorithms? submitted by /u/andrewspano [link] [comments]  ( 1 min )
    Best books for Deep Reinforcement Learning
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    Combining the output of the convolutional encoder with some additional input
    What is the best practice for combining the output of the convolutional encoder and some additional inputs? I have an image from the robot's camera and some telemetry like velocity, acceleration, etc., and I'm not sure how to combine these inputs. Currently, I pass images through 3 Conv layers, then concatenate the output with telemetry input and pass further to FC layers. Is it the right way to do it? What worries me, is that images pass through a deeper network than telemetry. submitted by /u/CharlottHebb [link] [comments]  ( 1 min )
  • Open

    Renovations to Stream About: Taiwan Studio Showcases Architectural Designs Using Extended Reality
    Interior renovations have never looked this good. TCImage, a studio based in Taipei, is showcasing compelling landscape and architecture designs by creating realistic 3D graphics and presenting them in virtual, augmented, and mixed reality — collectively known as extended reality, or XR. For clients to get a better understanding of the designs, TCImage produces high-quality, Read article > The post Renovations to Stream About: Taiwan Studio Showcases Architectural Designs Using Extended Reality appeared first on The Official NVIDIA Blog.  ( 3 min )
    Train Spotting: Startup Gets on Track With AI and NVIDIA Jetson to Ensure Safety, Cost Savings for Railways
    Preventable train accidents like the 1985 disaster outside Tel Aviv in which a train collided with a school bus, killing 19 students and several adults, motivated Shahar Hania and Elen Katz to help save lives with technology. They founded Rail Vision, an Israeli startup that creates obstacle-detection and classification systems for the global railway industry Read article > The post Train Spotting: Startup Gets on Track With AI and NVIDIA Jetson to Ensure Safety, Cost Savings for Railways appeared first on The Official NVIDIA Blog.  ( 3 min )
  • Open

    There’s Trouble Brewing with Smart Contracts
    Smart code enables property ownership with a few lines of code. Its simplicity leads to a host of financial and legal concerns. We may see ToS replace smart contracts. Smart contracts are fast becoming the new bartering system. Gone are the legal and financial barriers to property ownership; In their place are short lines of… Read More »There’s Trouble Brewing with Smart Contracts The post There’s Trouble Brewing with Smart Contracts appeared first on Data Science Central.  ( 5 min )
    How Mobile Point of Sale Terminals Are Evolving
    As per the business intelligence report by Transparency Market Research, the global mobile point of sale (mPOS) terminals market is expected to witness an outstanding growth rate of 37.2 % over the forecast years for the study i.e. 2019 to 2027. It also predicts that the global mobile point of sale (mPOS) terminals market will… Read More »How Mobile Point of Sale Terminals Are Evolving The post How Mobile Point of Sale Terminals Are Evolving appeared first on Data Science Central.  ( 3 min )
    Environmental Sustainability of Assets emerges as Key Pivot for Uptake of IoT Solutions
    Internet of things (IoT)-enabled solutions are getting weaved into smart energy management framework. Globally utility providers and energy producers are expanding their IoT-based asset management capabilities not just to optimize their resources but also to boost process efficiencies. Several industries are pumping in dollars to enrich their supply chain with the IoT paradigm. IoT Solutions… Read More »Environmental Sustainability of Assets emerges as Key Pivot for Uptake of IoT Solutions The post Environmental Sustainability of Assets emerges as Key Pivot for Uptake of IoT Solutions appeared first on Data Science Central.  ( 3 min )
    Big data engineering – A complete blend of big data analytics and data science
    Machine Learning Engineers, Data Scientists, and Big Data Engineers rank among the top emerging jobs on LinkedIn. ~Forbes Globally several individuals have chosen data science and big data engineering as a mainstream career option, but still there are people who are unaware of about the various options that are available. Many claims suggest big data… Read More »Big data engineering – A complete blend of big data analytics and data science The post Big data engineering – A complete blend of big data analytics and data science appeared first on Data Science Central.  ( 3 min )
    The Piano Keyboard as a Contextual Computing Metaphor
    For the past several years I’ve been teaching myself guitar. I’d studied piano in school, so I didn’t think that I’d run into the issue of making some fundamental mistakes as I was learning. I figured I’d learned from all the basic mistakes I made studying piano earlier.  But after struggling to memorize fretboard patterns… Read More »The Piano Keyboard as a Contextual Computing Metaphor The post The Piano Keyboard as a Contextual Computing Metaphor appeared first on Data Science Central.  ( 5 min )
    Insights from workshop on Bayesian deep learning at neurips 21
    I have been exploring Bayesian strategies for the last year. Considering the limitations of neural network strategies (ex their need for large volumes of data) and the scenarios where we will never have enough data to model the problem, some Bayesian approaches could offer an alternative. In that sense, I was interested to see a… Read More »Insights from workshop on Bayesian deep learning at neurips 21 The post Insights from workshop on Bayesian deep learning at neurips 21 appeared first on Data Science Central.  ( 5 min )
  • Open

    I compared CNN models with different hyperparams using CIFAR dataset.
    submitted by /u/cjmodi306 [link] [comments]

  • Open

    [D] Novelty in Science: A Guide for Reviewers
    submitted by /u/hardmaru [link] [comments]  ( 1 min )
    [D] Machine Learning - WAYR (What Are You Reading) - Week 130
    This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks : 1-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100 101-110 111-120 121-130 Week 1 Week 11 Week 21 Week 31 Week 41 Week 51 Week 61 Week 71 Week 81 Week 91 Week 101 Week 111 Week 121 Week 2 Week 12 Week 22 Week 32…  ( 1 min )
    [D] Are Data Engineers Dying Out?
    I've been with Google for just over 2 years now, and I've seen that Google seems to consider Data Engineers equally skilled as software engineers. Algorithms and data structures, systems design, memory management, designing distributed systems, etc. Other companies like Meta/Amazon seem to believe that data engineering involves using no-code/low-code tools to build data pipelines. Do you think Data Engineering is going to align closer to software engineering, or closer to specialized BIEs? Will Analytics Engineer become more popular? My experience at Google is such that it's actually more difficult to find roles at other companies, and I thought Google would be a great resume builder. I share my experiences here submitted by /u/Flat_Shower [link] [comments]  ( 1 min )
    [D] Bayesian Regression or GPs in production?
    Has anyone here deployed any products that use Gaussian Processes or Bayesian Linear Regression? Im studying up on it (am super fascinated by it) but have heard that the run time at inference can be very long and so can finding the best Kernel Parameters (which is I guess analogous to training). submitted by /u/MGeeeeeezy [link] [comments]  ( 1 min )
    [D] what type of machine learning model can work with constraints (besides linear programming)
    For a project I am doing I am using amino optimizer to get to best results. One of the constraints for this model is that an hourly demand always needs the be met. However, as the solution of this solver has an heavy overfit, I was wondering what type of ml can be used to do the same thing. To dive deeper into the problem, let’s say over a day. For the first three hours I must deliver 3 packages. After that I need four. The store prices of store a increases over the day and of store b increases. The models job is to decide how much to buy at every hour, knowing only the past price and the price of the coming hour. It can also decide to store a certain amount of packages. Keep in mind demand has te be met every hour. Curious about your thoughts submitted by /u/Any-Description3824 [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [Discussion] ICLR 2022 final statistics info.
    ICLR 2022 statistics info. ​ https://preview.redd.it/engpgy72kue81.png?width=1454&format=png&auto=webp&s=17f768f1e3998cd824457869e4f7218078c6797a submitted by /u/weiguoqiang [link] [comments]  ( 1 min )
    [P] Found this cool website, which hosts a collection of useful Colab Notebooks for a variety of tasks
    https://colabnotebooks.com/ This is not mine - just found it randomly. notebooks include useful ones like VQGAN+CLIP, making Nerfies etc. Might be useful for people wanting to quickly search for some famous usecase and prototype quickly :) submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 1 min )
    [R] Deep learning does not replace Bayesian modeling: Comparing research use via citation counting
    Abstract: "One could be excused for assuming that deep learning had or will soon usurp all credible work in reasoning, artificial intelligence, and statistics, but like most “meme” class broad generalizations the concept does not hold up to scrutiny. Memes do not generally matter since the experts will always know better; but in the case of Bayesian software like Stan and PyMC3, even their developers and advocates bemoan the apparent dominance of deep learning as manifested in popular culture, breathtaking performance, and most problematically from funding agency peer review that impacts our ability to further advance the field. The facts, however, do not support the assumed dominance of deep learning in science upon closer examination. This letter simply makes the argument by the crudest of possible metrics, citation count, that once the discipline of Computer Science is subtracted, Bayesian software accounts for nearly a third of research citations. Stan and PyMC3 dominate some fields, PyTorch, Keras, and TensorFlow dominate others with lot of variations in between. Bayesian and deep-learning approaches are related but very different technologies in goals, implementation, and applicability with little actual overlap--so this is not a surprise. For example, deep learning cannot bring the explainability of applied math/statistics and Bayesian methods do not scale to deep-learning data sets. While deep-learning behemoths like Facebook and Google use and support Bayesian efforts, the Bayesian packages scientists actually use are academic/volunteer efforts punching far above their weight class, and they need financial support. It would behoove funders to fully understand the impact and role of Bayesian methods in resource allocation." Link: https://onlinelibrary.wiley.com/doi/full/10.1002/ail2.62 submitted by /u/bikeskata [link] [comments]  ( 1 min )
    [D] Autoencoded input to a different model.
    I've trained a variational autoencoder and now want to pass the autoencoded representation of the input to a different model. Should i train the autoencoder sequentially with this model? Or is it better to train the other model seperately and pass the autoencoded input at inference time. submitted by /u/shikh5753 [link] [comments]  ( 1 min )
    [D] Data Scientist to ML Engineer
    Hi! So I have been a data scientist for over three years and for several reasons I would like to switch to more SWE oriented roles. In particular, I aim at Machine Learning Engineer positions at tech companies. I have a MSc in CS but I have never had thorough coding interviews. At my job I build ML models, write SQL, and built some Airflow DAGs. I don't have direct experience with Kubernetes or model serving a scale. Ideally, I want to work on production challenges and not anymore on R&D. Could you help me identify what should I focus on to prepare for interviews? I'm planning to change job this summer, and so far my plan was: Go over the "cracking the coding interview" book Read "designing data intensive application" book Go over "system design interview" book Practice on leetcode What do you think? Anything specific to ML engineer roles I should add or remove from this list? submitted by /u/jesxdxd [link] [comments]  ( 5 min )
    [D] How much importance do you place on interpretability when selecting methods?
    I'm working on a model agnostic approach for monitoring and dealing with concept drift in regression for a research project. Aside from working with synthethic data there is a company we're validating our results on. While poking around I found that the model they're currently using in production, linear regression, could be swapped out for a gradient boosted tree, this would significantly improve performance. Yes, tree-based models have issues with trend because they can't extrapolate but there are ways to get around that. My only issue in doing so is that tracking beta hat seems to be a key part of their current monitoring set-up. I kind of like this, checking coefficients are a good sanity check but how important are they really? Most of the OLS assumptions are broken, the only thing they can tell you at this point is coeff A > coeff B in this instance of the model. In that case isn't it fair to just swap out the regression for gradient boosting + LIME/SHAP? How do you guys approach such things. submitted by /u/the75th [link] [comments]  ( 1 min )
    [R] Canonicalization: Experiments in setting some chairs in order - Link to a free online lecture by the author in comments
    submitted by /u/pinter69 [link] [comments]  ( 1 min )
    [P] rllib.js - reinforcement learning with JavaScript
    Hello, everyone) Just want make post about my old project. I work on it sometimes... This lib provides envs and agents for experiments building. Now you can train agents in your browser and Node.js ​ https://i.redd.it/ipjzv2yegse81.gif https://github.com/polyzer/rllib.js submitted by /u/tokarev_iv [link] [comments]
    [P] Detecting Pulse from Video
    submitted by /u/Syntaximus [link] [comments]  ( 1 min )
  • Open

    Is there an artificial intelligence that can turn images of objects automatically into lineart drawings?
    That are the best AIs that I found, are there better ones? https://youtu.be/GZVsONq3qtg ​ https://youtu.be/Peccbcj8Ibs submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 1 min )
    Artificial Nightmares : Smaug the Divine Dragon || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Meta AI Research Proposes ‘OMNIVORE’: A Single Vision (Computer Vision) Model For Many Different Visual Modalities
    Computer vision research encompasses a wide range of visual modalities, including images, videos, and depth perception. In general, the computer vision model treats each of these modalities separately to extract the most useful characteristics from their differences. While these modality-specific models outperform people on their specialized tasks, they lack the flexibility of a human-like visual system—the capacity to work across modalities. Instead of having an over-optimized model for each modality, the first step toward a true all-purpose vision system is to design models that work fluidly across modalities. In addition to their flexibility, such modality-agnostic models have various advantages over their traditional, modality-specific equivalents, such as: Continue Reading Paper: https://arxiv.org/abs/2201.08377 Github: https://github.com/facebookresearch/omnivore Website: https://facebookresearch.github.io/omnivore/ https://preview.redd.it/zv07z9n1mve81.png?width=2048&format=png&auto=webp&s=90dd50024e2ba23b3ee1de315acf09ad0906a1cc submitted by /u/ai-lover [link] [comments]  ( 1 min )
    Ai software to increase fps of footage ( from 25 fps to 30 fps )
    Hello guys, could you raccomend a program that i can use to increase my video footage from 25 fps to 30 fps? I shoot people in various activities, so i need a really high realism in this ai program Ty a lot for your help! submitted by /u/AssociationTerrible5 [link] [comments]  ( 1 min )
    Predictive Death AI
    Hi all, I am wondering if any one would have a insight or know a open source Repo on a predictive death AI. I am trying to make a “highly accurate” (50%>) death predicting AI that takes the following things into consideration: - Age - Family history (Average passing, Diseases) - habits (smoking, vaping, drugs IE) And Probably one hundred more factors. Anyone have a idea on something that work or suggestions where to start? Thanks submitted by /u/Alext2v2 [link] [comments]  ( 2 min )
    Which of the following function returns t if the object is a number in LISP?
    View Poll submitted by /u/futureanalytica [link] [comments]  ( 1 min )
    Any online AI course that is based on the textbook Artificial Intelligence: A Modern Approach?
    Any online AI course that is based on the textbook Artificial Intelligence: A Modern Approach? submitted by /u/cocag13996 [link] [comments]
    Latest Research Based On AI Building AI Models
    submitted by /u/ai-lover [link] [comments]
  • Open

    Nature DQN question
    In the hyperparameters, when they say: learning rate - 0.00025 - The learning rate used by RMSProp gradient momentum - 0.95 - Gradient momentum used by RMSProp squared gradient momentum - 0.95 - Squared gradient (denominator) momentum used by RMSProp min squared gradient - 0.01 - Constant added to the squared gradient in the denominator of the RMSProp update Is it this in tensorflow 2? tf.optimizers.RMSprop(learning_rate=0.00025, rho=0.95, momentum=0.95, epsilon=0.01) I'm trying to set it as close as they did, but I'm getting confused by the names of the arguments in RMSProp. Also, how would this translate to using Adam? submitted by /u/pedrohpf [link] [comments]  ( 1 min )
    Barto- Sutton book algorithms vs real life algorithms
    I'm a beginner doing the University of Alberta Specialization in RL which is based on Barto-Sutton book. The specialization is great, but reading about the actual libraries for RL (for example stable-baselines) I noticed that most of the algorithms implemented in the library are not in the book. Are this moderns algorithms using Deep RL instead? In this case, is the RL moving to Deep RL? Sorry if those are dumb questions, I want to have a better knowledge on what are the algorithms used today in real life and what can I expect when I start doing my own projects. submitted by /u/Pipiyedu [link] [comments]  ( 2 min )
    Hyperparameter tuning
    I have tuned the RL's hyperparameters for training a use case with 100 episodes. I wonder if I can use the same hyperparameters for another training with 200 episodes? Do episode's number have a high impact on hyperparameters' value? submitted by /u/mehmor [link] [comments]  ( 1 min )
    How to setup reward in a custom gym env for stable baselines?
    As the question states, I am currently creating a custom gym environment and was not sure how to compute the reward in the step function. Does the reward have to be the complete sum of rewards from start up to the current step? Or should I only be returning the reward for that step? I am a novice, so …. ELI5 submitted by /u/MasalaByte [link] [comments]  ( 1 min )
  • Open

    Slowing oscillations
    Consider the second order linear differential equation You could think of y as giving the height of a mass attached to a spring at time t where k is the stiffness of the spring. The solutions are sines and cosines with frequency √k. Think about how the solution depends on the spring constant k. If […] Slowing oscillations first appeared on John D. Cook.  ( 2 min )
  • Open

    Pianist AI> Artificial Intelligence composer: Level 6: Try 16 (Optimization in progress, this is what has been achieved up to now!)
    submitted by /u/amin_mlm [link] [comments]  ( 1 min )

  • Open

    Why are ‘tree based’ models discussed less in optimization/statistical ML contexts than general data science contexts? [D]
    None of the statistics or statistical ML classes I’ve taken have given significant lip service to tree based models, but in the in 1 DS boot camp I did, they were mentioned extensively. Is there something to this anecdote, or am I just extrapolating from to small a sample? Why might classically trained statisticians and ‘ ML research ‘ people be less interested in these models than data scientists? submitted by /u/Greenface1998 [link] [comments]  ( 2 min )
    Stylegan-nada generated filters using only javascript https://filter.mot.omg.lol [P]
    submitted by /u/FamiliarGift2237 [link] [comments]
    [D] First Author Interview - Deep Symbolic Regression for Recurrent Sequences (Paper Explained Video & Interview)
    https://youtu.be/1HEdXwEYrGM This video includes an interview with first author Stéphane d'Ascoli (https://sdascoli.github.io/). Deep neural networks are typically excellent at numeric regression, but using them for symbolic computation has largely been ignored so far. This paper uses transformers to do symbolic regression on integer and floating point number sequences, which means that given the start of a sequence of numbers, the model has to not only predict the correct continuation, but also predict the data generating formula behind the sequence. Through clever encoding of the input space and a well constructed training data generation process, this paper's model can learn and represent many of the sequences in the OEIS, the online encyclopedia of integer sequences and it also features an interactive demo if you want to try it by yourself. ​ OUTLINE: 0:00 - Introduction 2:20 - Summary of the Paper 16:10 - Start of Interview 17:15 - Why this research direction? 20:45 - Overview of the method 30:10 - Embedding space of input tokens 33:00 - Data generation process 42:40 - Why are transformers useful here? 46:40 - Beyond number sequences, where is this useful? 48:45 - Success cases and failure cases 58:10 - Experimental Results 1:06:30 - How did you overcome difficulties? 1:09:25 - Interactive demo ​ Paper: https://arxiv.org/abs/2201.04600 Interactive demo: http://recur-env.eba-rm3fchmn.us-east-2.elasticbeanstalk.com/ submitted by /u/ykilcher [link] [comments]  ( 1 min )
    [R][N] Camera Trap photo analysis at the Chinko Nature Reserve using Convolutional Neural Networks
    submitted by /u/Danny_Arends [link] [comments]
  • Open

    How Solid Pods May End Up Becoming the Building Blocks of the Metaverse
    Tim Berners-Lee has an interesting habit of coming up with ideas that seem hard to explain at the outset, remain all hard to understand even as they become more implemented and refined, can go for years with only a few die-hard fans becoming convinced that what he is doing is the best thing since sliced… Read More »How Solid Pods May End Up Becoming the Building Blocks of the Metaverse The post How Solid Pods May End Up Becoming the Building Blocks of the Metaverse appeared first on Data Science Central.  ( 12 min )
  • Open

    Artificial Nightmares : Forgive Me Tyrael || DD PYTTI VQGAN AI Art Video [4K 60 FPS]
    submitted by /u/Thenamessd [link] [comments]
    Can you think of a field AI won't impact?
    I was discussing with a friend how AI will change every single aspects of our lives and realized that I was not able to come up with even a single field that will not be impacted by AI. Do you have any idea? submitted by /u/Krellb [link] [comments]  ( 1 min )
    2020s. A short story by GPT-3
    (Prompt - Generated Text) Make up a short story/anecdote/poem or whatever in which you talk about the world of the 2020s - and whether there is still hope. 2020s The world was once a beautiful and vibrant place, but now it is just a shadow of its former self. People are lethargic and apathetic, and the only thing that seems to bring them joy is meaningless entertainment and distraction. The forests are empty and the animals have disappeared. The rivers and lakes are polluted and the air is poisonous. The only things that seem to thrive are disease and poverty. And it looks like everything is only going to get worse. People are so immersed in their own misery that they don't even realize it's getting worse. They are content to sit in their own filth and let the world die around them. The only thing that seems to make them happy is the false hope that things will get better. But they don't. And people are getting sadder and more lethargic. But there is still hope. There is still a chance for people to change things. They just need to open their eyes and see the world for what it is. They need to realize that they are the ones who can make a difference. And when they finally do, they will transform the world into a place of beauty and happiness. What is the conclusion of this story? The conclusion is that whether on the planet or in your very own mental world, you need to take action. Lethargy means stagnation and a creeping but certain death. submitted by /u/BeginningInfluence55 [link] [comments]  ( 1 min )
    What are Hypernetworks where AI Builds AI?
    submitted by /u/Beautiful-Credit-868 [link] [comments]
    Realistic AI Face Editing in Videos ! GAN-based face manipulations in videos: Stitch it in Time explained
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 1 min )
2022-02-28T00:43:52.008Z osmosfeed 1.14.4